Quantifying Traditional vs. Contemporary Language in English Bibles Using Google NGram Data

Using data from Google’s new ngram corpus, here’s how English Bible translations compare in their use of traditional vs. contemporary vocabulary:

Relative Traditional vs. Contemporary Language in English Bible Translations
* Partial Bible (New Testament except for The Voice, which only has the Gospel of John). The colors represent somewhat arbitrary groups.

Here’s similar data with the most recent publication year (since 1970) as the x-axis:

Relative Traditional vs. Contemporary Language in English Bible Translations by Publication Year

Discussion

The result accords well with my expectations of translations. It generally follows the “word for word/thought for thought” continuum often used to categorize translations, suggesting that word-for-word, formally equivalent translations tend toward traditional language, while thought-for-thought, dynamic-equivalent translations sometimes find replacements for traditional words. For reference, here’s how Bible publisher Zondervan categorizes translations along that continuum:

A word-for-word to thought-for-thought continuum lists about twenty English translations, from an interlinear to The Message.

I’m not sure what to make of the curious NLT grouping in the first chart above: its five translations sit closer to one another than the translations in any other group. In particular, I’d expect the new Common English Bible to be more contemporary. Perhaps it will become so once the Old Testament is available and it’s more comparable to other translations.

In the chart with publication years, notice how no one tries to occupy the same space as the NIV for twenty years until the HCSB comes along.

The World English Bible appears where it does largely because it uses “Yahweh” instead of “LORD.” If you ignore that word, the WEB shows up between the Amplified and the NASB. (The word Yahweh has become more popular recently.) Similarly, the New Jerusalem Bible would appear between the HCSB and the NET for the same reason.

The more contemporary versions often use contractions (e.g., you’ll), which pulls their score considerably toward the contemporary side.

Religious words (“God,” “Jesus”) pull translations to the traditional side, since a greater percentage of books in the past dealt with religious subjects. A religious text such as the Bible therefore naturally tends toward older language.

If you’re looking for translations largely free from copyright restrictions, most of the KJV-grouped translations are public domain. The Lexham English Bible and the World English Bible are available in the ESV/NASB group. The NET Bible is available in the NIV group. Interestingly, all the more contemporary-style translations are under standard copyright; I don’t know of a project to produce an open thought-for-thought translation, maybe because there’s more room for disagreement in such a project?

Not included in the above chart is the LOLCat Bible, a non-academic attempt to translate the Bible into LOLspeak. If charted, it appears well to the contemporary side of The Message.

Methodology

I downloaded the English 1-gram corpus from Google, normalized the words (stripping combining characters and making them case insensitive), and inserted the five million or so unique words into a database table. I combined individual years into decades to lower the row count. Next, I ran a percentage-wise comparison (similar to what Google’s ngram viewer does) for each word to determine when it was most popular.
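As a rough illustration of the steps above (not the code actually used for the charts), here is a minimal Python sketch. It assumes the corpus has already been parsed into hypothetical `(word, year, count)` tuples, and it computes each decade’s total from those same rows rather than from Google’s separate totals file. The function names are made up for this example.

```python
import unicodedata
from collections import defaultdict


def normalize(word):
    """Strip combining characters (accents) and fold case."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def decade_shares(rows):
    """Return {word: {decade: share}} from (word, year, count) rows.

    share = the word's count in a decade / all word counts in that decade.
    Assumption: decade totals come from the rows passed in.
    """
    word_decade = defaultdict(lambda: defaultdict(int))
    decade_total = defaultdict(int)
    for word, year, count in rows:
        decade = year // 10 * 10  # e.g. 1805 and 1809 both map to 1800
        word_decade[normalize(word)][decade] += count
        decade_total[decade] += count
    return {
        word: {d: c / decade_total[d] for d, c in by_decade.items()}
        for word, by_decade in word_decade.items()
    }


def peak_decade(shares_for_word):
    """Decade in which the word made up the largest share of all words."""
    return max(shares_for_word, key=shares_for_word.get)
```

Comparing shares rather than raw counts matters because far more books were printed in recent decades; a raw count would make nearly every word look most popular today.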

Then, I created word counts for a variety of translations, dropped stopwords, and multiplied the counts by the above ngram percentages to arrive at a median year for each translation.
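The exact arithmetic isn’t spelled out above, so the following Python sketch is only one plausible reading of it. It assumes a hypothetical `shares` table of the form `{word: {decade: share}}`, uses a tiny illustrative stopword list, and computes a weighted mean year rather than a true median. Words missing from the table are simply skipped.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "not"}


def word_counts(text):
    """Count lowercase words in a text, dropping stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def estimated_year(counts, shares):
    """Weighted mean decade for a text.

    Each word spreads its occurrences across decades in proportion to its
    share in each decade, so a word that peaked in the 1800s pulls the
    result earlier. Returns None if no word is found in `shares`.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for word, n in counts.items():
        by_decade = shares.get(word)
        if not by_decade:
            continue  # word is not in the ngram table
        norm = sum(by_decade.values())
        if not norm:
            continue  # avoid dividing by zero for all-zero shares
        for decade, share in by_decade.items():
            weight = n * share / norm
            weighted_sum += weight * decade
            total_weight += weight
    return weighted_sum / total_weight if total_weight else None
```

Normalizing each word’s shares to sum to one keeps very common words from dominating purely because their shares are larger; only how often the translation uses a word, and when that word was relatively popular, affect the result.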

The year scale (x-axis on the first chart, y-axis on the second) runs from 1838 to 1878, largely, as mentioned before, because Bibles use religious language. Even the LOLCat Bible dates to 1921 because it uses words (e.g., “ceiling cat”) that don’t particularly tie it to the present.

Caveats

The data doesn’t present a complete picture of a translation’s suitability for a particular audience or overall readability. For example, it doesn’t take into account word order (“fear not” vs. “do not fear”). (I wanted to use Google’s two- or three-gram data to see what differences they make, but as of this writing, Google hasn’t finished uploading them.)
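To make the word-order limitation concrete, here is a small Python sketch (the `ngrams` helper and the one-word stopword set are made up for this example). Once the stopword “do” is dropped, the two phrasings contain exactly the same single words, and only their two-word sequences tell them apart.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-word sequences (as tuples) in a word list."""
    return list(zip(*(words[i:] for i in range(n))))


older = "fear not".split()
newer = "do not fear".split()

# With "do" treated as a stopword, the 1-gram counts are identical.
stopwords = {"do"}
older_unigrams = Counter(w for w in older if w not in stopwords)
newer_unigrams = Counter(w for w in newer if w not in stopwords)

# The 2-grams differ: ("fear", "not") versus ("do", "not"), ("not", "fear").
older_bigrams = ngrams(older, 2)
newer_bigrams = ngrams(newer, 2)
```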

I work for Zondervan, which publishes the NIV family of Bibles, but the work here is my own and I don’t speak for them.

This entry was posted on Monday, December 27th, 2010 at 12:11 pm and is filed under Bible, Linguistics, Visualizations.

3 Responses to “Quantifying Traditional vs. Contemporary Language in English Bibles Using Google NGram Data”

Russell says:

January 14, 2011 at 9:23 am

That’s fascinating, thanks. It’s interesting how this reflects the general views of how traditional/modern those translations are, though with some surprises: I’m intrigued to see the NEB comes out as marginally more traditional than the NRSV, though there’s not much in it.

The issue of the freely licensed translations being more traditional could be very practical – translations like the WEB can start from a traditional public domain text and update it. Starting from scratch to create something like the CEV is a much bigger task. I’m curious how my own Open English Bible would fare as it attempts to be a more contemporary freely licensed translation.

openbible says:

January 15, 2011 at 9:46 am

The OEB currently shows up between the NKJV and the LEB using this methodology. You can also calculate the average year of text yourself and get a report on the most significant words.

Russell says:

January 15, 2011 at 11:27 pm

Wow, thanks for doing that. That’s interesting.
