July « 2007 « OpenBible.info Blog

Archive for July, 2007

Bethlehemites Live in Bethlehem

Saturday, July 14th, 2007

And Kenizzites (singular Kenizzite) are the descendants of Kenaz.

I wasn’t able to find a list that lines up the noun (Assyria) and adjective (Assyrian) forms of biblical people and places, so I made one. Download the complete list in tab-delimited format.

There are about 270 lines covering about 340 distinct word forms. The list covers all the adjective (singular) and demonym (gentilic or plural) forms that appear in the ESV Bible.

The fourth column in the file indicates whether the noun is a person, place, or something else. The category wasn’t always clear-cut, given that many place names started as people names. There wasn’t always a noun equivalent (Pharisee, for example, lacks one).

This file should assist computer programs, allowing them to map variants back onto their noun bases without having to jump through linguistic hoops.
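As a rough illustration of that mapping, here is a minimal Perl sketch. The column order (noun, singular form, plural form, category) and the sample rows are assumptions made for the example; only the meaning of the fourth column is described above, so check the actual file before relying on this layout.

```perl
use strict;
use warnings;

# Sample rows in an ASSUMED layout: noun, singular, plural, category.
# The real file's first three columns may be ordered differently.
my @lines = (
    "Assyria\tAssyrian\tAssyrians\tplace",
    "Kenaz\tKenizzite\tKenizzites\tperson",
    "Bethlehem\tBethlehemite\tBethlehemites\tplace",
);

# Build a lookup from every variant form back to its noun base.
my %base_for;
foreach my $line (@lines) {
    my ($noun, $singular, $plural, $category) = split /\t/, $line;
    foreach my $form ($singular, $plural) {
        next unless defined $form && length $form;
        $base_for{$form} = { noun => $noun, category => $category };
    }
}

print "$base_for{Kenizzites}{noun}\n";   # prints "Kenaz"
```

A program that meets "Kenizzites" in a text can then look up the base noun directly instead of trying to strip suffixes.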

Posted in Bible | 2 Comments »

Topical Bible Technical Notes

Monday, July 2nd, 2007

As promised, here’s how the new Topical Bible works.

The Goal

The goal was to create a new topical Bible (TB) that takes advantage of the vast array of data on the Internet, warts and all.

Seeding the Topics

I wanted the TB to reflect what people really want to know, not what a human editor thinks people want to (or should) know. (I have no problem with human editors, but the point of the TB is to see what the results are when you forgo them.) That meant populating the TB with actual queries. But where to get them? The options:

Start with a few topics hand-selected by me based on research and intuition. The problem is that the number of topics would be small (probably fewer than 100). That number would (hopefully) increase over time, but it would grow erratically and might not attract enough people to the TB to make it useful. Further, the wording of the topics would reflect my biases.
Start with the topics from a public-domain Nave’s Topical Bible. The main problem is that a lot of the “topics” are obscure, often just names of biblical people. Further, English has changed in the 100 years since Nave published his TB; how many people search for Bible verses about abstemiousness? So I’d either have to clean up the topics or live with a lot of irrelevant topics.
Ask for permission to license a topic list from the publisher of a newer TB or from another TB website. They might be willing to share their list. But again, the problem is that the topics wouldn’t reflect what people really want to know about.
Use a search engine API to generate the topics.

In the end, I combined the Yahoo Related Suggestion API and Firefox’s auto-complete feature (completing the phrase “What does the Bible say about…”) to create a list of about 4,000 topics.

Some of the resulting topics had typos, so I ran them all through the Yahoo Spelling Suggestion API, which turned up 176 misspelled topics (and a few misses—no, I really did mean “caring for widows,” not “caring for windows”).

Getting Related Verses

The next step was to get the verses. I used the Yahoo Web Search API to get the top thirty webpages related to each topic and then extracted the verse references from each page.

Daniel Foster from Logos rightly points out that extracting Bible verses from webpages is “fraught with perils.” Thankfully, it doesn’t have to be perfect; it just has to be good enough.

The ESV folks have published the Bible-book abbreviation list they use on their site. I started with the abbreviations in that file, then built up regular expressions to find only fairly definite Bible references. (For example, references to some person named Matt shouldn’t match a Bible reference.)

I stripped the HTML from the webpage and did some more normalizing, then went through each of the abbreviations (the actual code is a little more complex, but you get the idea):

my $ref = '\d{1,3}(?:[.:]\d{1,3})?(?:\s?[\&\-]\s?\d{1,3}(?:[.:]\d{1,3})?)?[ab]?'; #chapter/verse references
my $verse = "$ref(?:\\s?[,;]\\s?(?:$ref))*"; #multiple verse references
foreach my $abbrev (@abbrevs) {
    my $regex = "\\b$abbrev\\.? ?$verse\\b"; #go through each abbreviation
    while (/$regex/) {…}
}
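To see those patterns at work, here is a small self-contained sketch that runs them over one sentence. The abbreviation list and the sample text are made up for the example, and capturing each match with a `/g` loop is my simplification, not necessarily what the production code does.

```perl
use strict;
use warnings;

# Same patterns as above: one chapter/verse reference, then a list of them.
my $ref   = '\d{1,3}(?:[.:]\d{1,3})?(?:\s?[\&\-]\s?\d{1,3}(?:[.:]\d{1,3})?)?[ab]?';
my $verse = "$ref(?:\\s?[,;]\\s?(?:$ref))*";

# HYPOTHETICAL abbreviation list; the real one comes from the ESV file.
my @abbrevs = ('John', 'Jn', 'Gal', 'Matt');

my $text = 'Matt said to read John 3:16 and Gal. 5:13-14; see also Jn 1:1, 14.';

my @found;
foreach my $abbrev (@abbrevs) {
    my $regex = "\\b$abbrev\\.? ?$verse\\b";
    # /g advances through the text so each match is visited once.
    while ($text =~ /($regex)/g) {
        push @found, $1;
    }
}

print "$_\n" for @found;
```

The bare name "Matt" is skipped because no chapter number follows it, while "Jn 1:1, 14" is picked up as a single reference with two parts.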

Next I figured out exactly which verses each reference refers to. For example, the string “Genesis 1,2” really means “Genesis 1:1-31 and Genesis 2:1-25.” The ESV API’s getQueryInfo method figures out everything for me. Why should I write a bunch of reference-parsing code when someone’s done all the hard work?

So I did an ESV API query for each reference, caching the results so identical references in the future don’t require an API lookup.
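The caching idea can be sketched in a few lines. The `fetch_from_api` function below is a hypothetical stand-in that returns canned data; it is not the ESV API's actual interface, and an in-memory hash is only one of several ways the cache could be stored.

```perl
use strict;
use warnings;

my %cache;        # reference string => list of verses it expands to
my $lookups = 0;  # counts how often the stand-in API is actually called

# HYPOTHETICAL stand-in for the remote query; a real version would call
# the ESV API here instead of returning canned data.
sub fetch_from_api {
    my ($reference) = @_;
    $lookups++;
    return ["$reference (expanded)"];
}

# Return the cached expansion, fetching it only on the first request.
sub expand_reference {
    my ($reference) = @_;
    $cache{$reference} //= fetch_from_api($reference);
    return $cache{$reference};
}

expand_reference('Genesis 1,2') for 1 .. 3;
print "API lookups: $lookups\n";   # prints "API lookups: 1"
```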

Collating the Verses

Once I retrieved all the verses for a topic, it was just a matter of looking for patterns among all the webpages. The algorithm was pretty simple: each page got one vote per unique verse—so two references to John 1:1 on the same page would only count as one vote. All verses that appeared on two or more webpages made it into the main TB index.
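A minimal sketch of that voting rule, using made-up page names and references (how the site actually stores the extracted references isn't shown here, so the hash layout is an assumption):

```perl
use strict;
use warnings;

# Made-up input: the verse references extracted from each webpage.
my %verses_on_page = (
    'page-a' => ['John 1:1', 'John 1:1', 'John 3:16'],
    'page-b' => ['John 1:1', 'Romans 8:28'],
    'page-c' => ['John 3:16'],
);

my %votes;
foreach my $page (keys %verses_on_page) {
    # Deduplicate within the page so repeats count only once.
    my %seen;
    my @unique = grep { !$seen{$_}++ } @{ $verses_on_page{$page} };
    $votes{$_}++ for @unique;
}

# Keep verses that appeared on two or more pages.
my @index = sort { $a cmp $b } grep { $votes{$_} >= 2 } keys %votes;
print "$_: $votes{$_}\n" for @index;
```

Here John 1:1 gets two votes even though it appears three times, because two of the mentions are on the same page. Romans 8:28 appears on only one page and is left out of the index.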

Sweetening the Relevance

In the end, I was able to use some of the topics from Nave’s work. About 750 of the topics occurred in both the new TB and in Nave’s; every verse for each topic in Nave’s got an extra three votes in the new TB. So, for example, 1 Corinthians 15:45 originally had eight votes under the topic of Adam. The mention in Nave’s added another three, bringing the starting total to eleven votes.
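The arithmetic can be sketched as follows. The eight votes for 1 Corinthians 15:45 come from the example above; the other verses and counts are made up, and starting a Nave's-only verse from zero is my assumption about how "an extra three votes" applies to verses the web pages never mentioned.

```perl
use strict;
use warnings;

my $NAVE_BONUS = 3;   # extra votes for a verse Nave's lists under the topic

# Web-derived votes for the topic "Adam" (the second entry is made up).
my %votes = (
    '1 Corinthians 15:45' => 8,
    'Genesis 2:7'         => 5,
);

# HYPOTHETICAL stand-in for the verses Nave's lists under the same topic.
my @nave_verses = ('1 Corinthians 15:45', 'Genesis 5:1');

# Add the bonus; a verse found only in Nave's starts from zero (assumed).
$votes{$_} = ($votes{$_} // 0) + $NAVE_BONUS for @nave_verses;

print "$votes{'1 Corinthians 15:45'}\n";   # prints "11"
```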

Displaying the Bible Verses

I use the ESV API to display the Bible verses, with heavy caching. Only five verses appear when the reference is to an extended passage, though the link points to the complete passage on the ESV site.

A recent addition to the UI is a “related topics” link: a reverse TB that lets you see all the topics for a given verse in tag-cloud format. Entering a verse reference into the search box takes you to the same page. For example, here are all the topics for Galatians 5:13.
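Conceptually, the reverse TB is just the topical index turned inside out. Here is a minimal sketch, assuming the index is held as a hash of topic to verse to vote count; the topics, verses, and counts are made up, and the site's real storage may look quite different.

```perl
use strict;
use warnings;

# Made-up slice of the topical index: topic => { verse => votes }.
my %index = (
    'freedom' => { 'Galatians 5:13' => 9, 'John 8:36' => 7 },
    'serving' => { 'Galatians 5:13' => 4 },
    'love'    => { 'Galatians 5:13' => 6, '1 John 4:8' => 12 },
);

# Invert it: verse => { topic => votes }.
my %topics_for;
foreach my $topic (keys %index) {
    foreach my $verse (keys %{ $index{$topic} }) {
        $topics_for{$verse}{$topic} = $index{$topic}{$verse};
    }
}

# List a verse's topics, most-voted first (the weight a tag cloud could use).
my $verse  = 'Galatians 5:13';
my $cloud  = $topics_for{$verse};
my @topics = sort { $cloud->{$b} <=> $cloud->{$a} } keys %$cloud;
print join(', ', @topics), "\n";   # prints "freedom, love, serving"
```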

Ongoing Topics

The site follows a simplified version of the above procedure for new topics: anytime someone searches for a topic that doesn’t exist yet, the site goes out and finds relevant pages, parses them, and displays the results.

I originally wrote a nice multithreaded Perl script to fetch new topics quickly, but Dreamhost (my hosting company) kills it when it runs, so now everything happens serially—thus slowly.

Weaknesses

It works pretty well for popular topics. My favorite user-created topic is Christian Hedonism, a phrase popularized by John Piper; the TB did particularly well for this topic.

It doesn’t work so well for obscure topics or topics about which the Bible doesn’t really have anything directly relevant. More sophisticated algorithms might be able to correct this deficiency to an extent, but I’m not sure what form those algorithms would take.

Daniel from Logos rightly points out that the TB should really be “What people say the Bible says about….” I find some of the verses chosen for topics personally offensive. Do you really want to tell people with eating disorders that the weak person eats only vegetables, and not to judge them if they abstain from eating?

The main danger is in taking verses out of context; I’m afraid that someone will read a verse and act on it without considering what the verse means in context.

Results

In the three weeks since launch, people have voted 3,000 verses up or down, suggested 200 new verses, and created 500 new topics.

Why?

The recent book Everything Is Miscellaneous inspired the TB. The book has a number of implications for Bible study; I may blog about it in the future. I wanted to apply some of the book’s ideas to the Bible, and the TB seemed like an easy way to do it.

In all, it took about three weekends of part-time work to create, somewhat less than the fourteen years Orville Nave spent on his famous TB. Collecting the data was the easy part; coding the frontend took several days.

The Future

Daniel from Logos has a few suggestions about letting people tie together topics instead of creating new ones for each query. I can (and do) edit the database to reduce some of the redundancy. I’ve given some thought to the UI and backend for such a system and how to prevent people from manipulating it, but an implementation remains in the future. I’d also like to spell-check new topics and let people correct the spelling instead of automatically creating the topics.

It would be helpful for people to be able to tag verses with topics when looking at a verse’s tag cloud. The main issue is how to flow the topics back into the topical index.

I’m also not entirely satisfied with how you have to know the exact verse reference to suggest a verse for a topic. Ideally, you’d also be able to enter a few keywords and get back relevant verses.

Overall, however, the TB has performed well so far.

Posted in Topics | Comments Off on Topical Bible Technical Notes
