{"id":464,"date":"2011-09-01T08:03:55","date_gmt":"2011-09-01T12:03:55","guid":{"rendered":"http:\/\/www.openbible.info\/blog\/?p=464"},"modified":"2016-01-31T20:52:16","modified_gmt":"2016-02-01T00:52:16","slug":"bible-annotation-modeling-and-querying-in-mysql-and-couchdb","status":"publish","type":"post","link":"https:\/\/www.openbible.info\/blog\/2011\/09\/bible-annotation-modeling-and-querying-in-mysql-and-couchdb\/","title":{"rendered":"Bible Annotation Modeling and Querying in MySQL and CouchDB"},"content":{"rendered":"<p>If you\u2019re storing people\u2019s Bible annotations (notes, bookmarks, highlights, etc.) digitally, you want to be able to retrieve them later. Let\u2019s look at some strategies for how to store and look up these annotations.<\/p>\n<h3>Know What You\u2019re Modeling<\/h3>\n<p>First you need to understand the shape of the data. I don\u2019t have access to a large repository of Bible annotations, but the Twitter and Facebook Bible citations from the <a href=\"http:\/\/www.openbible.info\/realtime\/\">Realtime Bible Search<\/a> section of this website provide a good approximation of how people cite the Bible. (Quite a few Facebook posts appear to involve people responding to their daily devotions.) These tweets and posts are public, and private annotations may take on a slightly different form, but the general shape of the data should be similar: nearly all (99%) refer to a chapter or less.<\/p>\n<p><a href=\"https:\/\/a.openbible.info\/blog\/2011-08-social-full.png\"><img loading=\"lazy\" src=\"https:\/\/a.openbible.info\/blog\/2011-08-social.png\" width=\"600\" height=\"439\" alt=\"Large dots at the bottom indicate many single-verse references. Chapter references are also fairly prominent. See below for more discussion.\" \/><\/a><\/p>\n<p><a href=\"http:\/\/www.biblegateway.com\/blog\/2011\/08\/how-people-share-the-bible-verses-vs-read-the-bible-chapters\/\">Compare Bible Gateway reading habits<\/a>, which are much heavier on chapter-level usage, but 98% of accesses still involve a chapter or less.<\/p>\n<h3>The Numbers<\/h3>\n<p>The data consists of about 35 million total references.<\/p>\n<table class=\"data\">\n<tr>\n<th class=\"number\">Percent of Total<\/th>\n<th>Description<\/th>\n<th>Example<\/th>\n<\/tr>\n<tr>\n<td class=\"number\">73.5<\/td>\n<td>Single verse<\/td>\n<td>John 3:16<\/td>\n<\/tr>\n<tr>\n<td class=\"number\">17.1<\/td>\n<td>Verse range in a single chapter<\/td>\n<td>John 3:16-17<\/td>\n<\/tr>\n<tr>\n<td class=\"number\">8.4<\/td>\n<td>Exactly one chapter<\/td>\n<td>John 3<\/td>\n<\/tr>\n<tr>\n<td class=\"number\">0.7<\/td>\n<td>Two or more chapters (at chapter boundaries)<\/td>\n<td>John 3-4<\/td>\n<\/tr>\n<tr>\n<td class=\"number\">0.1<\/td>\n<td>Verses spanning two chapters (not at chapter boundaries)<\/td>\n<td>John 3:16-4:2<\/td>\n<\/tr>\n<tr>\n<td class=\"number\">0.1<\/td>\n<td>Verses spanning three or more chapters (not at chapter boundaries)<\/td>\n<td>John 3:16-5:2<\/td>\n<\/tr>\n<\/table>\n<p>About 92.9% of posts or tweets cited only one verse or verse range; 7.1% mentioned more than one verse range. Of the latter, 77% cited exactly two verse ranges; the highest had 323 independent verse ranges. Of Facebook posts, 9.1% contained multiple verse ranges, compared to 4.2% of tweets. When there were multiple ranges, 43% of the time they referred to verses in different books from the other ranges; 39% referred to verses in the same book (but not in the same chapter); and 18% referred to verses in the same chapter. (This distribution is a unusual\u2014normally close verses stick together.)<\/p>\n<p>The data, oddly, doesn\u2019t contain any references that span multiple books. Less than 0.01% of passage accesses span multiple books on Bible Gateway, which is probably a useful upper bound for this type of data.<\/p>\n<h4>Key Points<\/h4>\n<ol>\n<li>Nearly all citations involve verses in the same chapter; only 1% involve verses in multiple chapters.<\/li>\n<li>Of the 1% spanning two or more chapters, most refer to exact chapter boundaries.<\/li>\n<li>Multiple-book references are even more unusual (under 0.01%) but have outsize effects: an annotation that references Genesis 1 to Revelation 22 would be relevant for every verse in the Bible.<\/li>\n<li>Around 7% of notes contained multiple independent ranges of verses\u2014the more text you allow for an annotation, the more likely someone is to mention multiple verses.<\/li>\n<\/ol>\n<h4>Download<\/h4>\n<p><a href=\"https:\/\/a.openbible.info\/blog\/2011-08-social-lengths.zip\">Download the raw social data<\/a> (1.4 MB zip) under the usual CC-Attribution license.<\/p>\n<h3>Data Modeling<\/h3>\n<p>A Bible annotation consists of arbitrary content (a highlight might have one kind of content, while a proper note might have a title, body, attachments, etc., but modeling the content itself isn&#8217;t the point of this piece) tied to one or more Bible references:<\/p>\n<ol>\n<li>A single verse (John 3:16).<\/li>\n<li>A single range (John 3:16-17).<\/li>\n<li>Multiple verses or ranges (John 3:16, John 3:18-19)<\/li>\n<\/ol>\n<h3>The Relational Model<\/h3>\n<p>One user can have many rows of annotations, and one annotation can have many rows of verses that it refers to. To model a Bible annotation relationally, we set up three tables that look something like this:<\/p>\n<h4>users<\/h4>\n<table class=\"data\">\n<tr>\n<th>user_id<\/th>\n<th>name<\/th>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>\u2026<\/td>\n<\/tr>\n<\/table>\n<h4>annotations<\/h4>\n<table class=\"data\">\n<tr>\n<th>user_id<\/th>\n<th>annotation_id<\/th>\n<th>content<\/th>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>101<\/td>\n<td>\u2026<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>102<\/td>\n<td>\u2026<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>103<\/td>\n<td>\u2026<\/td>\n<\/tr>\n<\/table>\n<h4>annotation_verses<\/h4>\n<p> The verse references here are integers to allow for easy range searches: 43 = John (the 43rd book in the typical Protestant Bible); 003 = the third chapter; the last three digits = the verse number.<\/p>\n<p>I like using this approach over others (sequential integer or separate columns for book, chapter, and verse) because it limits the need for a lookup table. (You just need to know that 43 = John, and then you can find any verse or range of verses in that book.) It also lets you find all the annotations for a particular chapter without having to know how many verses are in the chapter. (The longest chapter in the Bible has 176 verses, so you know that all the verses in John 3, for example, fall between 43003001 and 43003176.) This main disadvantage is that you don\u2019t necessarily know how many verses you\u2019re selecting until after you\u2019ve selected them. And using individual columns, unlike here, does allow you to run <code>group by<\/code> queries to get easy counts.<\/p>\n<table class=\"data\">\n<tr>\n<th>annotation_id<\/th>\n<th>start_verse<\/th>\n<th>end_verse<\/th>\n<\/tr>\n<tr>\n<td>101<\/td>\n<td>43003016<\/td>\n<td>43003016<\/td>\n<\/tr>\n<tr>\n<td>102<\/td>\n<td>43003016<\/td>\n<td>43003017<\/td>\n<\/tr>\n<tr>\n<td>103<\/td>\n<td>43003016<\/td>\n<td>43003016<\/td>\n<\/tr>\n<tr>\n<td>103<\/td>\n<td>43003019<\/td>\n<td>43003020<\/td>\n<\/tr>\n<\/table>\n<h3>Querying<\/h3>\n<p>In a Bible application, the usual mode of accessing annotations is by passage: if you\u2019re looking at John 3:16-18, you want to see all your annotations that apply to that passage.<\/p>\n<h3>Querying MySQL<\/h3>\n<p>In SQL terms:<\/p>\n<p><code>select distinct(annotations.annotation_id)<br \/>\nfrom annotations, annotation_verses<br \/>\nwhere annotation_verses.start_verse &lt;= 43003018 and<br \/>\nannotation_verses.end_verse &gt;= 43003016 and<br \/>\nannotations.user_id = 1 and<br \/>\nannotations.annotation_id = annotation_verses.annotation_id<br \/>\norder by annotation_verses.start_verse asc, annotation_verses.end_verse desc<\/code><\/p>\n<p>The quirkiest part of the SQL is the first part of the \u201cwhere\u201d clause, which at first glance looks backward: why is the last verse in the <code>start_verse<\/code> field and the first verse in the <code>end_verse<\/code> field? Because the <code>start_verse<\/code> and <code>end_verse<\/code> can span any range of verses, you need to make sure that you get any range that overlaps the verses you\u2019re looking for: in other words, the <code>start_verse<\/code> is before the end of the range, and the <code>end_verse<\/code> is after the start.<\/p>\n<p>Visually, you can think of each <code>start_verse<\/code> and <code>end_verse<\/code> pair as a line: if the line overlaps the shaded area you\u2019re looking for, then it\u2019s a relevant annotation. If not, it\u2019s not relevant. There are six cases:<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/a.openbible.info\/blog\/2011-08-before-after.png\" width=\"516\" height=\"277\" alt=\"Start before, end before: John 3:15 \/ Start before, end inside: John 3:15-17 \/ Start before, end after: John 3:15-19 \/ Start inside, end inside: John 3:16-18 \/ Start inside, end after: John 3:17-19 \/ Start after, end after: John 3:19\" \/><\/p>\n<p>The other trick in the SQL is the sort order: you generally want to see annotations in canonical order, starting with the longest range first. In other words, you start with an annotation about John 3, then to a section inside John 3, then to individual verses. In this way, you move from the broadest annotations to the narrowest annotations. You may want to switch up this order, but it makes a good default.<\/p>\n<p>The relational approach works pretty well. If you worry about the performance implications of the SQL join, you can always put the <code>user_id<\/code> in <code>annotation_verses<\/code> or use a view or something.<\/p>\n<h3>Querying CouchDB<\/h3>\n<p><a href=\"http:\/\/couchdb.apache.org\/\">CouchDB<\/a> is one of the oldest entrants in the NoSQL space and distinguishes itself by being both a key-value store and queryable using map-reduce:  the usual way to access more than one document in a single query is to write Javascript to output the data you want. It lets you create complex keys to query by, so you might think that you can generate a key like <code>[start_verse,end_verse]<\/code> and query it like this: <code>?startkey=[0,43003016]&amp;endkey=[43003018,99999999]<\/code><\/p>\n<p>But no. Views are one-dimensional, meaning that CouchDB doesn\u2019t even look at the second element in the key if the first one matches the query. For example, an annotation with both a start and end verse of <code>19001001<\/code> matches the above query, which isn\u2019t useful for this purpose.<\/p>\n<p>I can think of two ways to get around this limitation, both of which have drawbacks.<\/p>\n<h4>GeoCouch<\/h4>\n<p>CouchDB has a plugin called GeoCouch that lets you query geographic data, which actually maps well to this data model. (I didn\u2019t come up with this approach on my own: see <a href=\"http:\/\/www.diretto.org\/2010\/08\/efficient-time-based-range-queries-in-couchdb-using-geocouch\/\">Efficient Time-based Range Queries in CouchDB using GeoCouch<\/a> for the background.)<\/p>\n<p>The basic idea is to treat each <code>start_verse,end_verse<\/code> pair as a point on a two-dimensional grid. Here\u2019s the above social data plotted this way:<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/a.openbible.info\/blog\/2011-08-social-grid.png\" width=\"600\" height=\"551\" alt=\"A diagonal line starts in the bottom left corner and continues to the top right. Large dots indicate popular verses, and book outlines are visible.\" \/><\/p>\n<p>The line bisects the grid diagonally since an <code>end_verse<\/code> never precedes a <code>start_verse<\/code>: the diagonal line where <code>start_verse = end_verse<\/code> indicates the lower bound of any reference. Here are some points indicating where ranges fall on the plot:<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/a.openbible.info\/blog\/2011-08-social-grid-points.png\" width=\"600\" height=\"554\" alt=\"This chart looks the same as the previous one but has points marked to illustrate that longer ranges are farther away from the bisecting line.\" \/><\/p>\n<p>To find all the annotations relevant to John 3:16-18, we draw a region starting in the upper left and continuing to the point <code>43003018,43003016<\/code>:<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/a.openbible.info\/blog\/2011-08-social-grid-bbox.png\" width=\"600\" height=\"551\" alt=\"This chart looks the same as the previous one but has a box from the top left ending just above and past the beginning of John near the upper right of the chart.\" \/><\/p>\n<p>GeoCouch allows exactly this kind of bounding-box query: <code>?bbox=0,43003016,43003018,99999999<\/code><\/p>\n<p>You can even support multiple users in this scheme: just give everyone their own, independent box. I might occupy 1&#215;1 (with an annotation at <code>1.43003016,1.43003016<\/code>), while you might occupy 2&#215;2 (with an annotation at <code>2.43003016,2.43003016<\/code>); queries for our annotations would never overlap. Each whole number to the left of the decimal acts as a namespace.<\/p>\n<p>The drawbacks:<\/p>\n<ol>\n<li>The results aren\u2019t sorted in a useful way. You\u2019ll need to do sorting on the client side or in a <a href=\"http:\/\/guide.couchdb.org\/editions\/1\/en\/show.html\">show function<\/a>.<\/li>\n<li>You don\u2019t get pagination.<\/li>\n<\/ol>\n<h4>Repetition at Intervals<\/h4>\n<p>Given the shape of the data, which is overwhelmingly chapter-bound (and lookups, which at least on Bible Gateway are chapter-based), you could simply repeat chapter-spanning annotations at the beginning of every chapter. In the worst case annotation (Genesis 1-Revelation 22), you end up with about 1200 repetitions.<\/p>\n<p>For example, in the Genesis-Revelation case, for John 3 you might create a key like <code>[43000000.01001001,66022021]<\/code> so that it sorts at the beginning of the chapter\u2014and if you have multiple annotations with different start verses, they stay sorted properly.<\/p>\n<p>To get annotations for John 3:16-18, you\u2019d query for <code>?startkey=[43003000]&amp;endkey=[43003018,{}]<\/code><\/p>\n<p>The drawbacks:<\/p>\n<ol>\n<li>You have to filter out all the irrelevant annotations: if you have a lot of annotations about John 3:14, you have to skip through them all before you get to the ones about John 3:16.<\/li>\n<li>You have to filter out duplicates when the range you\u2019re querying for spans multiple chapters.<\/li>\n<li>You\u2019re repeating yourself, though given how rarely a multi-chapter span (let alone a multi-book span) happens in the wild, it might not matter that much.<\/li>\n<\/ol>\n<h4>Other CouchDB Approaches<\/h4>\n<p>Both these approaches assume that you want to make only one query to retrieve the data. If you\u2019re willing to make multiple queries, you could create different list functions and query them in parallel: for example, you could have one for single-chapter annotations and one for multi-chapter annotations. See <a href=\"http:\/\/en.wikipedia.org\/wiki\/Interval_tree\">interval trees<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Geohash\">geohashes<\/a> for additional ideas. You could also introduce a separate query layer, such as <a href=\"http:\/\/www.elasticsearch.org\/\">elasticsearch<\/a>, to sit <a href=\"http:\/\/www.elasticsearch.org\/guide\/reference\/river\/couchdb.html\">on top of CouchDB<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019re storing people\u2019s Bible annotations (notes, bookmarks, highlights, etc.) digitally, you want to be able to retrieve them later. Let\u2019s look at some strategies for how to store and look up these annotations. Know What You\u2019re Modeling First you need to understand the shape of the data. I don\u2019t have access to a large [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[18,12,9],"tags":[],"_links":{"self":[{"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/posts\/464"}],"collection":[{"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/comments?post=464"}],"version-history":[{"count":54,"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/posts\/464\/revisions"}],"predecessor-version":[{"id":1134,"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/posts\/464\/revisions\/1134"}],"wp:attachment":[{"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/media?parent=464"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/categories?post=464"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.openbible.info\/blog\/wp-json\/wp\/v2\/tags?post=464"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}