Thursday, April 24, 2008

SimpleDB Full Text Search, or How to Create a Search Engine in 24 Hours

SimpleDB Primer

I've started digging into Amazon SimpleDB, which is one of these newfangled "Hashtable in the cloud" systems that allows you to scale from 1MB to 100GB datasets with relative ease. It has a rather limited data model though, here are some of the highlights of things you can bump your head against:
  • The main paradigm is you store items in domains. As of right now, domains are limited to 10GB and you are limited to 100 domains.
  • Items contain one or more attribute-value pairs. Each attribute can have multiple values, though you are limited to 256 attribute-value pairs per item. Furthermore, a given value for an attribute can be at most 1024 bytes. Within a domain, the rumor is you can store at most 250 million attribute-value pairs, across all items. Ie, if the database were a spreadsheet, you can have at most 250 million cells filled in.
  • Queries let you perform range & equality operators (including 'starts-with') comparing attribute values to constant values. For example ['name' = 'greg'] will match all items with the name attribute equal to the string 'greg'. Queries can have multiple conditions on a given attribute, and can union and intersect sets of results across different attributes. Evaluation is from left to right, though, so you want to put your filter conditions that remove the most items first; you cannot induce precedence through parentheses.
  • Queries return paged result sets of item names. Item names can then be pulled in parallel from SimpleDB to get the full set of attribute-value pairs for each item.
  • All values are compared lexicographically, so you need to pad integer values with zeroes, etc.
  • SimpleDB grants you eventual consistency. This means you can insert an item and not see it for a few seconds. We'll return to this pleasant feature below.
  • You can't sort. Seriously.
Inverted Index on SimpleDB

I'm working on a new web application, and one of its centerpieces is going to be quick, painless text search through reams of data. I have spent the last day or so taking a stab at getting a inverted index working on SimpleDB, with some success. Here's what I want to be able to do at the end of the day:

  • Query for specific terms, with AND and OR boolean conditions. (So, 'this OR that' will match items with either 'this' or 'that' in the index, etc.) AND is implicit between terms.
  • Support prefix matching, so 'app*' will match anything that has a term that beings with "app"
  • Support exact phrase matching with quotes, so '"merry go round"' (with quotes) will only match those terms in that exact order.
  • Support fields, so I can index a document (similar to Lucene) with multiple fields like "content" and "tags" and search within fields. So 'tags:oranges OR tags:apples' will search for items with the field "tags" containing the term oranges or the term apples.
  • Support 'sharding' - I may have additional attributes I want to include in the index to help limit the search before searching for a term. For example, documents may belong to a folder, and I might want to perform a text search *within* a folder.
Terms & Analyzer

To avoid re-inventing the wheel, I use Ferret's DefaultAnalyzer to tokenize my input and filter my terms. I won't dig into it here, go pick up a book on Lucene if you want to learn how to do this.

Index Approaches Considered

Disclaimer: I am a relative newb when it comes to this stuff. I am posting this because I had yet to see anyone discuss this in detail, I am sure a search guru will pwn me, do it in the comments!

The overall goal is to build an inverted index of terms that will let us quickly look up which documents contain which terms. This isn't revolutionary, but the tricky part is coming up with an index format that fits into SimpleDB and gets me all my search features. Here's a few approaches I had tried or thought about initially:

Item per Document /w Splitting

So, we create an item per document. For each term in the document, we add an attribute-value pair with the attribute=term and the value=empty string. When we run out of attribute-value pairs to use for an item (remember, we can only have 256), split it into another item. ("Paging") So, a document may span multiple items, and each item has a bunch of blank values under each of its attributes, which are named after its terms.

To find an item with a set of terms, for example, 'apple', just do a query ['apple' starts-with ''], intersected for multiple term matches. On its surface this approach seems to work but it has some problems.

First, we cannot do prefix matching, since the 'starts-with' operator only applies to the attribute values, not the attribute names. We *could* inject attributes for every substring of each term (ie, "hello" would show up as "h_", "he_", "hel_", up to "hello") but we would run out of tuple space quickly.

To facilitate exact phrase matching, we can put a certain number of N following terms into the index at each attribute, instead of the empty string. This will allow us to match exact phrases up to N terms long. For example, if we pick N = 5, we will store the next 5 terms after each term under the attribute. Since we can have multiple values, we just index the N next terms for every instance of the term. (Splitting as necessary.) Performing an exact phrase match is as simple as doing a "starts with" condition on the attribute for the first term in the phrase.

It's important we add a space to the end of the query and the value in the index though, otherwise the exact phrase match "hello world" would also match "hello worldly". If we store "hello world " (with a space) in the index and do the match with the space, we can be sure we always match the end of the term correctly. Note that we inflate our index in this case, if it's N = 5, our index could be 5 times larger than our source text. This is less of a problem though since we're using SimpleDB, which we're assuming is cheap.

To facilitate fields we can inject the field name into the attribute name. So if the field "tags" has the term "apple" we can put in an attribute "tags__apple" instead of just "tags". It is getting a little scary.

To facilitate sharding, we can stick the extra attributes to shard into the document item(s), no brainer.

This technique suffers from "AND deficiency" though, a major problem, which is outlined later.

Overall score: 6/10 (prefix matching is impossible, field matching is a pain, AND deficiency)

Item per Term /w Paging

In this approach, we create an Item for each term. Again, we split into multiple items as we run out of room in the attribute-value count. The attributes are the fields, so we don't have a ton of them ("content", "tags".) The values of the attributes are the id's of the documents where the term appears for that field. We can get away with this, because remember, attributes can have multiple values.

A big pain with this approach is that when you index a document, you first need to check if an item already exists for each term. If it does, you simply add the document id to the appropriate attribute(s). If not, create a new one (or split, if you've maxed out the current one.) You also need to keep a flag around noting which item for a term is considered the latest.

Having an update or insert approach vs. an insert only approach burns you big time with SimpleDB because of the elephant in the room: eventual consistency. Since SimpleDB makes no guarantees about when new items are going to show up, if you are going to check for an existing item and update it (as is the case here,) you can get burned if that item is not being reflected in the database yet. The way around this is to perform the operations in bulk, using an in-memory cache that is periodically flushed out to SimpleDB. You still run the risk that partially filled items will end up showing up in your index, though, since once you free some cache space if you keep indexing you may not see the new items for a few seconds.

Fields are facilitated as mentioned (they are the attributes), and we can get suffix matching by simply doing the search using starts-with on the term for the item.

Exact phrase matching is not possible without a hack. You can stick a list of the next N terms into an attribute separate from the list of source document id's, but the problem is even if you match you have no idea which of the many documents actually contains the phrase. (You just know at least *one* of them does.) SimpleDB won't guarantee the ordering of the attribute values on an item, since, as a general rule, you can't rely upon sorting or insertion order. (Otherwise the Mth term phrase match would be the Mth document id.) If you hack the id's for the documents onto the end of the terms lexically with a delimiter, however, you can pull them out afterwards. (As I said, an ugly hack.)

Finally, sharding causes a problem. Since there are many documents to an item, to introduce sharding we would have to add our sharding attributes to the item and then only include documents in the item for that item's shard. So, for example, if we have a folders attribute we want to shard against, if we have K folders, we will have up to K items for a given term since we cannot put two documents from different folders into the same item. (Otherwise, we'd get incorrect document id's if we constrained against the folder attribute.) All this complexity spills up into the "update or insert" method, so we can see this is spiraling out of hand.

This approach is also "AND deficient", outlined below.

Overall Score: 4/10 (Insert/Update technique with buffers, sharding is painful)

Item per Term per Document

This approach ends up being the simplest, really. We do not have any multiple-valued attributes. Each term for each document gets an item, so we'll have a lot of items, but at least remain sane. The attribute name is the field we saw the term in. We store the next N terms as the value as above, to support exact phrase matching. We get prefix matching for free with 'starts-with'. Sharding works by just adding attributes to the item, since there is just one document per item we don't get into any of the mess noted above. This is also a "insert only" technique, so we don't have to worry about eventual consistency woes. Huzzah!

But alas, this approach also suffers from the "AND deficiency" problem. It's finally time to explain.

Overall Score: 8/10 (AND Deficiency still stings)

AND Deficiency

To support boolean expressions, the naive approach would seem to just map "union" and "intersection" operators onto the search query, and run with it. For "OR" queries, we can get away with just "union"ing everything together, and get the set of documents. This applies to any of the above index techniques. "AND", however, is a bigger pain.

It's clear that if we have an item per term per document (the last indexing technique listed), "intersection" doesn't perform AND, since by definition there's only one term per item. (Ie, if we wanted to find documents with "apple" and "oranges", ['content'='apple'] intersection ['content'='orange'] always yields the empty set.)

Regardless of how we index, however, as long as there can be more than one item per document, we cannot rely upon the intersection operation to perform AND. Note that in all our index techniques above, we can have more than one item per document, since in the worst case we run out of our 256 attribute-value pairs per item and have to "split" the document into multiple items. Due to the 256 attribute-value pair limitation, it is impossible for us to avoid the reality that documents must span multiple items, and hence, we cannot perform AND strictly from within a SimpleDB query.

The reason the final index approach ends up being simplest, is because it embraces this limitation and lives with it while optimizing the index around all of our other use cases. It was only after banging my head against the wall with these other approaches that I swallowed deeply and came to grips to the fact that AND was just not going to be possible server side.

So, we need to do some merging on our web application server to support AND. To support AND, we simply need to just perform a set based intersection on the server instead of SimpleDB. Of course, this can get hairy if we match a lot of items for each term, but it's something we have to live with. Since we cannot sort the items on SimpleDB either, we can't perform anything similar to a merge join in memory. So, for 'apples AND oranges', we're stuck, and need to load all the id's matching apples, and then page through the result for oranges and discard those who did not show up in both. With simpleDB, there are probably ways to parallelize this in-memory intersection, and you can also be smart in how you order the fetching (ie, if there are many more apples than oranges) but I leave these as an exercise to the reader!


In the end, I was able to get the item per term per document index working correctly. I wrote a crappy query parser that builds a mini-AST and then recursively generates and runs each query, merging results at each node in the tree. The major pending issue is that, as stated, AND nodes result in the merging of result sets in memory on the server. (This also is necessary for OR nodes that have any AND nodes under them.) Fortunately, for my particular use case, I don't see this being much of a problem, and most of the heavy lifting is still going to be taking place on SimpleDB. I don't expect worst case to be any more than a 10000x10000 intersection, which should be pretty speedy. The nice thing about text search is provided you have a good tokenizer you can usually prune down the data set pretty well after the first term.

At the end of the day (literally, this was done over the course of 24 hours), I have a theoretically limitless textual index running out in the magical cloud, complete with shards, fields, suffix matching, and exact phrase matching, with a, albeit crappy, boolean operator implementation that works. Sweet.