Thursday, April 24, 2008

SimpleDB Full Text Search, or How to Create a Search Engine in 24 Hours

SimpleDB Primer

I've started digging into Amazon SimpleDB, which is one of these newfangled "Hashtable in the cloud" systems that allows you to scale from 1MB to 100GB datasets with relative ease. It has a rather limited data model, though; here are some of the highlights of things you can bump your head against:
  • The main paradigm is you store items in domains. As of right now, domains are limited to 10GB and you are limited to 100 domains.
  • Items contain one or more attribute-value pairs. Each attribute can have multiple values, though you are limited to 256 attribute-value pairs per item. Furthermore, a given value for an attribute can be at most 1024 bytes. Within a domain, the rumor is you can store at most 250 million attribute-value pairs across all items. I.e., if the database were a spreadsheet, you could have at most 250 million cells filled in.
  • Queries let you perform range & equality operators (including 'starts-with') comparing attribute values to constant values. For example ['name' = 'greg'] will match all items with the name attribute equal to the string 'greg'. Queries can have multiple conditions on a given attribute, and can union and intersect sets of results across different attributes. Evaluation is from left to right, though, so you want to put your filter conditions that remove the most items first; you cannot induce precedence through parentheses.
  • Queries return paged result sets of item names. Item names can then be pulled in parallel from SimpleDB to get the full set of attribute-value pairs for each item.
  • All values are compared lexicographically, so you need to pad integer values with zeroes, etc. (see the padding sketch just after this list).
  • SimpleDB grants you eventual consistency. This means you can insert an item and not see it for a few seconds. We'll return to this pleasant feature below.
  • You can't sort. Seriously.
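Since everything is compared as strings, numeric values need to be stored zero-padded so lexicographic order agrees with numeric order. A minimal sketch (the padNumber helper is made up; the bracketed query string just mirrors the syntax used in the examples above):

// Zero-pad integers so string comparison agrees with numeric comparison:
// "42" > "100" lexicographically, but "0000000042" < "0000000100".
function padNumber(n: number, width: number = 10): string {
  return n.toString().padStart(width, "0");
}

// An illustrative query in the bracketed syntax above, comparing a padded
// attribute value against a padded constant:
const query = `['name' = 'greg'] intersection ['age' >= '${padNumber(21)}']`;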
Inverted Index on SimpleDB

I'm working on a new web application, and one of its centerpieces is going to be quick, painless text search through reams of data. I have spent the last day or so taking a stab at getting an inverted index working on SimpleDB, with some success. Here's what I want to be able to do at the end of the day:

  • Query for specific terms, with AND and OR boolean conditions. (So, 'this OR that' will match items with either 'this' or 'that' in the index, etc.) AND is implicit between terms.
  • Support prefix matching, so 'app*' will match anything that has a term that begins with "app"
  • Support exact phrase matching with quotes, so '"merry go round"' (with quotes) will only match documents containing those terms in that exact order.
  • Support fields, so I can index a document (similar to Lucene) with multiple fields like "content" and "tags" and search within fields. So 'tags:oranges OR tags:apples' will search for items with the field "tags" containing the term oranges or the term apples.
  • Support 'sharding' - I may have additional attributes I want to include in the index to help limit the search before searching for a term. For example, documents may belong to a folder, and I might want to perform a text search *within* a folder.
Terms & Analyzer

To avoid re-inventing the wheel, I use Ferret's DefaultAnalyzer to tokenize my input and filter my terms. I won't dig into it here; go pick up a book on Lucene if you want to learn how to do this.
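(For the sketches in this post, something like the following is a rough stand-in for what an analyzer produces; it just lowercases, splits on non-word characters, and drops a few stop words, which is far cruder than Ferret's actual filter chain.)

// Crude analyzer stand-in: lowercase, split on non-word characters,
// drop a handful of stop words. Real analyzers (Ferret/Lucene) do more.
const STOP_WORDS = new Set(["a", "an", "and", "or", "the", "of", "to"]);

function analyze(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter(term => term.length > 0 && !STOP_WORDS.has(term));
}

// analyze("The Merry Go Round of Oranges") => ["merry", "go", "round", "oranges"]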

Index Approaches Considered


Disclaimer: I am a relative newb when it comes to this stuff. I am posting this because I have yet to see anyone discuss it in detail. I am sure a search guru will pwn me; do it in the comments!

The overall goal is to build an inverted index of terms that will let us quickly look up which documents contain which terms. This isn't revolutionary, but the tricky part is coming up with an index format that fits into SimpleDB and gets me all my search features. Here are a few approaches I tried or considered initially:

Item per Document w/ Splitting

So, we create an item per document. For each term in the document, we add an attribute-value pair with the attribute=term and the value=empty string. When we run out of attribute-value pairs to use for an item (remember, we can only have 256), we split into another item ("paging"). A document may therefore span multiple items, and each item has a bunch of blank values under each of its attributes, which are named after its terms.

To find items containing a term, for example 'apple', just do the query ['apple' starts-with ''], intersecting for multiple term matches. On its surface this approach seems to work, but it has some problems.
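Here's a sketch of what indexing a document looks like under this scheme, assuming a hypothetical putItem(domain, itemName, attributes) call in place of the real SimpleDB client:

// Hypothetical client call; the real thing is a signed PutAttributes request.
declare function putItem(
  domain: string,
  itemName: string,
  attributes: Record<string, string[]>
): Promise<void>;

const MAX_PAIRS = 256;

// One (or more) items per document: attribute name = term, value = "".
// When a document has more than 256 distinct terms, spill into another item.
async function indexDocument(docId: string, terms: string[]): Promise<void> {
  const uniqueTerms = Array.from(new Set(terms));
  for (let page = 0; page * MAX_PAIRS < uniqueTerms.length; page++) {
    const chunk = uniqueTerms.slice(page * MAX_PAIRS, (page + 1) * MAX_PAIRS);
    const attributes: Record<string, string[]> = {};
    for (const term of chunk) {
      attributes[term] = [""];
    }
    await putItem("documents", `${docId}__${page}`, attributes);
  }
}

// The 'apple AND orange' search is then:
//   ['apple' starts-with ''] intersection ['orange' starts-with '']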

First, we cannot do prefix matching, since the 'starts-with' operator only applies to the attribute values, not the attribute names. We *could* inject attributes for every substring of each term (ie, "hello" would show up as "h_", "he_", "hel_", up to "hello") but we would run out of tuple space quickly.

To facilitate exact phrase matching, we can put the next N terms into the index under each attribute, instead of the empty string. This will allow us to match exact phrases up to N terms long. For example, if we pick N = 5, we will store the next 5 terms after each term under the attribute. Since attributes can have multiple values, we just index the next N terms for every instance of the term (splitting as necessary). Performing an exact phrase match is then as simple as a "starts-with" condition on the attribute for the first term in the phrase.

It's important we add a space to the end of the query and the value in the index, though; otherwise the exact phrase match "hello world" would also match "hello worldly". If we store "hello world " (with a trailing space) in the index and do the match with the space, we can be sure we always match the end of the last term correctly. Note that we inflate our index in this case: with N = 5, our index could be 5 times larger than our source text. This is less of a problem, though, since we're using SimpleDB, which we're assuming is cheap.
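A sketch of building those attribute values; going by the "hello world " example, this assumes the stored value includes the leading term itself plus the terms that follow it, with the trailing space:

const N = 5;

// For each position in the analyzed token stream, build the value stored
// under that position's term: the term plus up to N following terms,
// space-joined, with a trailing space so "hello world" can't match
// "hello worldly".
function phraseValues(terms: string[]): Map<string, string[]> {
  const values = new Map<string, string[]>();
  for (let i = 0; i < terms.length; i++) {
    const window = terms.slice(i, i + N + 1).join(" ") + " ";
    const existing = values.get(terms[i]) ?? [];
    existing.push(window); // attributes can hold multiple values
    values.set(terms[i], existing);
  }
  return values;
}

// An exact-phrase query for "merry go round" becomes a starts-with
// condition on the 'merry' attribute with the value "merry go round ".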

To facilitate fields, we can inject the field name into the attribute name. So if the field "tags" has the term "apple", we can put in an attribute "tags__apple" instead of just "apple". It is getting a little scary.

To facilitate sharding, we can stick the extra attributes we shard on into the document item(s); no-brainer.

This technique also suffers from "AND deficiency", a major problem which is outlined later.

Overall score: 6/10 (prefix matching is impossible, field matching is a pain, AND deficiency)

Item per Term w/ Paging

In this approach, we create an item for each term. Again, we split into multiple items as we run out of room in the attribute-value count. The attributes are the fields, so we don't have a ton of them ("content", "tags"). The values of the attributes are the id's of the documents where the term appears for that field. We can get away with this because, remember, attributes can have multiple values.

A big pain with this approach is that when you index a document, you first need to check if an item already exists for each term. If it does, you simply add the document id to the appropriate attribute(s). If not, create a new one (or split, if you've maxed out the current one.) You also need to keep a flag around noting which item for a term is considered the latest.

Having an update-or-insert approach vs. an insert-only approach burns you big time with SimpleDB because of the elephant in the room: eventual consistency. Since SimpleDB makes no guarantees about when new items are going to show up, if you are going to check for an existing item and update it (as is the case here), you can get burned if that item is not yet reflected in the database. The way around this is to perform the operations in bulk, using an in-memory cache that is periodically flushed out to SimpleDB. You still run the risk of partially filled items showing up in your index, though: once you flush some cache space and keep indexing, you may not see the newly written items for a few seconds.
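A sketch of that write buffer, with hypothetical getItem/putItem calls standing in for the SimpleDB client (and ignoring the 256-pair splitting for brevity):

declare function getItem(domain: string, itemName: string):
  Promise<Record<string, string[]> | null>;
declare function putItem(domain: string, itemName: string,
  attributes: Record<string, string[]>): Promise<void>;

// Buffer term -> document ids in memory and flush periodically, so each
// term item is read-modified-written once per flush instead of once per
// document. The read is still exposed to eventual consistency if the same
// term was flushed only a few seconds earlier.
class TermBuffer {
  private pending = new Map<string, Set<string>>();

  add(term: string, docId: string): void {
    const ids = this.pending.get(term) ?? new Set<string>();
    ids.add(docId);
    this.pending.set(term, ids);
  }

  async flush(): Promise<void> {
    for (const [term, docIds] of this.pending) {
      const existing = (await getItem("terms", term)) ?? {};
      const merged = new Set([...(existing["content"] ?? []), ...docIds]);
      await putItem("terms", term, { content: Array.from(merged) });
    }
    this.pending.clear();
  }
}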

Fields are facilitated as mentioned (they are the attributes), and we can get prefix matching by simply doing the search using starts-with on the term for the item.

Exact phrase matching is not possible without a hack. You can stick a list of the next N terms into an attribute separate from the list of source document id's, but the problem is even if you match you have no idea which of the many documents actually contains the phrase. (You just know at least *one* of them does.) SimpleDB won't guarantee the ordering of the attribute values on an item, since, as a general rule, you can't rely upon sorting or insertion order. (Otherwise the Mth term phrase match would be the Mth document id.) If you hack the id's for the documents onto the end of the terms lexically with a delimiter, however, you can pull them out afterwards. (As I said, an ugly hack.)

Finally, sharding causes a problem. Since there are many documents to an item, to introduce sharding we would have to add our sharding attributes to the item and then only include documents in the item for that item's shard. So, for example, if we have a folders attribute we want to shard against and there are K folders, we will have up to K items for a given term, since we cannot put two documents from different folders into the same item. (Otherwise, we'd get incorrect document id's if we constrained against the folder attribute.) All this complexity spills over into the "update or insert" method, so we can see this is spiraling out of hand.

This approach is also "AND deficient", outlined below.

Overall Score: 4/10 (Insert/Update technique with buffers, sharding is painful)

Item per Term per Document


This approach ends up being the simplest, really. We do not have any multiple-valued attributes. Each term for each document gets an item, so we'll have a lot of items, but at least we remain sane. The attribute name is the field we saw the term in. We store the next N terms as the value, as above, to support exact phrase matching. We get prefix matching for free with 'starts-with'. Sharding works by just adding attributes to the item; since there is just one document per item, we don't get into any of the mess noted above. This is also an "insert-only" technique, so we don't have to worry about eventual consistency woes. Huzzah!
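A sketch of indexing one document this way, again with a hypothetical putItem call; the field name is the attribute, the value is the term plus its next-N-terms window (as in the phrase-matching example above), and shard attributes are simply copied onto each item:

declare function putItem(domain: string, itemName: string,
  attributes: Record<string, string[]>): Promise<void>;

const N = 5;

// Insert-only: one item per term occurrence per document, so we never
// read-modify-write, and eventual consistency can't bite us while indexing.
async function indexDocument(
  docId: string,
  fields: Record<string, string[]>,    // field name -> analyzed terms
  shard: Record<string, string> = {}   // e.g. { folder: "inbox" }
): Promise<void> {
  for (const [field, terms] of Object.entries(fields)) {
    for (let i = 0; i < terms.length; i++) {
      const window = terms.slice(i, i + N + 1).join(" ") + " ";
      const attributes: Record<string, string[]> = { [field]: [window] };
      for (const [key, value] of Object.entries(shard)) {
        attributes[key] = [value];
      }
      // Item names only need to be unique; the document id can be parsed
      // back out of the names a query returns.
      await putItem("index", `${docId}__${field}__${i}`, attributes);
    }
  }
}

// 'tags:app*' within the "inbox" folder becomes:
//   ['tags' starts-with 'app'] intersection ['folder' = 'inbox']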

But alas, this approach also suffers from the "AND deficiency" problem. It's finally time to explain.

Overall Score: 8/10 (AND Deficiency still stings)

AND Deficiency


To support boolean expressions, the naive approach would seem to just map "union" and "intersection" operators onto the search query, and run with it. For "OR" queries, we can get away with just "union"ing everything together, and get the set of documents. This applies to any of the above index techniques. "AND", however, is a bigger pain.

It's clear that if we have an item per term per document (the last indexing technique listed), "intersection" doesn't perform AND, since by definition there's only one term per item. (I.e., if we wanted to find documents with "apple" and "orange", ['content'='apple'] intersection ['content'='orange'] always yields the empty set.)

Regardless of how we index, however, as long as there can be more than one item per document, we cannot rely upon the intersection operation to perform AND. Note that in all our index techniques above, we can have more than one item per document, since in the worst case we run out of our 256 attribute-value pairs per item and have to "split" the document into multiple items. Due to the 256 attribute-value pair limitation, we cannot avoid the reality that a document may span multiple items, and hence we cannot perform AND strictly from within a SimpleDB query.

The reason the final index approach ends up being simplest is that it embraces this limitation and lives with it while optimizing the index around all of our other use cases. It was only after banging my head against the wall with these other approaches that I swallowed deeply and came to grips with the fact that AND was just not going to be possible server side.

So, we need to do some merging on our web application server to support AND: a set-based intersection performed on the server instead of on SimpleDB. Of course, this can get hairy if we match a lot of items for each term, but it's something we have to live with. Since we cannot sort the items on SimpleDB either, we can't perform anything similar to a merge join in memory. So, for 'apples AND oranges', we're stuck: we need to load all the id's matching apples, then page through the results for oranges and discard those that did not show up in both. With SimpleDB, there are probably ways to parallelize this in-memory intersection, and you can also be smart in how you order the fetching (i.e., if there are many more apples than oranges), but I leave these as an exercise to the reader!
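Here's a sketch of that in-memory intersection, assuming a hypothetical queryDocIds(condition) that runs the SimpleDB query, pages through every result, and maps item names back to document ids; ideally you'd order the conditions so the smallest expected result set is fetched first:

// Hypothetical: runs a query such as "['content' starts-with 'apple ']",
// pages through all result item names, and returns the distinct doc ids.
declare function queryDocIds(condition: string): Promise<Set<string>>;

// AND happens on the app server: fetch each term's documents and
// intersect the sets in memory.
async function andSearch(conditions: string[]): Promise<Set<string>> {
  let result: Set<string> | null = null;
  for (const condition of conditions) {
    const ids = await queryDocIds(condition);
    result = result === null
      ? ids
      : new Set([...result].filter(id => ids.has(id)));
    if (result.size === 0) break; // short-circuit: nothing can match now
  }
  return result ?? new Set<string>();
}

// andSearch(["['content' starts-with 'apple ']",
//            "['content' starts-with 'orange ']"])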

Conclusions

In the end, I was able to get the item per term per document index working correctly. I wrote a crappy query parser that builds a mini-AST and then recursively generates and runs each query, merging results at each node in the tree. The major pending issue is that, as stated, AND nodes result in the merging of result sets in memory on the server. (This is also necessary for OR nodes that have any AND nodes under them.) Fortunately, for my particular use case, I don't see this being much of a problem, and most of the heavy lifting is still going to take place on SimpleDB. I don't expect the worst case to be any more than a 10000x10000 intersection, which should be pretty speedy. The nice thing about text search is that, provided you have a good tokenizer, you can usually prune down the data set pretty well after the first term.
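The evaluation itself looks roughly like this (a sketch; the real parser also has to deal with quotes, prefixes, and field names, and this reuses the hypothetical queryDocIds from the previous sketch):

declare function queryDocIds(condition: string): Promise<Set<string>>;

// Mini-AST: leaves are single SimpleDB conditions, inner nodes merge
// their children's result sets in memory.
type QueryNode =
  | { kind: "term"; condition: string }
  | { kind: "and" | "or"; children: QueryNode[] };

async function evaluate(node: QueryNode): Promise<Set<string>> {
  if (node.kind === "term") {
    return queryDocIds(node.condition);
  }
  // Assumes every AND/OR node has at least one child.
  const childSets = await Promise.all(node.children.map(evaluate));
  return childSets.reduce((acc, ids) =>
    node.kind === "or"
      ? new Set([...acc, ...ids])                   // union
      : new Set([...acc].filter(id => ids.has(id))) // intersection
  );
}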

At the end of the day (literally, this was done over the course of 24 hours), I have a theoretically limitless textual index running out in the magical cloud, complete with shards, fields, prefix matching, and exact phrase matching, with an albeit crappy boolean operator implementation that works. Sweet.

Thursday, January 17, 2008

The Pain of Cross-Domain

In my last post, I talked about S3 Caching of RESTful services. The sad part about this approach is that you inevitably will run into cross-domain issues if you build your client to hit S3. So, in this post, I'll run down what techniques you can do to deal with cross-domain issues.

Cross Domain Flash
This is the easiest one. Flash natively supports cross-domain communication, but only via a whitelist published on the server hosting the data. So, for example, if your Flash file is hosted at mydomain.com and it wants to talk to myservice.com, the myservice.com host must publish a crossdomain.xml file that whitelists mydomain.com. Once whitelisted, the communication can take place. Below is an example of a crossdomain.xml file you'd place at the root of myservice.com's server to whitelist Flash widgets loaded from mydomain.com.

<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy
  SYSTEM "http://www.macromedia.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
  <allow-access-from domain="mydomain.com" />
</cross-domain-policy>

For S3, just throw up a crossdomain.xml file with public read access for the domain you host your Flash from, and you're good to go.

Cross Domain AJAX
For cross-domain AJAX (communication via Javascript) you have a few different options. The one most people will point you to is to use a proxy. Specifically, you build a web service, hosted in your own domain, that retrieves and serves content from another domain. I don't like this approach because it pushes more load onto your infrastructure, and the main goal behind S3 Caching is to take load off of your servers. (In addition to minimizing points of failure.)

We could dedicate a web server in our domain to just shuffle data between S3 and our clients as a proxy, but in that case, we might as well store our files locally on disk instead of using S3. S3 caching is meant to be used for scenarios where even adding a single machine to your enterprise is cost-prohibitive (for example, if you are bootstrapping), so we shouldn't consider this approach.

The solution I've settled upon is to use a library by a clever fellow named Jim Wilson called SWFHttpRequest. It provides an interface nearly identical to XMLHttpRequest, so it can be relatively easily dropped into existing Javascript frameworks. For example, to use SWFHttpRequest with Prototype, you simply add this code:
Ajax.getTransport = function() {
  return Try.these(
    function() { return new SWFHttpRequest() },
    function() { return new XMLHttpRequest() },
    function() { return new ActiveXObject('Msxml2.XMLHTTP') },
    function() { return new ActiveXObject('Microsoft.XMLHTTP') }
  ) || false;
};
I use OpenLaszlo, and here's a bit of code that will get your DHTML Laszlo apps to use it:

if (typeof(LzHTTPLoader) != "undefined") {
  // Laszlo: override the HTTP loader's open() so it prefers SWFHttpRequest.
  LzHTTPLoader.prototype.open = function (method, url, username, password) {
    if (this.req) {
      Debug.warn("pending request for id=%s", this.__loaderid);
    }
    try {
      // Use the cross-domain Flash transport if the SWF has loaded.
      this.req = new SWFHttpRequest();
    } catch (err) {
      // Otherwise fall back to the browser's native transport.
      this.req = window.XMLHttpRequest ? new XMLHttpRequest()
                                       : new ActiveXObject("Microsoft.XMLHTTP");
    }
    this.__abort = false;
    this.__timeout = false;
    this.requesturl = url;
    this.requestmethod = method;
  };
}

I basically just took the code from the Laszlo source and jacked in a try/catch that attempts to bind to an SWFHttpRequest if it is available.

The downsides? First and foremost, this requires Flash 9. We're mostly working with rich clients here, though, and Flash 9 has 95% penetration, so we'll deal. Additionally, code must be updated so that it does not begin AJAX requests until SWFHttpRequest has been loaded. Basically, you just need to hold off until the global SWFHttpRequest has been defined, and you're good to go. (Of course, you can omit this step if you don't require cross-domain AJAX on start-up.)
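A sketch of that gating; the poll interval is arbitrary:

// Hold off on cross-domain requests until the SWF has registered the
// global SWFHttpRequest constructor.
function whenSwfReady(callback: () => void, intervalMs: number = 100): void {
  const timer = window.setInterval(() => {
    if (typeof (window as any).SWFHttpRequest !== "undefined") {
      window.clearInterval(timer);
      callback();
    }
  }, intervalMs);
}

// Usage: queue the first cross-domain request behind the check.
whenSwfReady(() => {
  // safe to fire Ajax.Request / SWFHttpRequest calls from here
});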

Cross-Domain IFrame Communication
The last (and arguably most painful) type of cross-domain communication is between frames. Tassl has several nested IFRAMEs which are used for drawing HTML over the OpenLaszlo Flash app. There are cases where I need to trigger events in the outer frame from the inner frames, which are generally cross-domain. (The inner frames will often have content loaded from S3.)

Well, the way you can do it is by using the fragment identifier. The general technique is outlined by a Dojo developer. In short: the inner frame sets the location of the outer frame to be the same as it was, plus a fragment identifier which contains the data to send. The outer frame polls its own location to check for changes in the fragment identifier. When there is a change, the outer frame assumes it came from the inner frame and takes the fragment as the data.
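In code, the basic version of the technique looks roughly like this (the outer frame URL is a made-up placeholder; the inner frame has to know it, since it can't read the outer frame's location cross-domain):

// Inner frame: send a message to the outer frame by re-navigating it to
// the same URL with a new fragment identifier.
function sendToOuter(message: string): void {
  const outerUrl = "http://www.mydomain.com/outer.html"; // placeholder
  window.parent.location.href = outerUrl + "#" + encodeURIComponent(message);
}

// Outer frame: poll its own fragment for changes and treat each new
// value as a message from the inner frame.
function pollForMessages(onMessage: (msg: string) => void): void {
  let last = window.location.hash;
  window.setInterval(() => {
    if (window.location.hash !== last) {
      last = window.location.hash;
      onMessage(decodeURIComponent(last.slice(1)));
    }
  }, 100);
}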

The naive technique worked until IE7 came along, and then things got messy when you have an iframe within an iframe and the two need to talk to one another. Microsoft responded with an explanation. I'll re-hash the solution here.

The following set up works in IE7:

-- outer frame (hosted at facebook.com) (Frame Z)
   -- app iframe (hosted at secure.tassl.com) (Frame A)
      -- app content iframe (hosted at S3) (Frame B)
         -- data pipe iframe (hosted at secure.tassl.com) (Frame C)
Note that this issue doesn't occur if there is no "Frame Z".

So, the innermost application iframe (Frame B) must now include its own iframe for communication (Frame C), as opposed to before, when it could simply pass a fragment identifier directly to Frame A.

To send data from Frame B to Frame A, Frame B must set the location on Frame C in the same way it previously set it directly on Frame A. Note that it must do this via window.open(url, nameOfFrameC), instead of trying to touch Frame C's location object directly. By using window.open, you can avoid angering IE's cross-domain security checks.

To receive the data from Frame C, Frame A must have a polling routine that does the following:
  • Checks to see if Frame C exists in the DOM. If not, bail out. (Using window.document.frames (in IE) or window.frames (in Firefox/Safari/Opera).)
  • Acquire a reference to Frame C using another window.open trick:
    • var iframeC = window.open("", nameOfFrameC);
  • Read and decode iframeC.document.location.hash in the same way as the pre-IE7 technique.
So, by loading Frame C in the same domain as Frame A, and using the two window.open techniques, you are able to communicate from Frame B to Frame A. It's hairy, but it works. Make sure you integration test!
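Putting the polling routine into code, with nameOfFrameC and the interval as placeholders, Frame A ends up with something like this:

// Frame A (secure.tassl.com): poll the same-domain data pipe iframe
// (Frame C) for fragments written by Frame B.
function pollFrameC(nameOfFrameC: string,
                    onMessage: (msg: string) => void): void {
  let last = "";
  window.setInterval(() => {
    // Bail out if Frame C isn't in the DOM yet.
    const frames: any = (window.document as any).frames || window.frames;
    if (!frames || !frames[nameOfFrameC]) return;

    // window.open with an empty URL returns a handle to the existing named
    // frame without navigating it, which sidesteps IE's security checks.
    const frameC = window.open("", nameOfFrameC);
    if (!frameC) return;

    const hash = frameC.document.location.hash;
    if (hash && hash !== last) {
      last = hash;
      onMessage(decodeURIComponent(hash.slice(1)));
    }
  }, 100);
}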