Friday, December 21, 2007

Caching RESTful Services on S3

One of the best ways to speed up a web application is to introduce some form of caching. Fragment caching on app servers, object caching using memcached, and asset caching on S3 are all good tricks to take some of the load off your servers. In this post I'm going to talk about how to dynamically cache web services on Amazon S3, creating an intermediate cache between your web browser and the app servers.

Client-First Development: Revisited

The main point of the previous post was to show how you can do Behavior Driven Development with a RIA (in AJAX, Flex, or OpenLaszlo) by stubbing out a REST service on your filesystem. This works great with Rails: because your files can be served as static fixtures, you can quickly mold your REST service's resources to meet the demands of your client. You can write the client before the server, which results in less programming since you "get it right" on the server on the first try.

Well, there's another little trick we can do once you've seen this in action.

S3 as a RESTful Cache

Amazon S3 allows you to throw files up into "buckets" (using a RESTful service, no less) which can then be retrieved at a URL. A file named "houses/123/location.xml" in the bucket "cache.codingthriller.com" can be retrieved at http://cache.codingthriller.com/houses/123/location.xml, if you just configure your DNS with a CNAME pointing cache.codingthriller.com at s3.amazonaws.com.

Are you thinking what I'm thinking? That's right, we can "stub" out our RESTful service on S3 just like we did on the file system, since the pattern is exactly the same. The difference is this time instead of it being fixtures for testing, S3 will actually store a cached version of our data. This can be a *huge* win, since all requests to S3 will skip over our server farm altogether, going to the "infinitely" scalable Amazon data center!

How the RESTful Cache Works

Having a rich client in AJAX or Flash is key for this to work smoothly. If you did client-first development, you already have code in place that makes it possible for you to "point" your client to a REST service. When building it, you pointed it to the static fixture files. When the server is running, you point it to the server to get real, dynamic data.

Since you've abstracted it, it's not too much of a leap to just point it to your S3 cache. If you designed your resources correctly, there shouldn't be any major problems with getting the same data from S3 vs. the application servers.

So, for every request to the REST service in your RIA, you should break it up into two steps. Say we are looking for "/houses/123/location.xml":
  • Check http://cache.codingthriller.com/houses/123/location.xml. If we get it, great.
  • If we 404, go to http://codingthriller.com/houses/123/location.xml. (This is our real server.)
That first request is generally pretty snappy, since it's hitting Amazon, and if we managed to get that data, we've avoided a full request to our Rails app. Money!
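
In the RIA this lookup lives in ActionScript/JavaScript/OpenLaszlo, but the logic is simple enough to sketch in Ruby. A minimal sketch, using the host names from the example above; fetch_resource is just an illustrative helper name:

    require 'net/http'
    require 'uri'

    CACHE_HOST  = 'http://cache.codingthriller.com'   # the S3 bucket, via the DNS CNAME
    ORIGIN_HOST = 'http://codingthriller.com'         # the real Rails app

    # Try the S3 cache first; if it 404s (or anything else goes wrong),
    # fall back to the real server.
    def fetch_resource(path)
      cached = Net::HTTP.get_response(URI.parse(CACHE_HOST + path))
      return cached.body if cached.is_a?(Net::HTTPSuccess)
      Net::HTTP.get_response(URI.parse(ORIGIN_HOST + path)).body
    end

    location_xml = fetch_resource('/houses/123/location.xml')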

Pushing it out

Of course, this seems great and all, but how does the data get to S3? It's probably not the best idea to block the request until the data is pushed, so I've built an offline service that goes through and pushes the data using a simple queue.

I created a model class called S3Push that stores the URI and the data. When a request comes in that I want to cache, I have an after_filter that pulls the body from the response and stores it into the database. Then I have a simple service that just goes through these rows and pushes them over to S3 into the right bucket.
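
A minimal sketch of that capture step (the S3Push model and the after_filter are as described; the controller name, the cached actions, and the column names are assumptions for illustration):

    # Assumed table: s3_pushes(uri string, data text, created_at datetime)
    class S3Push < ActiveRecord::Base
    end

    class HousesController < ApplicationController
      # Capture the rendered body of cacheable actions and queue it for S3.
      after_filter :queue_s3_push, :only => [:show]

      private

      def queue_s3_push
        S3Push.create(:uri => request.request_uri, :data => response.body)
      end
    end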

There are a lot of details involved, but the main point is that once a request comes in, it will eventually propagate to an identical URL on S3!
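
The pusher itself can be as dumb as a loop that drains the table. Something like this, assuming the aws-s3 gem and a script run with the Rails environment loaded (e.g. via script/runner); the batch size and sleep are arbitrary:

    require 'rubygems'
    require 'aws/s3'

    AWS::S3::Base.establish_connection!(
      :access_key_id     => ENV['AMAZON_ACCESS_KEY_ID'],
      :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']
    )

    BUCKET = 'cache.codingthriller.com'

    # Drain the queue: store each captured response at the identical path on S3,
    # then delete the row so it isn't pushed twice.
    loop do
      S3Push.find(:all, :limit => 50).each do |row|
        AWS::S3::S3Object.store(row.uri.sub(/^\//, ''), row.data, BUCKET,
                                :access       => :public_read,
                                :content_type => 'application/xml')
        row.destroy
      end
      sleep 10
    end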

Invalidation

Of course, it's important to invalidate this cache -- you don't want to serve up stale data. There are two types of invalidation: direct invalidation and timeout invalidation.

On the server side, if a request is made that affects resources in the S3 cache, you can just submit a DELETE request to S3 to remove the data. For example, if the location of a house changes via an HTTP PUT to the resource, you can just DELETE the resource from S3. Once another request comes in for that resource, it will enqueue it to be re-cached.
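
For example, again assuming the aws-s3 gem (with the connection configured in the environment) and hypothetical controller/action names:

    class HousesController < ApplicationController
      # A PUT that changes a house's location should evict the cached copy.
      after_filter :invalidate_location_cache, :only => [:update]

      private

      def invalidate_location_cache
        AWS::S3::S3Object.delete("houses/#{params[:id]}/location.xml",
                                 'cache.codingthriller.com')
      end
    end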

Timeout invalidation is a bit trickier.

Timeout Invalidation

If you are able to always invalidate the cache correctly, then you are done. However, sometimes you can't be sure when something has become stale. (For example, data you pulled from an external source may change without you being notified.)

One way of taking care of this problem is to have the server periodically remove data from S3 that it thinks might be stale, using a cron job or background service. This is a perfectly legitimate way to do things.

However, this could introduce a 'single point of failure' in your web server farm: if the one machine periodically cleaning out the S3 cache dies, stale data could be "stuck" indefinitely. This is a different problem than having your "push to S3" service die, since in that case you simply lose the performance benefits of the cache. Painful for your servers, yes, but probably not a show-stopper.

Client-Based Invalidation

So, the approach I took was a client-centric one. While the server still has the final say when something is DELETE'd from S3, I try to take advantage of the rich clients I have running within all my users' browsers.

For this to work, the algorithm has to change a bit. (This change becomes useful later, when we introduce security!) For resource "/houses/123/location.xml", we now:
  • Check for "http://cache.codingthriller.com/houses/123/location.xml.cinf"
    • This is our Cache Metadata, stored on S3
  • If we found it, check the metadata for expiration:
    • If the cache entry has expired
      • Send DELETE => http://www.codingthriller.com/cinf/houses/123/location.xml.cinf
        • Causes a DELETE from S3 if it's really expired!
      • Do GET => http://www.codingthriller.com/houses/123/location.xml
        • Causes a push to the cache!
    • If not, look in the metadata for the URI of the cached data (say, ba48f927.xml)
      • Do GET => http://cache.codingthriller.com/cache/07-02-2008/ba48f927.xml
        • Is a cached hit! We never talked to our real servers.
  • If there is no cached metadata:
    • Do GET => http://www.codingthriller.com/houses/123/location.xml
      • Causes a push to the cache!
So, the algorithm has gotten a bit more complex on the client side. We now have an intermediate "cinf file" that stores cache metadata: the time the data expires, as well as a SHA hash key.

If the cache has expired, we submit a DELETE to the real server under the /cinf resource, which will then perform an S3 DELETE if the item is truly expired, and then we do the GET for the data from the real server. (Note that we can now invalidate a cached item just by DELETE'ing its .cinf from S3, since clients will cause a re-push if there is no metadata.) If it hasn't expired, we use the SHA hash key and go to a URI under /cache to grab the data at that key.
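
Sketching that flow in Ruby again (the real client is the RIA; here I'm assuming the .cinf is a little YAML document with expires_at, key, and folder fields, just to make the sketch concrete):

    require 'net/http'
    require 'uri'
    require 'yaml'
    require 'time'

    CACHE  = 'http://cache.codingthriller.com'
    ORIGIN = 'http://www.codingthriller.com'

    def http_get(url)
      Net::HTTP.get_response(URI.parse(url))
    end

    def http_delete(url)
      uri = URI.parse(url)
      Net::HTTP.start(uri.host, uri.port) { |http| http.delete(uri.path) }
    end

    def fetch_with_cinf(path)
      cinf = http_get("#{CACHE}#{path}.cinf")
      # No metadata at all: just hit the real server (which re-enqueues a push).
      return http_get("#{ORIGIN}#{path}").body unless cinf.is_a?(Net::HTTPSuccess)

      meta = YAML.load(cinf.body)   # assumed: { 'expires_at' => ..., 'key' => ..., 'folder' => ... }
      if Time.parse(meta['expires_at'].to_s) < Time.now
        http_delete("#{ORIGIN}/cinf#{path}.cinf")   # server DELETEs from S3 if truly expired
        http_get("#{ORIGIN}#{path}").body           # and this hit causes a re-push
      else
        # Cache hit: the data sits at the hash-keyed URI from the metadata.
        http_get("#{CACHE}/cache/#{meta['folder']}/#{meta['key']}.xml").body
      end
    end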

So in the end, you implement this increased complexity on the client side and also need to add a cinf_controller that will delete the item from S3. What's nice is in AJAX or Flash you can perform the DELETE asynchronously, so your client does not feel any of the pain of waiting for the server to do the DELETE on S3.

Also, the "push to S3" server needs to be updated to generate the .cinf file and push it to S3 in addition to pushing the data. In that .cinf file you just include a unique hash and a timestamp. It can be useful to store the cached data in folders named by day, so you can quickly wipe out old data, as well. (As seen above in /cache/07-02-2008/ba48f927.xml)

Secure Resources

It's often the case that certain resources are only accessible by certain users. Your controllers might return an 'Access Denied' response, for example, for "/houses/123/location.xml" unless the requestor has logged in as the owner of the house. We can keep this security enforcement in our S3 cache, as well.

As noted above, the cached data now resides in a SHA hash keyed resource. This SHA hash can effectively serve as a "password" for the resource. A decent way to generate this hash is to salt the URI of the resource being cached. So, we'd hash "/houses/123/location.xml" + some random string, and that is where the data would get stored. This hash is included in the .cinf metadata file, so the client knows where to go get it.

But, we can do something better. If we split data accessibility into "security roles", and assign a unique key to each role, we can secure these cached resources. When the client starts up, you pass in the security roles of the logged-in user. For example, an administrator role might have the key "abc123", which will be given to all clients logged in as administrators. It's important that these keys be transmitted over HTTPS, and not be persisted in cookies!

Now, when it comes time to push the data to S3, instead of pushing it to the hash, we push multiple copies of it, one for each security role. For example, if administrators can see this resource, we take the original hash we were going to store the data at, and salt it with the key for the administrator role. It now becomes impossible for a client which does not know this key to find the data on S3!
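
In the pusher, that means one copy per role instead of one copy total. A sketch (the role keys and where they live are assumptions; in practice they'd sit in the database and be rotated); base_key and folder are the same values computed in push_with_metadata above:

    require 'digest/sha1'

    ROLE_KEYS = { 'admin' => 'abc123', 'owner' => 'def456' }   # assumed role => key map

    # Store one copy of the cached body per role allowed to see it. Each copy lives
    # at the base hash re-salted with that role's secret key, so only clients that
    # hold the key can compute the URL.
    def push_for_roles(base_key, folder, data, roles)
      roles.each do |role|
        role_hash = Digest::SHA1.hexdigest(base_key + ROLE_KEYS[role])
        AWS::S3::S3Object.store("cache/#{folder}/#{role_hash}.xml", data,
                                'cache.codingthriller.com',
                                :access       => :public_read,
                                :content_type => 'application/xml')
      end
    end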

So, our metadata now includes three pieces of information*:
  • The expiration time
  • The SHA hash key for the data
  • The names of the roles which can see the data
And, when the client starts up, it receives (over HTTPS) all the security roles the user is a member of and their corresponding keys. Once the client sees a role it recognizes in the metadata, it can salt the SHA hash key with that role key and re-SHA it, and it is guaranteed to find the data at the resulting key. It's also important that security role keys are regularly expired.
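
On the client, resolving the location is the mirror image of the push. Roughly (again sketched in Ruby rather than the RIA's own language, with the same assumed .cinf fields):

    require 'digest/sha1'

    # meta comes from the .cinf; my_role_keys is the role => key map handed to the
    # client over HTTPS at startup.
    def resolve_secure_location(meta, my_role_keys)
      role = (meta['roles'] & my_role_keys.keys).first
      return nil unless role    # no role in common: fall back to the real server
      role_hash = Digest::SHA1.hexdigest(meta['key'] + my_role_keys[role])
      "http://cache.codingthriller.com/cache/#{meta['folder']}/#{role_hash}.xml"
    end

    resolve_secure_location(
      { 'roles' => ['admin'], 'key' => 'ba48f927', 'folder' => '07-02-2008' },
      { 'admin' => 'abc123' }
    )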

I'm no cryptologist, so I'd love to hear any feedback on how this technique can be exploited!

Conclusion

First we started with a simple S3 cache that pushed our RESTful service data out to S3 in the background. The client was updated to check the cache first, and then fall back on the server. Then, for invalidation, we introduced a metadata .cinf resource that the client checks first to ensure the data is not stale (and, importantly, tells the server when it sees expired data). Finally, by storing the data at a salted hash referred to in the .cinf file, and re-salting with security keys, we introduced role-based security to our S3 cache that made it possible to cache privileged resources.

In the end, once implemented, this caching technique can be almost entirely transparent. In my controllers, I simply pepper my methods with the correct invalidation calls, and can mark certain actions as being S3 cacheable. The back-end implementation and client take care of the rest. It's always important to integration test things like this to make sure your invalidation calls actually work!

I have yet to scale up with this solution, but my initial tests show that many, many RESTful service calls for my own application will be routed to S3 instead of my EC2 instances, a big win!

*I'd imagine there are slightly better ways to implement the cache metadata using HTTP headers -- unfortunately I cannot access HTTP headers in OpenLaszlo, so I went with a full HTTP body based approach here.
