Coding Thriller: December 2007

One of the best ways to speed up a web application is to introduce some form of caching. Fragment caching on app servers, object caching using memcached, and asset caching on S3 are all good tricks to take some of the load off your servers. In this post I'm going to talk about how to cache web services on amazon S3 dynamically, creating an intermediate cache between your web browser and the app servers.

Client-First Development: Revisited

The main point of the previous post was to show how you can do Behavior Driven Development with a RIA (in AJAX, Flex, or OpenLaszlo) by stubbing out a REST service on your filesystem. This works great with Rails, because your files can be served as static fixtures you can quickly mold your REST service's resources to meet the demands for your client. You can write the client before the server, which results in less programming since you "get it right" on the server on the first try.

Well, there's another little trick we can do once you've seen this in action.

S3 as a RESTful Cache

Amazon S3 allows you to throw files up into "buckets" (using a RESTful service, no less) which can then be retrieved at a URL. A file named "houses/123/location.xml" in the bucket "cache.codingthriller.com" can be retrieved at http://cache.codingthriller.com/houses/123/location.xml, if you just configure your DNS to point cache.codingthriller.com to s3.amazonaws.com.

Are you thinking what I'm thinking? That's right, we can "stub" out our RESTful service on S3 just like we did on the file system, since the pattern is exactly the same. The difference is this time instead of it being fixtures for testing, S3 will actually store a cached version of our data. This can be a *huge* win, since all requests to S3 will skip over our server farm altogether, going to the "infinitely" scalable Amazon data center!

How the RESTful Cache Works

Having a rich client in AJAX or Flash is key for this to work smoothly. If you did client-first development, you already have code in place that makes it possible for you to "point" your client to a REST service. When building it, you pointed it to the static fixture files. When the server is running, you point it to the server to get real, dynamic data.

Since you've abstracted it, it's not too much of a leap to just point it to your S3 cache. If you designed your resources correctly, there shouldn't be any major problems with getting the same data from S3 vs. the application servers.

So, for every request to the REST service in your RIA, you should break it up into two steps. Say we are looking for "/houses/123/location.xml":

Check http://cache.codingthriller.com/houses/123/location.xml. If we get it, great.
If we 404, go to http://codingthriller.com/houses/123/location.xml. (This is our real server.)

That first request is generally pretty snappy, since it's hitting Amazon, and if we managed to get that data, we've avoided a full request to our Rails app. Money!

Pushing it out

Of course, this seems great and all, but how does the data get to S3? It's probably not the best idea to block the request until the data is pushed, so I've built an off-line service that goes through and pushes the data using a simple queue.

I created a model class called S3Push that stores the uri and the data. When a request comes in that I want to cache, I have an after_filter that pulls the body from the response and stores it into the database. Then I have a simple service that just goes through these rows and pushes them over to S3 into the right bucket.

There's a lot of details involved, but the main point is that once a request comes in, it will eventually propagate to a identical URL on S3!

Invalidation

Of course, it's important to invalidate this cache -- you don't want to serve up stale data. There are two types of invalidation: direct invalidation and timeout invalidation.

On the server side, if a request is made that affects resources in the S3 cache, you can just submit a DELETE request to S3 to remove the data. For example, if the location of a house changes via an HTTP PUT to the resource, you just can DELETE the resource from S3. Once another request comes in for that resource, it will enqueue it to be re-cached.

Timeout invalidation is a bit trickier.

Timeout Invalidation

If you are able to always invalidate the cache correctly, then you are done. However, sometimes you can't always be sure when something is invalidated. (For example, data you pulled from an external resource may be changed without you being notified.)

One way of taking care of this problem is to have the server periodically remove data from S3 that it thinks might be stale, using a cron job or background service. This is a perfectly legitimate way to do things.

However, this could introduce a 'single point of failure' in your web server farm. If you have one machine periodically cleaning out the S3 cache, if it dies, stale data could be "stuck" indefinitely. This is a different problem than having your "push to S3" service die, since in that case you simply lose the performance benefits of the cache. Painful for your servers, yes, but probably not a show-stopper.

Client-Based Invalidation

So, the approach I took was a client-centric one. While the server still has the final say when something is DELETE'd from S3, I try to take advantage of the rich clients I have running within all my users' browsers.

For this to work, the algorithm has to change a bit. (This change becomes useful later, when we introduce security!) For resource "/houses/123/location.xml", we now:

Check for "http://cache.codingthriller.com/houses/123/location.xml.cinf"

This is our Cache Metadata, stored on S3

If we found it, check the metadata for expiration:

If the cache entry has expired

Send DELETE => http://www.codingthriller.com/cinf/houses/123/location.xml.cinf

Causes a DELETE from S3 if it's really expired!

Do GET => http://www.codingthriller.com/houses/123/location.xml

Causes a push to the cache!

If not, look in the metadata for the URI of the cached data, (say, ba48f927.xml)

Do GET => http://cache.codingthriller.com/cache/07-02-2008/ba48f927.xml

Is a cached hit! We never talked to our real servers.

If there is no cached metadata:

Do GET => http://www.codingthriller.com/houses/123/location.xml

Causes a push to the cache!

So, the algorithm has gotten a bit more complex on the client side. We now have an intermediate "cinf file" that stores cache metadata: the time the data expires, as well as a SHA hash key.

If the cache has expired we submit a DELETE to the real server under the /cinf resource, which will then perform an S3 DELETE if the item is truly expired. Note that now, we can invalidate the cached item just by DELETE'ing the .cinf from S3, since clients will cause a re-push if there is no metadata. We then do the GET to the real server. If it didn't expire we use the SHA hash key and go to a URI under /cache to grab the data at that key.

So in the end, you implement this increased complexity on the client side and also need to add a cinf_controller that will delete the item from S3. What's nice is in AJAX or Flash you can perform the DELETE asynchronously, so your client does not feel any of the pain of waiting for the server to do the DELETE on S3.

Also, the "push to S3" server needs to be updated to generate the .cinf file and push it to S3 in addition to pushing the data. In that .cinf file you just include a unique hash and a timestamp. It can be useful to store the cached data in folders named by day, so you can quickly wipe out old data, as well. (As seen above in /cache/07-02-2008/ba48f927.xml)

Secure Resources

It's often the case that certain resource are only accessible by certain users. Your controllers might return a 'Access Denied' response, for example, for "/houses/123/location.xml" unless the requestor has logged in as the owner of the house. We can keep this security enforcement in our S3 cache, as well.

As noted above, the cached data now resides in a SHA hash keyed resource. This SHA hash can effectively serve as a "password" for the resource. A decent way to generate this hash is to salt the URI of the resource being cached. So, we'd hash "/houses/123/location.xml" + some random string, and that is where the data would get stored. This hash is included in the .cinf metadata file, so the client knows where to go get it.

But, we can do something better. If we split data accessibility into "security roles", and assign a unique key to each role, we can secure these cached resources. When the client starts up, you pass in information for the security roles of the logged in user. For example, an administrator role might have the key "abc123", which will be given to all clients logged in as administrators. It's important that these keys be transmitted over https, and not be persisted in cookies!

Now, when it comes time to push the data to S3, instead of pushing it to the hash, we push multiple copies of it, one for each security role. For example, if administrators can see this resource, we take the original hash we were going to store the data at, and salt it with the key for the administrator role. It now becomes impossible for a client which does not know this key to find the data on S3!

So, our metadata now includes three pieces of information*:

The expiration time
The SHA hash key for the data
The names of the roles which can see the data

And, when the client starts up, it receives (over HTTPS) all the security roles the user is a member of and their corresponding keys. Once the client sees a role it recognized in the metadata, it can then salt the SHA hash key and re-SHA it with that role key, and it is guaranteed to find the data at that key. It's also important that security role keys are regularly expired.

I'm no cryptologist, so I'd love to hear any feedback on how this technique can be exploited!

Conclusion

First we started with a simple S3 cache that pushed our RESTful service data out to S3 in the background. The client was updated to check the cache first, and then fall back on the server. Then, for invalidation, we introduced a metadata .cinf resource that the client checks first to ensure the data is not stale (and, importantly, tells the server when it sees expired data.) Finally, by storing the data at a salted hash referred in the .cinf file, and re-salting with security keys, we introduced role based security to our S3 cache that made it possible to cache privileged resources.

In the end, once implemented this caching technique can be almost entirely transparent. In my controllers, I simply pepper my methods with the correct invalidation calls, and can mark certain actions as being S3 cacheable. The back-end implementation and client take care of the rest. It's always important to integration test things like this to make sure your invalidation calls actually work!

I have yet to scale up with this solution, but my initial tests show that many, many RESTful service calls for my own application will be routed to S3 instead of my EC2 instances, a big win!

*I'd imagine there are slightly better ways to implement the cache metadata using HTTP headers -- unfortunately I cannot access HTTP headers in OpenLaszlo, so I went with a full HTTP body based approach here.

AJAX = Client-Server
AJAX is all the rage these days, but web applications written in AJAX are really nothing more than a version of the client/server pattern. Unlike traditional ("Web 1.0") webapp development, developers now have to be more conscious about the distinct roles of the AJAX client and the HTTP web server. Web servers now often take on the role of just serving static content and raw data for the rich client to push onto the user interface.

So, developers constantly switch back and forth between client and server code. These two separate codebases are often written in different languages and have their own engineering constraints. Both require extensive testing, documentation, and refactoring.

YAGNI
A core principle of agile development is that "You aren't Gonna Need It" (YAGNI). That is, writing software that has no immediate purpose but "may" be useful soon is not worth writing. Code should only be written for a clear and imminent need.

This is a powerful principal, and can be applied to many aspects of development beyond coding. Instead of switching tools or platforms, for example, it's important to ask yourself if there is a clear immediate need!

YAGNI and Mock Objects
A increasingly popular technique for doing Test-Driven Development involves Mock Objects. Summarized, by developing software from the "top-down", mocking out the innards of yet-to-be-written classes, you often will write working code with less bugs. Really, this is yet another consequence of remembering YAGNI! By mocking out dependent parts of the system that have not yet been implemented, you end up with just what you *actually* need instead of extra things you *think* you may have needed.

This is a really condensed explanation, I'd urge you to read more about this via the Endo-Testing Paper and the BDD wiki.

YAGNI meets AJAX
When you apply the same thinking of YAGNI and Mock Objects to the client-server model of AJAX, you realize there's a very big thing that might become a Mock Object: the HTTP server!

"You're not going to need the server?" you say? Well, that's not totally true, but we shouldn't be forced to write nearly as much server code as we do.

An AJAX (or Flex, Silverlight, or OpenLaszlo) app needs to have a server running behind the scenes to pull data from. This can muck up iterative development; for large features we have to work from the bottom-up: modify the database schema, update the model tier, update the controller tier, update the views. This is a lot of time spent just to get to the point where we can try to "see" the new feature in the client.

Much like Mock Object testing lets us mock away objects that don't yet exist, what if we could mock the non-existent server while we write the client? We could then iteratively build and test client features without having to dive into server code. This is the same motivation for Mock Objects, and follows from YAGNI -- why bother writing code for the server before we are sure we need it?

Making a Mock of the Server
So, how do you mock a web server? You could, for example, write a little ruby program that responds to HTTP requests with fixture data. But, how do you translate the calls in your client to go to the fixture data? What if there are complex query parameters? Complexity creeps in; building a mock web server at first glance seems to be not worth the effort.

Well, we've got one more trick up our sleeve: REST.

What is REST?
REST is a style of building web services. I'm not going to explain it in depth here (read this book!) but I will mention what aspects of it are important to understand here.

The first element of a RESTful service that is important is that URLs are truly stateless. The service can not rely upon cookies, server state, or even authentication information to determine the content of a URL. Each URL corresponds to one or more representations of a resource. A resource could be "the list of names invited to a party" and the representation could be "the XML format needed for import into Outlook."

In REST, no matter who looks at the URL or what they loaded beforehand, the data at a URL doesn't change until the resource does. For example, /parties/123/invites.xml would have the same content regardless of who or when it is accessed, until the invitations themselves change.

Query parameter variables become rare as you build a RESTful service. URLs become richer and structured around the resources your service is responsible for.

Secondly, REST limits the number of operations you can perform on a resource to GET/PUT/POST/DELETE (and occasionally HEAD.) Instead of introducing custom operations on a resource beyond these, the service should instead evolve to expose new resources. For example, if we want a way to invite a user, a non-RESTFUL design would have a HTTP POST include a variable with "command=invite". A RESTful approach would expose an "invitations" resource, which someone could POST to in order to create an invitation for a user.

There are lots of reasons you should design RESTfully, but I am going to now finally explain how it can help building client-server web applications.

Client-First Development
Now for the main observation of this post:

If you commit to designing a web service RESTfully, you can stub out the service directly on the file system. Your client can then access these static files exactly the same as the real service, until it is built.

Web servers, out of the box, will expose the file system as RESTful URLs. Now, ignoring POST/PUT/DELETE, it should be clear that you can stub out data for a truly RESTful service directly on the file system.

When you put this all together, you get what I am calling Client-First Development. By committing to REST, and building a RIA, your development process can change into the following cycle:

Decide upon a new feature/bug fix.
Add or update fixture files for the REST service on disk.
Update and unit test the client to use the new features of the REST service.
Iterate, refining fixture data (the mocked representation and resources) and the client.
Once happy with the client, write the server code & unit tests.
Write an integration test with the working server and client.

For most servers you won't need to do anything to switch from static files to dynamic data. In Rails, for example, if a controller method does not exist, it will fall back on the static file system. Once you implement the controller, it will override your static files.

In practice, it's helpful to have two local hostnames or a subdirectory called "static" that will always serve up the static data so you can freely use the client with both. When the client starts up it simply needs to be told where to get the data, the URL might point to a 'real' web service or simply the root of a static directory on the server.

Tips and Tricks

I've found this to be an incredibly useful way to develop RIAs in a YAGNI fashion. By mocking out parts of my RESTful service on the file system as I build the client I am able to prototype the client without writing server code. I write the server code once I am happy with my client code and fixture data's representation of my resources.

One final question is how do you test non-GET operations using this technique? I simply log these operations for manual review for correctness. (Don't forget, integration testing is still necessary.) The side effects of a POST/PUT/DELETE are determined by the server, so the client can usually be built and tested without relying upon them. This works for me, but YMMV.

In conclusion, I think this synergy between REST and RIA is one that can greatly decrease the amount of cruft added to web services, by forcing you to build only what you need: if the client doesn't need it, then YAGNI!

Coding Thriller

Friday, December 21, 2007

Caching RESTful Services on S3

Thursday, December 6, 2007

Client-First Development with REST

About Me