Friday, December 21, 2007

Caching RESTful Services on S3

One of the best ways to speed up a web application is to introduce some form of caching. Fragment caching on app servers, object caching using memcached, and asset caching on S3 are all good tricks to take some of the load off your servers. In this post I'm going to talk about how to cache web services on Amazon S3 dynamically, creating an intermediate cache between your web browser and the app servers.

Client-First Development: Revisited

The main point of the previous post was to show how you can do Behavior Driven Development with a RIA (in AJAX, Flex, or OpenLaszlo) by stubbing out a REST service on your filesystem. This works great with Rails: because your files can be served as static fixtures, you can quickly mold your REST service's resources to meet the demands of your client. You can write the client before the server, which results in less programming since you "get it right" on the server on the first try.

Well, there's another little trick we can do once you've seen this in action.

S3 as a RESTful Cache

Amazon S3 allows you to throw files up into "buckets" (using a RESTful service, no less) which can then be retrieved at a URL. A file named "houses/123/location.xml" in the bucket "cache.codingthriller.com" can be retrieved at http://cache.codingthriller.com/houses/123/location.xml, if you just add a DNS CNAME record pointing cache.codingthriller.com at s3.amazonaws.com.

Are you thinking what I'm thinking? That's right, we can "stub" out our RESTful service on S3 just like we did on the file system, since the pattern is exactly the same. The difference is this time instead of it being fixtures for testing, S3 will actually store a cached version of our data. This can be a *huge* win, since all requests to S3 will skip over our server farm altogether, going to the "infinitely" scalable Amazon data center!

How the RESTful Cache Works

Having a rich client in AJAX or Flash is key for this to work smoothly. If you did client-first development, you already have code in place that makes it possible for you to "point" your client to a REST service. When building it, you pointed it to the static fixture files. When the server is running, you point it to the server to get real, dynamic data.

Since you've abstracted it, it's not too much of a leap to just point it to your S3 cache. If you designed your resources correctly, there shouldn't be any major problems with getting the same data from S3 vs. the application servers.

So, for every request to the REST service in your RIA, you should break it up into two steps. Say we are looking for "/houses/123/location.xml":
  • Check http://cache.codingthriller.com/houses/123/location.xml. If we get it, great.
  • If we 404, go to http://codingthriller.com/houses/123/location.xml. (This is our real server.)
That first request is generally pretty snappy, since it's hitting Amazon, and if we managed to get that data, we've avoided a full request to our Rails app. Money!
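
To make that concrete, here's a minimal sketch of the lookup in Ruby (the real client would do this in AJAX or Flash; fetch_with_s3_cache is just an illustrative name):

require 'net/http'
require 'uri'

def fetch_with_s3_cache(path)
  cached = Net::HTTP.get_response(URI.parse("http://cache.codingthriller.com#{path}"))
  return cached.body if cached.is_a?(Net::HTTPSuccess)

  # Cache miss (404 or anything else): fall back to the real app servers.
  Net::HTTP.get_response(URI.parse("http://codingthriller.com#{path}")).body
end

location_xml = fetch_with_s3_cache("/houses/123/location.xml")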

Pushing it out

Of course, this seems great and all, but how does the data get to S3? It's probably not the best idea to block the request until the data is pushed, so I've built an off-line service that goes through and pushes the data using a simple queue.

I created a model class called S3Push that stores the URI and the data. When a request comes in that I want to cache, I have an after_filter that pulls the body from the response and stores it in the database. Then I have a simple service that just goes through these rows and pushes them over to S3 into the right bucket.
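
Here's a rough sketch of those pieces. The column names, the controller hook, and the aws-s3 gem calls are my assumptions about the shape of it, not the exact code:

class S3Push < ActiveRecord::Base
  # columns: uri (string), data (text)
end

class HousesController < ApplicationController
  after_filter :enqueue_s3_push, :only => [:show]

  private

  def enqueue_s3_push
    S3Push.create!(:uri => request.request_uri, :data => response.body)
  end
end

# The off-line service drains the queue into the bucket.
require 'aws/s3'
AWS::S3::Base.establish_connection!(
  :access_key_id     => ENV['AMAZON_ACCESS_KEY_ID'],
  :secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY'])

S3Push.find(:all).each do |push|
  # S3 keys don't start with a slash, so strip it from the URI.
  AWS::S3::S3Object.store(push.uri.sub(%r{^/}, ''), push.data,
                          'cache.codingthriller.com')
  push.destroy
end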

There are a lot of details involved, but the main point is that once a request comes in, it will eventually propagate to an identical URL on S3!

Invalidation

Of course, it's important to invalidate this cache -- you don't want to serve up stale data. There are two types of invalidation: direct invalidation and timeout invalidation.

On the server side, if a request is made that affects resources in the S3 cache, you can just submit a DELETE request to S3 to remove the data. For example, if the location of a house changes via an HTTP PUT to the resource, you can just DELETE the resource from S3. Once another request comes in for that resource, it will enqueue it to be re-cached.
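
In code, that's a one-liner wherever the resource changes (again assuming the aws-s3 gem and my own controller naming):

class HousesController < ApplicationController
  def update
    house = House.find(params[:id])
    house.update_attributes!(params[:house])
    # Evict the cached representation; the next GET re-enqueues a push.
    AWS::S3::S3Object.delete("houses/#{house.id}/location.xml",
                             'cache.codingthriller.com')
    head :ok
  end
end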

Timeout invalidation is a bit trickier.

Timeout Invalidation

If you are able to always invalidate the cache correctly, then you are done. However, you can't always be sure when something has been invalidated. (For example, data you pulled from an external source may change without you being notified.)

One way of taking care of this problem is to have the server periodically remove data from S3 that it thinks might be stale, using a cron job or background service. This is a perfectly legitimate way to do things.

However, this could introduce a 'single point of failure' in your web server farm. If the one machine that periodically cleans out the S3 cache dies, stale data could be "stuck" indefinitely. This is a different problem than having your "push to S3" service die, since in that case you simply lose the performance benefits of the cache. Painful for your servers, yes, but probably not a show-stopper.

Client-Based Invalidation

So, the approach I took was a client-centric one. While the server still has the final say over when something is DELETE'd from S3, I try to take advantage of the rich clients I have running within all my users' browsers.

For this to work, the algorithm has to change a bit. (This change becomes useful later, when we introduce security!) For resource "/houses/123/location.xml", we now:
  • Check for "http://cache.codingthriller.com/houses/123/location.xml.cinf"
    • This is our Cache Metadata, stored on S3
  • If we found it, check the metadata for expiration:
    • If the cache entry has expired
      • Send DELETE => http://www.codingthriller.com/cinf/houses/123/location.xml.cinf
        • Causes a DELETE from S3 if it's really expired!
      • Do GET => http://www.codingthriller.com/houses/123/location.xml
        • Causes a push to the cache!
    • If not, look in the metadata for the URI of the cached data (say, ba48f927.xml)
      • Do GET => http://cache.codingthriller.com/cache/07-02-2008/ba48f927.xml
        • Is a cached hit! We never talked to our real servers.
  • If there is no cached metadata:
    • Do GET => http://www.codingthriller.com/houses/123/location.xml
      • Causes a push to the cache!
So, the algorithm has gotten a bit more complex on the client side. We now have an intermediate "cinf file" that stores cache metadata: the time the data expires, as well as a SHA hash key.

If the cache has expired, we submit a DELETE to the real server under the /cinf resource, which will then perform an S3 DELETE if the item is truly expired. Note that we can now invalidate any cached item just by DELETE'ing its .cinf from S3, since clients will cause a re-push if there is no metadata. We then do the GET to the real server. If the cache hasn't expired, we use the SHA hash key and go to a URI under /cache to grab the data at that key.
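
Here's the whole client algorithm as a Ruby sketch (the real client is AJAX/Flash/OpenLaszlo, and the "expires_at,data_uri" .cinf body format is my invention; the post doesn't pin one down):

require 'net/http'
require 'uri'

CACHE = 'http://cache.codingthriller.com'
REAL  = 'http://www.codingthriller.com'

def get(url)
  Net::HTTP.get_response(URI.parse(url))
end

def fetch(path)
  cinf = get("#{CACHE}#{path}.cinf")
  return get("#{REAL}#{path}").body unless cinf.is_a?(Net::HTTPSuccess)

  expires_at, data_uri = cinf.body.split(',')
  if Time.now.to_i > expires_at.to_i
    # Expired: tell the real server, which DELETEs from S3 if it agrees...
    Net::HTTP.new('www.codingthriller.com').delete("/cinf#{path}.cinf")
    # ...then GET the real server, which re-enqueues a push.
    get("#{REAL}#{path}").body
  else
    # Cache hit: we never talked to our real servers.
    get("#{CACHE}#{data_uri}").body
  end
end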

So in the end, you implement this increased complexity on the client side and also need to add a cinf_controller that will delete the item from S3. What's nice is that in AJAX or Flash you can perform the DELETE asynchronously, so your client does not feel any of the pain of waiting for the server to do the DELETE on S3.

Also, the "push to S3" server needs to be updated to generate the .cinf file and push it to S3 in addition to pushing the data. In that .cinf file you just include a unique hash and a timestamp. It can be useful to store the cached data in folders named by day, so you can quickly wipe out old data, as well. (As seen above in /cache/07-02-2008/ba48f927.xml)

Secure Resources

It's often the case that certain resources are only accessible by certain users. Your controllers might return an 'Access Denied' response, for example, for "/houses/123/location.xml" unless the requestor has logged in as the owner of the house. We can keep this security enforcement in our S3 cache, as well.

As noted above, the cached data now resides in a SHA hash keyed resource. This SHA hash can effectively serve as a "password" for the resource. A decent way to generate this hash is to salt the URI of the resource being cached. So, we'd hash "/houses/123/location.xml" + some random string, and that is where the data would get stored. This hash is included in the .cinf metadata file, so the client knows where to go get it.

But, we can do something better. If we split data accessibility into "security roles", and assign a unique key to each role, we can secure these cached resources. When the client starts up, you pass in information for the security roles of the logged-in user. For example, an administrator role might have the key "abc123", which will be given to all clients logged in as administrators. It's important that these keys be transmitted over HTTPS, and not be persisted in cookies!

Now, when it comes time to push the data to S3, instead of pushing it to the hash, we push multiple copies of it, one for each security role. For example, if administrators can see this resource, we take the original hash we were going to store the data at, and salt it with the key for the administrator role. It now becomes impossible for a client which does not know this key to find the data on S3!

So, our metadata now includes three pieces of information*:
  • The expiration time
  • The SHA hash key for the data
  • The names of the roles which can see the data
And, when the client starts up, it receives (over HTTPS) all the security roles the user is a member of and their corresponding keys. Once the client sees a role it recognizes in the metadata, it can salt the SHA hash key with that role key and re-SHA it, and it is guaranteed to find the data at the resulting key. It's also important that security role keys are regularly expired.
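
A sketch of the derivation on both sides (the role names, the keys, and the server salt are all made up here):

require 'digest/sha1'

base_sha = Digest::SHA1.hexdigest('some-long-random-string' + '/houses/123/location.xml')
data     = '<location>...</location>'

role_keys = { 'administrator' => 'abc123', 'owner' => '9f2d11' }

# Server side: one copy of the data per role that may see the resource.
role_keys.each do |role, role_key|
  sha_for_role = Digest::SHA1.hexdigest(base_sha + role_key)
  # push `data` to "cache/07-02-2008/#{sha_for_role}.xml" on S3, as before
end

# Client side: the .cinf carries base_sha and the role *names* only. A
# client holding the 'administrator' key re-derives the same location;
# without the key, the URL is unguessable.
sha_for_admin = Digest::SHA1.hexdigest(base_sha + role_keys['administrator'])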

I'm no cryptologist, so I'd love to hear any feedback on how this technique can be exploited!

Conclusion

First we started with a simple S3 cache that pushed our RESTful service data out to S3 in the background. The client was updated to check the cache first, and then fall back on the server. Then, for invalidation, we introduced a metadata .cinf resource that the client checks first to ensure the data is not stale (and, importantly, tells the server when it sees expired data). Finally, by storing the data at a salted hash referenced in the .cinf file, and re-salting with security keys, we introduced role-based security to our S3 cache that made it possible to cache privileged resources.

In the end, once implemented, this caching technique can be almost entirely transparent. In my controllers, I simply pepper my methods with the correct invalidation calls, and can mark certain actions as being S3 cacheable. The back-end implementation and client take care of the rest. It's always important to integration test things like this to make sure your invalidation calls actually work!

I have yet to scale up with this solution, but my initial tests show that many, many RESTful service calls for my own application will be routed to S3 instead of my EC2 instances, a big win!

*I'd imagine there are slightly better ways to implement the cache metadata using HTTP headers -- unfortunately I cannot access HTTP headers in OpenLaszlo, so I went with a full HTTP body based approach here.

Thursday, December 6, 2007

Client-First Development with REST

AJAX = Client-Server
AJAX is all the rage these days, but web applications written in AJAX are really nothing more than a version of the client/server pattern. Unlike traditional ("Web 1.0") webapp development, developers now have to be more conscious about the distinct roles of the AJAX client and the HTTP web server. Web servers now often take on the role of just serving static content and raw data for the rich client to push onto the user interface.

So, developers constantly switch back and forth between client and server code. These two separate codebases are often written in different languages and have their own engineering constraints. Both require extensive testing, documentation, and refactoring.

YAGNI
A core principle of agile development is "You Aren't Gonna Need It" (YAGNI). That is, software that has no immediate purpose but "may" be useful soon is not worth writing. Code should only be written for a clear and imminent need.

This is a powerful principle, and it can be applied to many aspects of development beyond coding. Instead of switching tools or platforms, for example, it's important to ask yourself if there is a clear immediate need!

YAGNI and Mock Objects
An increasingly popular technique for doing Test-Driven Development involves Mock Objects. In short: by developing software from the "top down", mocking out the innards of yet-to-be-written classes, you will often write working code with fewer bugs. Really, this is yet another consequence of remembering YAGNI! By mocking out dependent parts of the system that have not yet been implemented, you end up with just what you *actually* need instead of extra things you *think* you may have needed.

This is a really condensed explanation; I'd urge you to read more about this via the Endo-Testing paper and the BDD wiki.

YAGNI meets AJAX
When you apply the same thinking of YAGNI and Mock Objects to the client-server model of AJAX, you realize there's a very big thing that might become a Mock Object: the HTTP server!

"You're not going to need the server?" you say? Well, that's not totally true, but we shouldn't be forced to write nearly as much server code as we do.

An AJAX (or Flex, Silverlight, or OpenLaszlo) app needs to have a server running behind the scenes to pull data from. This can muck up iterative development; for large features we have to work from the bottom-up: modify the database schema, update the model tier, update the controller tier, update the views. This is a lot of time spent just to get to the point where we can try to "see" the new feature in the client.

Much like Mock Object testing lets us mock away objects that don't yet exist, what if we could mock the non-existent server while we write the client? We could then iteratively build and test client features without having to dive into server code. This is the same motivation for Mock Objects, and follows from YAGNI -- why bother writing code for the server before we are sure we need it?

Making a Mock of the Server
So, how do you mock a web server? You could, for example, write a little Ruby program that responds to HTTP requests with fixture data. But how do you translate the calls in your client to go to the fixture data? What if there are complex query parameters? Complexity creeps in; at first glance, building a mock web server seems to be not worth the effort.

Well, we've got one more trick up our sleeve: REST.

What is REST?
REST is a style of building web services. I'm not going to explain it in depth (read this book!) but I will mention the aspects of it that are important to understand here.

The first element of a RESTful service that is important is that URLs are truly stateless. The service cannot rely upon cookies, server state, or even authentication information to determine the content of a URL. Each URL corresponds to one or more representations of a resource. A resource could be "the list of names invited to a party" and the representation could be "the XML format needed for import into Outlook."

In REST, no matter who looks at the URL or what they loaded beforehand, the data at a URL doesn't change until the resource does. For example, /parties/123/invites.xml would have the same content regardless of who accesses it or when, until the invitations themselves change.

Query parameter variables become rare as you build a RESTful service. URLs become richer and structured around the resources your service is responsible for.

Secondly, REST limits the number of operations you can perform on a resource to GET/PUT/POST/DELETE (and occasionally HEAD). Instead of introducing custom operations on a resource beyond these, the service should evolve to expose new resources. For example, if we want a way to invite a user, a non-RESTful design would have an HTTP POST include a variable with "command=invite". A RESTful approach would expose an "invitations" resource, which someone could POST to in order to create an invitation for a user.
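
In Rails terms, the RESTful version looks something like this (the route and model names are mine, purely for illustration):

# config/routes.rb
map.resources :parties do |party|
  party.resources :invitations
end

# app/controllers/invitations_controller.rb
class InvitationsController < ApplicationController
  # POST /parties/123/invitations creates an invitation for a user
  def create
    party = Party.find(params[:party_id])
    party.invitations.create!(:user_id => params[:user_id])
    head :created
  end
end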

There are lots of reasons you should design RESTfully, but I am now finally going to explain how it can help in building client-server web applications.

Client-First Development
Now for the main observation of this post:

If you commit to designing a web service RESTfully, you can stub out the service directly on the file system. Your client can then access these static files exactly the same as the real service, until it is built.

Web servers, out of the box, will expose the file system as RESTful URLs. Now, ignoring POST/PUT/DELETE, it should be clear that you can stub out data for a truly RESTful service directly on the file system.

When you put this all together, you get what I am calling Client-First Development. By committing to REST, and building a RIA, your development process can change into the following cycle:
  • Decide upon a new feature/bug fix.
  • Add or update fixture files for the REST service on disk.
  • Update and unit test the client to use the new features of the REST service.
  • Iterate, refining fixture data (the mocked representation and resources) and the client.
  • Once happy with the client, write the server code & unit tests.
  • Write an integration test with the working server and client.
For most servers you won't need to do anything to switch from static files to dynamic data. In Rails, for example, if a controller method does not exist, it will fall back on the static file system. Once you implement the controller, it will override your static files.

In practice, it's helpful to have two local hostnames, or a subdirectory called "static" that will always serve up the static data, so you can freely use the client with both. When the client starts up it simply needs to be told where to get the data; the URL might point to a 'real' web service or simply to the root of a static directory on the server.
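
That pointing logic can be as simple as a configurable service root. A hypothetical sketch in Ruby (the real client would be the RIA, but the shape is the same):

require 'net/http'
require 'uri'

class RestService
  def initialize(base_url)
    @base_url = base_url
  end

  def get(path)
    Net::HTTP.get(URI.parse(@base_url + path))
  end
end

service = RestService.new('http://localhost:3000/static')  # fixture files
# service = RestService.new('http://localhost:3000')       # live service
invites_xml = service.get('/parties/123/invites.xml')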

Tips and Tricks

I've found this to be an incredibly useful way to develop RIAs in a YAGNI fashion. By mocking out parts of my RESTful service on the file system as I build the client I am able to prototype the client without writing server code. I write the server code once I am happy with my client code and fixture data's representation of my resources.

One final question: how do you test non-GET operations using this technique? I simply log these operations for manual review for correctness. (Don't forget, integration testing is still necessary.) The side effects of a POST/PUT/DELETE are determined by the server, so the client can usually be built and tested without relying upon them. This works for me, but YMMV.

In conclusion, I think this synergy between REST and RIA is one that can greatly decrease the amount of cruft added to web services, by forcing you to build only what you need: if the client doesn't need it, then YAGNI!

Sunday, November 18, 2007

Rediscovering Hungarian Notation

Hungarian notation is probably the most undervalued, misunderstood concept in modern programming history. It's time to forget everything you know about Hungarian and seriously consider using it in your Ruby code (or any dynamic language, for that matter.)

When you write code, you're systematically writing a structured representation of an executable process. Loops, methods, classes: all these things have a relatively strict syntax that you must fit your problem into. By forcing you into a firm set of syntax and semantics (namely, a programming language), you are able to communicate unambiguously with three parties: the computer, other programmers, and even yourself, a few months from now.

In Rails, there is much talk of "convention over configuration." All this is saying is that having assumed syntax and semantics is more desirable than having to be explicit in describing these semantics each time you use them. For example, table names are pluralized versions of model class names. This translates onto programming languages themselves: what if you had to describe to the Ruby interpreter what a class "was" every time you declared one? It's in fact much nicer to simply assume the "convention" of having the keyword 'class' come packaged with a set of syntax and semantics. Everyone runs with this convention and everyone wins.

However, in nearly all languages there is one element of their syntax and semantics that falls short of providing you with the power of structure and "convention": naming. Nearly everyone has an opinion on naming. This includes naming local variables, instance variables, methods, classes, files, tables, and so on.

Why Human Readable Names Are Evil

The common viewpoint on naming, largely inspired, in my opinion, by a backlash against Hungarian, is that names should be "human readable." For example, you might have a class called Book. A book has a collection of Pages. And so on.

There are many, many problems with this approach. First and foremost, having these nice friendly names leads to one of the biggest poisons in programming: unfounded assumptions. How many times have you spent hours fighting a bug, finally squashing it by realizing one of your many assumptions turned out to be blatantly false? These can range anywhere from the low level ("x is never null") to the ridiculously and sometimes shockingly counterintuitive ("it doesn't matter if the laptop is on AC power or battery power.")

A class name like "Book" plays to our many experiences with anything called "Book" -- from the real tangible thing you find in the library to other abstractions in code for Books. Most of the assumptions you'll have about class Book will be wrong, and the few things you might get right (for example, this class Book might have a collection of Pages) almost never outweigh the confusion, bugs, and mental pain you receive from your many more false assumptions.

Never mind talking out loud about these things. If you have a class named "Email", I dare you to try to get into a conversation with 5 other software engineers about the project that isn't riddled with qualifiers to say which type of Email you're talking about. (The "Email" class, an "Email" in your Outlook, an "Email" sent to the server, an "Email" row in the database, an "Email" address, etc.)

The sum of all these assumptions from 'human readable' names is that programmers grossly overestimate their understanding of bodies of code. You check out the trunk of an SVN project from Rubyforge, skim through it with all its nice names, and somehow you feel like you understand it. You know why? Not because you understand it, but because it was 'human readable.' Your familiarity with the names used led your mind to peaceful tranquility. I dare you to try to fix a bug or add a feature; you will find yourself in a quagmire of undoing all the assumptions you had about what the code was doing in the first place.

There is another problem with "human readable" names beyond injecting false assumptions into your brain. You run out of steam quickly when you try to tack more and more English names onto programming abstractions. You might soon find that you have 12 different types of books, each of which can be in 3 different types of lists, some of which are in memory and some on disk. What is the end result of this mismatch between English and programming? A variable named ListOfBooksFromPreferredLibraryInDatabaseOnLocalMachine! I don't care who you are, names like this are not "human readable" even though they are readable by a human.

Enter Hungarian Notation

Ok, so clearly we have a problem here. Programmers are never taught a real way to name things correctly; there is no systematic approach, and the 'common knowledge' for naming results in train wrecks and false assumptions. This problem is compounded a hundredfold when you introduce a dynamic language like Ruby, where you don't even have a compiler or smart IDE to help you along the way to decode what a name is *really* pointing to. How many times do you find yourself, when reading *someone else's* Ruby code, asking "so, is this variable storing a count of things in a list, or the list itself, or a map of keys to a set of lists?"

Let's be engineers for a second, and come up with a list of goals for the ultimate way to name abstractions. I'm going to take it straight from the horse's mouth; feel free to disagree in the comments:

When confronted with the need for a new name in a program, a good programmer will generally consider the following factors to reach a decision:
  1. Mnemonic value—so that the programmer can remember the name.
  2. Suggestive value—so that others can read the code.
  3. "Consistency"—this is often viewed as an aesthetic idea, yet it also has to do with the information efficiency of the program text. Roughly speaking, we want similar names for similar quantities.
  4. Speed of the decision—we cannot spend too much time pondering the name of a single quantity, nor is there time for typing and editing extremely long variable names.
Despite what you probably have read before, reaching the best-case scenario for these goals is the point of Hungarian notation. Ruby already starts down this path. For example, methods which return a boolean generally end in ?, and instance variables start with an @. These are consistent, suggestive idioms which are quick to think of, quick to type, quick to read, and quick to grok. Let's take it to the next level though, shall we?

What Is Hungarian Notation? (Or, how to name a duck.)

You can read the original paper about Hungarian if you want a lot of the details, but I am going to describe it here in the simplest way possible. I'm also going to put my own little spin on it to best suit Rubyists.

Now, we've all heard of Duck Typing. You know, "Looks like a duck, walks like a duck" and so on. But really, what ducks are we talking about? If you call a method .to_s on an instance, that instance can be any object that responds_to? :to_s. This is our Duck, the "responds to :to_s" duck. That's usually where the discussion ends, though. Let's start naming our ducks!

Let's not fall into the trap of naming it something long and 'human readable' though, like DuckThatCanBecomeAString. Let's just give it a 3 or 4 letter 'tag'. How about ccs? Why ccs? Well, it doesn't really matter. "But Greg, what the hell does ccs mean?!" I hear you screaming. Well, I already told you, it means "this instance responds to the method to_s!" I know, I know, you don't want to know what it means, you want to know what it stands for. Well, does it really matter? It might stand for something, but if you know the tag and what it is encoding for, it doesn't matter what it stands for.* Put succinctly, Hungarian tags are terse means of encoding programmer intent in a name; in the case of Ruby, the tag names our duck.

Every time you find yourself naming some variable, ask yourself, "what duck is this?" If it's a new kind of duck, come up with a new tag. If it's not, reuse one of your existing tags. It's encouraged to chain tags together! This is where the power of Hungarian starts kicking in: with short, terse, but meaningful tags, you are able to encode many, many times more information in your variable names. Once your brain is wired this way, you will quickly realize that the variable "mp_sid_rg_ccs" is a hash which maps stringified database ids to lists of objects which are known to respond to :to_s. I'd hate to even guess what you would have named this variable beforehand!
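
To spell that example out (with toy data, obviously):

# A hash (mp_) from stringified database ids (sid_) to lists (rg_) of
# things that respond to :to_s (ccs).
mp_sid_rg_ccs = {
  '17' => [Time.now, 3.14],
  '42' => [:pending, 12_345],
}
mp_sid_rg_ccs['42'].map { |ccs| ccs.to_s }  # => ["pending", "12345"]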

Kind vs. Type, and the Hungarian Dictionary

One of the biggest impediments to Hungarian in the past has been the disconnect in programmers' minds between "kind" and "type." It's ironic that the breaking down of English semantics (one of the very reasons Hungarian was invented) led to its demise. Hungarian tags are not meant to encode "type", which in many languages like C# and Java is a first-level construct that has a very specific meaning. Joel has a long explanation about this, if you're interested. No matter. In languages like Ruby, we don't really rely upon types so much as we rely upon Ducks, and at their core the concepts you already understand about Duck typing apply directly to the reason Hungarian tags were conceived.

In addition to your code and your tests/specs, your 'Hungarian dictionary' becomes a first-class code artifact. This is usually a simple text file which lists all your tags and the meaning behind them. Here's a snippet from one of my own:

rg_ - List of
trg - Table-of
sid_X - School specified id for entity
fbid - Facebook id
lcode - Log code, used to easily grep log files
url - Stringified URL
uri - Real URI object
pth - Path to directory
fpth - Path to file
fn - lambda/function pointer
mil - Time in milliseconds
linf - list of info regarding something, ex, linf_lrg could be [sid_cor, sid_sked, sid_lrg]
minf - map of info regarding something, ex, minf_lrg could be {:sid_cor => sid_cor}
col - column name

Armed with the Hungarian dictionary, every name in your code becomes a rich, meaningful entity, instead of something a programmer pulled out of thin air. Oftentimes, with a good enough set of Hungarian tags, naming a variable is a mindless exercise requiring almost no thought. Have a list of paths to files? rg_fpth.

Hungarian also allows for the inclusion of a "qualifier", which is a descriptive name that is usually a one-off way to differentiate between variables with common tags. Specifically, say you have two lists of files. Both are going to be named rg_fpth. One list, however, is of closed files, so you may name that one rg_fpth_closed. "But wait, doesn't this then fall back on 'human readable' naming?", you say. Indeed it does, but as stated above, these qualifiers should be used as one-offs. Oftentimes you will find yourself re-using qualifiers, in which case you probably want to refactor the qualifiers into their own unique tag. So, for paths to closed files, we might introduce fcpth, and then our variable above becomes rg_fcpth instead of rg_fpth_closed. Either way, we have more consistent and easy-to-grok names.

More than just Ducks

Ok, I lied a little bit. Hungarian tags can encode more than just Ducks. They can encode whatever the heck you want, as long as it's something that you will re-use and something that can be clearly documented in the Hungarian dictionary.

Ducks are limited: they are based upon what methods an instance responds to. We have many different uses for something that looks and acts like a string. You might have strings that contain HTML, strings that are newline-terminated, strings that are the header of a file, and strings that contain comma-separated values. Same duck, different intent. If these concepts are important in your algorithm, and are unambiguous, by all means encode this intent as a Hungarian tag. The sheer power you get from adding a Hungarian tag called html that encodes "this is a string that contains html" is eye-opening, since you can quickly see in your code which strings have markup and which strings do not.

Why Prefixes?

Most of this discussion so far has been about naming variables. Before getting to the other stuff, I'd like to talk briefly about why it's called Hungarian notation.

Well, of course, Charles Simonyi is Hungarian, but that's not really why. You'll notice that the names in Hungarian rely upon an intent-encoding tag prefix. The key is the fact that it's a prefix. Hungarian, like many other spoken languages, puts the noun before the adjective. Unlike English, it's "book blue" instead of "blue book".

I think you'll find that in programming, thinking about things this way makes a whole lot more sense. It also allows much easier navigation in your text editor. There is a nice sense of symmetry (or, if you're a new age Rubyist, joy and beauty) when you start names based upon what they are. Let's take a non-Hungarianized piece of code and just re-write it to use prefix-based naming:

first_book_index = 5
last_book_index = 10

first_book_index.upto(last_book_index) do |my_current_index|
  open_books_list[my_current_index].close
end

It might be hard for the unconverted to see, but my Hungarianized brain has a hard time parsing these names because I am used to only having to scan the beginning of the name to know what kind of item the name points to. Really, we have two kinds of ducks here: indexes, and a list of open books. Reading the words "first, last, my_current, open_books" is just too confusing:

index_of_first_book = 5
index_of_last_book = 10

index_of_first_book.upto(index_of_last_book) do |index|
  list_books_open[index].close
end

This reads much nicer, especially if you've spoken a language other than English where you can naturally parse out the 'kind' information at the beginning of the name. Now, let's introduce some tags: 'i' will mean index into a list, 'rg' will be list, 'bk' will be our new name for class Book, and 'opbk' will be an open book.

i_first = 5
i_last = 10

i_first.upto(i_last) do |i|
  rg_opbk[i].close
end
Yes, it might look like garbage to a person reading the code for the first time. But once their brain naturally operates on the Hungarian wavelength, this code is extraordinarily easy to read and as a bonus will be the 'one true way' to write this short algorithm, since the names are canonical.

Naming Other Abstractions

Naming classes is easy: just make each class a Hungarian tag. Naming methods is a little different. The returned value of a method should have its Hungarian tag at the start of the method name. Oftentimes methods simply convert one set of objects to another, all of which can be described as a Hungarian tag. (You'll notice this is really the case 90% of the time, once you've Hungarianized your code.) What used to be OpenBooksForGuy, or maybe GetListOfOpenBooksFromFirstNameAndLastName, becomes simply rg_opbk_from_pnst (where rg and opbk are defined as above, and pnst is considered to be the pair of strings first name and last name.)
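
A sketch of what that might look like (the body, the Guy model, and the st_ tag for plain strings are all my inventions):

def rg_opbk_from_pnst(pnst)
  st_first, st_last = pnst
  guy = Guy.find_by_first_name_and_last_name(st_first, st_last)
  guy.books.select { |bk| bk.open? }  # an rg_ of opbk
end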

By putting the return kind as the start of the method name, you can quickly grok the gist of what your methods do by the names: usually they are just mapping a set of tags to another tag. Really, this is the essence of a function anyway, isn't it?

Final Thoughts

No doubt, the concept of Hungarian is a controversial one. It is plagued by three things: a bad reputation, unintuitiveness, and the fact that it is 'all-or-nothing'. Hungarianized code is not code you can just pick up and start understanding on its own. You need the Hungarian dictionary. However, once you have learned the Hungarian tags, the code becomes many times richer and naming variables becomes an exercise devoid of creativity: much like when you type the keyword 'class' or use block syntax. Code becomes more consistent, more patterns emerge, and the amount of complexity your mind can handle goes up severalfold.

However, in my experience, you have to go all or nothing with Hungarian. If you use it occasionally, all of the benefits are sucked out, since you ultimately fall back on the least common denominator of naming styles in your code. The power comes with 100% consistency using the conventions. Hungarian turns the opinion of a 'bad variable name' into an objective 'incorrect variable name'! With this comes power: it becomes possible for a code reviewer to find 'errors' in variable names. This is a good thing, since a certain algorithm will have a smaller set of correct implementations, meaning there is less ambiguity for those who are reading the code for the first time.

Once you start using it, you'll find communication within your team to be much more precise. Sure, it will sound like Greek to an outsider when you say "Well, shouldn't the R-S-T be attached to the end of the R-G-C-S-T, not the R-G-SID-C-S-T?" but everyone will know exactly what you mean, and you will get a real answer to your question instead of a question *about* your question in return! Truth be told, what's happened is that through using Hungarian, you have naturally evolved a Domain Specific Language around your code. And like any worthwhile language, there is a learning curve.

If a Hungarian movement takes hold in a dynamic language community like Ruby, no doubt many canonical Hungarian tags will emerge. The more tags which are considered canonical, the better. In the original Simonyi paper there are many such tags outlined; however, they are heavily biased towards legacy languages like C. Ruby is a beautiful language full of many powerful conventions; by moving the power of convention into naming, there will be yet another huge boost in programmer productivity!

* - Often it's useful to have a 'pseudo-meaningful' tag. Once memorized, your brain will just think of it as a 'ccs', but when you are first learning the Hungarian dictionary, being able to bootstrap it in as "[C]an be [C]onverted to [S]tring" definitely helps to speed up memorization.