Sunday, November 18, 2007

Rediscovering Hungarian Notation

Hungarian notation is probably the most undervalued, misunderstood concept in modern programming history. It's time to forget everything you know about Hungarian and seriously consider using it in your Ruby code (or any dynamic language, for that matter).

When you write code, you're systematically writing a structured representation of an executable process. Loops, methods, classes, all these things have a relatively strict syntax that you must fit your problem into. By forcing you into a firm set of syntax and semantics (namely, a programming language), you are able to unambiguously communicate to three parties effectively: the computer, other programmers, and even yourself, a few months from now.

In Rails, there is much talk of "convention over configuration." All this is saying is that having assumed syntax and semantics is more desirable than having to describe these semantics explicitly each time you use them. For example, table names are pluralized versions of model class names. The same idea applies to programming languages themselves: what if you had to describe to the Ruby interpreter what a class "was" every time you declared one? It's in fact much nicer to simply assume the "convention" of having the keyword 'class' come packaged with a set of syntax and semantics. Everyone runs with this convention and everyone wins.

However, in nearly all languages there is one element of their syntax and semantics that falls short of providing you with the power of structure and "convention": naming. Nearly everyone has an opinion on naming. This includes naming local variables, instance variables, methods, classes, files, tables, and so on.

Why Human Readable Names Are Evil

The common viewpoint on naming, largely inspired as a backlash from Hungarian in my opinion, is that names should be "human readable." For example, you might have a class called Book. A book has a collection of Pages. And so on.

There are many, many problems with this approach. First and foremost, having these nice friendly names leads to one of the biggest poisons in programming: unfounded assumptions. How many times have you spent hours fighting a bug, finally squashing it by realizing one of your many assumptions turned out to be blatantly false? These can range anywhere from the low level ("x is never null") to the ridiculously and sometimes shockingly counterintuitive ("it doesn't matter if the laptop is on AC power or battery power.")

A class name like "Book" plays to our many experiences with anything called "Book" -- from the real tangible thing you find in the library to other abstractions in code for Books. Most of the assumptions you'll have about class Book will be wrong, and the few things you might get right (for example, this class Book might have a collection of Pages) almost never outweigh the confusion, bugs, and mental pain caused by your far more numerous false assumptions.

Never mind talking out loud about these things. If you have a class named "Email", I dare you to try to get into a conversation with 5 other software engineers about the project that isn't riddled with qualifiers to say which type of Email you're talking about. (The "Email" class, an "Email" in your Outlook, an "Email" sent to the server, an "Email" row in the database, an "Email" address, etc.)

The sum of all these assumptions from 'human readable' names is that programmers grossly overestimate their understanding of bodies of code. You check out the trunk of an SVN project from Rubyforge, skim through it with all its nice names, and somehow you feel like you understand it. You know why? Not because you understand it, but because it was 'human readable.' Your familiarity with the names lulled your mind into peaceful tranquility. I dare you to try to fix a bug or add a feature, and you will find yourself in a quagmire of undoing all the assumptions you had about what the code was doing in the first place.

There is another problem with "human readable" names beyond injecting false assumptions into your brain. You run out of steam quickly when you try to tack on more and more English names into programming abstractions. You might soon find that you have 12 different types of books, each of which can be in 3 different types of lists, some of which are in memory and some on disk. What is the end result of this mismatch between English and programming? A variable named ListOfBooksFromPreferredLibraryInDatabaseOnLocalMachine! I don't care who you are, names like this are not "human readable" even though they are readable by a human.

Enter Hungarian Notation

Ok, so clearly we have a problem here. Programmers are never taught a real way to name things correctly, there is no systematic approach, and the 'common knowledge' for naming results in train-wrecks and false assumptions. This problem is compounded a hundredfold when you introduce a dynamic language like Ruby, where you don't even have a compiler or smart IDE to help you decode what a name is *really* pointing to. How many times do you find yourself, when reading *someone else's* Ruby code, asking "so, is this variable storing a count of things in a list, or the list itself, or a map of keys to a set of lists?"

Let's be engineers for a second, and come up with a list of goals for the ultimate way to name abstractions. I'm going to take it straight from the horse's mouth, feel free to disagree in the comments:

When confronted with the need for a new name in a program, a good programmer will generally consider the following factors to reach a decision:
  1. Mnemonic value—so that the programmer can remember the name.
  2. Suggestive value—so that others can read the code.
  3. "Consistency"—this is often viewed as an aesthetic idea, yet it also has to do with the information efficiency of the program text. Roughly speaking, we want similar names for similar quantities.
  4. Speed of the decision—we cannot spend too much time pondering the name of a single quantity, nor is there time for typing and editing extremely long variable names.
Despite what you have probably read before, getting as close as possible to all four of these goals is the point of Hungarian notation. Ruby already starts down this path. For example, methods which return a boolean generally end in ?, and instance variables start with an @. These are consistent, suggestive idioms which are quick to think of, quick to type, quick to read, and quick to grok. Let's take it to the next level though, shall we?

What Is Hungarian Notation? (Or, how to name a duck.)

You can read the original paper about Hungarian if you want a lot of the details, but I am going to describe it here in the simplest way possible. I'm also going to put my own little spin on it to best suit Rubyists.

Now, we've all heard of Duck Typing. You know, "Looks like a duck, walks like a duck" and so on. But really, what ducks are we talking about? If you call the method .to_s on an instance, that instance can be any object for which respond_to? :to_s is true. This is our Duck, the "responds to :to_s" duck. That's usually where the discussion ends, though. Let's start naming our ducks!

Let's not fall into the trap of naming it something long and 'human readable' though, like DuckThatCanBecomeAString. Let's just give it a 3 or 4 letter 'tag'. How about ccs? Why ccs? Well, it doesn't matter really. "But Greg, what the hell does ccs mean?!" I hear you screaming. Well, I already told you, it means "this instance responds to the method to_s!" I know, I know, you don't want to know what it means, you want to know what it stands for. Well, does it really matter? It might stand for something, but if you know the tag and what it is encoding for, it doesn't matter what it stands for.* Put succinctly, Hungarian tags are a terse means of encoding programmer intent in a name. In the case of Ruby, the tag names our duck.

Every time you find yourself naming some variable, ask yourself, "what duck is this?" If it's a new kind of duck, come up with a new tag. If it's not, reuse one of your existing tags. Chaining tags together is encouraged! This is where the power of Hungarian starts kicking in: with short, terse, but meaningful tags, you are able to encode many, many times more information in your variable names. Once your brain is wired this way, you will quickly realize that the variable "mp_sid_rg_ccs" is a hash which maps stringified database ids to lists of objects which are known to respond to :to_s. I'd hate to even guess what you would have named this variable beforehand!
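To make the chained name concrete, here is a minimal sketch of what a variable called mp_sid_rg_ccs might hold. The helper ccs? and the sample data are invented for illustration, and the tag meanings in the comments follow the reading given in the paragraph above.

```ruby
# Tags assumed for this sketch (they would live in the Hungarian dictionary):
#   ccs - any object that responds to :to_s
#   sid - stringified database id
#   rg_ - list of
#   mp_ - hash mapping the first tag to the second

# Hypothetical helper: true when obj is a ccs duck.
def ccs?(obj)
  obj.respond_to?(:to_s)
end

# Hypothetical sample data: stringified ids mapped to lists of ccs ducks.
def mp_sid_rg_ccs_example
  {
    "17" => [42, :title, nil],
    "23" => [3.5, "plain string"]
  }
end

mp_sid_rg_ccs = mp_sid_rg_ccs_example

mp_sid_rg_ccs.each do |sid, rg_ccs|
  # Each element is a ccs, so to_s is the one method we may rely on.
  puts "#{sid}: #{rg_ccs.map(&:to_s).join(', ')}"
end
```

The block parameters reuse the same tags as the hash name, so the shape of the data can be read off the names at every level.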

Kind vs. Type, and the Hungarian Dictionary

One of the biggest impediments to Hungarian in the past has been the disconnect in programmers' minds between "kind" and "type." It's ironic that the breaking down of English semantics (one of the very reasons Hungarian was invented) led to its demise. Hungarian tags are not meant to encode "type", which in many languages like C# and Java is a first-class construct that has a very specific meaning. Joel has a long explanation about this, if you're interested. No matter. In languages like Ruby, we don't really rely upon types so much as we rely upon Ducks, and at their core the concepts you already understand about Duck typing apply directly to the reason Hungarian tags were conceived.

In addition to your code and your tests/specs, your 'Hungarian dictionary' becomes a first class code artifact. This is usually a simple text file which lists all your tags and the meaning behind them. Here's a snippet from one of my own:

rg_ - List of
trg - Table-of
sid_X - School specified id for entity
fbid - Facebook id
lcode - Log code, used to easily grep log files
url - Stringified URL
uri - Real URI object
pth - Path to directory
fpth - Path to file
fn - lambda/function pointer
mil - Time in milliseconds
linf - list of info regarding something, e.g., linf_lrg could be [sid_cor, sid_sked, sid_lrg]
minf - map of info regarding something, e.g., minf_lrg could be {:sid_cor => sid_cor}
col - column name

Armed with the Hungarian dictionary, every name in your code becomes a rich meaningful entity, instead of something a programmer pulled out of thin air. Often times, with a good enough set of Hungarian tags, naming a variable is a mindless exercise requiring almost no thought. Have a list of paths to files? rg_fpth.

Hungarian also allows for the inclusion of a "qualifier", which is a descriptive name that is usually a one-off way to differentiate between variables with common tags. Specifically, say you have two lists of files. Both are going to be named rg_fpth. One list, however, holds closed files, so you might name it rg_fpth_closed. "But wait, doesn't this then fall back on 'human readable' naming?", you say. Indeed, it does, but as stated above, these qualifiers should be used as one-offs. Often you will find yourself re-using a qualifier, in which case you probably want to refactor it into its own unique tag. So, for paths to closed files, we might introduce fcpth, and then our variable above becomes rg_fcpth instead of rg_fpth_closed. Either way, we have more consistent and easy to grok names.
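Because the dictionary is just a table of tags, decoding a name can be done mechanically. Here is a rough sketch, assuming a tiny slice of the dictionary above is kept in two hashes. The constants CONTAINER_TAGS and BASE_TAGS and the method describe_name are made up for this example, and anything after the first base tag is treated as a one-off qualifier.

```ruby
# A tiny, hypothetical slice of a Hungarian dictionary. Container tags wrap
# whatever follows them; base tags name a single kind of thing.
CONTAINER_TAGS = { "rg" => "list of" }.freeze
BASE_TAGS = {
  "fpth" => "path to file",
  "pth"  => "path to directory",
  "url"  => "stringified URL",
  "mil"  => "time in milliseconds"
}.freeze

# Decode a name such as "rg_fpth" into plain English.
# Returns nil when the name does not start with known tags.
def describe_name(name)
  parts = name.split("_")
  words = []
  while (tag = parts.shift)
    if CONTAINER_TAGS.key?(tag)
      words << CONTAINER_TAGS[tag]
    elsif BASE_TAGS.key?(tag)
      words << BASE_TAGS[tag]
      # Whatever is left after the base tag is a one-off qualifier.
      words << "(#{parts.join(' ')})" unless parts.empty?
      return words.join(" ")
    else
      return nil # unknown tag: the name is not in the dictionary
    end
  end
  nil # only container tags and no base tag
end

puts describe_name("rg_fpth")         # => list of path to file
puts describe_name("rg_fpth_closed")  # => list of path to file (closed)
```

If a qualifier like closed keeps showing up in the output, that is the hint to promote it to a tag of its own, as with fcpth above.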

More than just Ducks

Ok, I lied a little bit. Hungarian tags can encode more than just Ducks. They can encode whatever the heck you want, as long as it's something that you will re-use and something that can be clearly documented in the Hungarian dictionary.

Ducks are limited: they are based upon what methods an instance responds to. We have many different uses for something that looks and acts like a string. You might have strings that contain HTML, strings that are newline terminated, strings that are the header of a file, and strings that contain comma-separated values. Same duck, different intent. If these concepts are important in your algorithm, and are unambiguous, by all means, encode this intent as a Hungarian tag. The sheer power you get from adding a Hungarian tag called html that encodes "this is a string that contains html" is eye-opening, since you can quickly see in your code which strings have markup and which strings do not.
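Here is a small sketch of the html tag in action. The tag s for "plain string with no markup" and the method html_from_s are assumptions made for this example; the escaping itself uses ERB::Util.html_escape from Ruby's standard library.

```ruby
require "erb"

# Tags assumed for this sketch:
#   s    - plain string with no markup (hypothetical tag)
#   html - string that contains HTML

# Converts an s into an html by escaping markup characters.
def html_from_s(s)
  ERB::Util.html_escape(s)
end

s_title    = "Ducks & <Geese>"
html_title = html_from_s(s_title)

# html built from html: the tags on both sides agree.
html_page = "<h1>#{html_title}</h1>"

# Writing "<h1>#{s_title}</h1>" instead would drop an s where an html
# belongs, and the mismatched tags make that visible without running anything.
puts html_page
```

Both variables are the same duck (they are ordinary Ruby strings), so only the names carry the difference in intent.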

Why Prefixes?

Most of this discussion so far has been about naming variables. Before getting to the other stuff, I'd like to talk briefly about why it's called Hungarian notation.

Well, of course, Charles Simonyi is Hungarian, but that's not really why. You'll notice that the names in Hungarian rely upon an intent-coding tag prefix. The key is the fact it's a prefix. Hungarian, like several other languages, puts the family name before the given name: the kind of thing comes first and the qualifier second. It's as if English said "book blue" instead of "blue book".

I think you'll find that in programming, thinking about things this way makes a whole lot more sense. It also allows much easier navigation in your text editor. There is a nice sense of symmetry (or, if you're a new age Rubyist, joy and beauty) when you start names based upon what they are. Let's take a non-Hungarianized piece of code and just re-write it to use prefix based naming:

first_book_index = 5
last_book_index = 10

first_book_index.upto(last_book_index) do |my_current_index|
  open_books_list[my_current_index].close
end

It might be hard for the unconverted to see, but my Hungarianized brain has a hard time parsing these names because I am used to only having to scan the beginning of the name to know what kind of item the name points to. Really we have two kinds of ducks here: indexes, and a list of open books. Reading the words "first, last, my_current, open_books" is just too confusing:

index_of_first_book = 5
index_of_last_book = 10

index_of_first_book.upto(index_of_last_book) do |index|
  list_books_open[index].close
end

This reads much nicer, especially if you've spoken a language other than English where you can naturally parse out the 'kind' information at the beginning of the name. Now, let's introduce some tags: 'i' will mean index into a list, 'rg' will be list, 'bk' will be our new name for class Book, and 'opbk' will be an open book.

i_first = 5
i_last = 10

i_first.upto(i_last) do |i|
  rg_opbk[i].close
end

Yes, it might look like garbage to a person reading the code for the first time. But once their brain naturally operates on the Hungarian wavelength, this code is extraordinarily easy to read and as a bonus will be the 'one true way' to write this short algorithm, since the names are canonical.

Naming Other Abstractions

Naming classes is easy: just make each class a Hungarian tag. Naming methods is a little different. The Hungarian tag of a method's return value should go at the start of the method name. Often methods simply convert one set of objects to another, all of which can be described by Hungarian tags. (You'll notice this is really the case 90% of the time, once you've Hungarianized your code.) What used to be OpenBooksForGuy, or maybe GetListOfOpenBooksFromFirstNameAndLastName, becomes simply rg_opbk_from_pnst (where rg and opbk are defined as above, and pnst is the pair of strings first name and last name).

By putting the return kind as the start of the method name, you can quickly grok the gist of what your methods do by the names: usually they are just mapping a set of tags to another tag. Really, this is the essence of a function anyway, isn't it?
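As a rough sketch of a method named this way, the code below maps a pnst to a list of open books, using the opbk tag defined earlier. The Bk struct, its fields, the f tag for a boolean flag, and the RG_BK sample data are all invented for illustration.

```ruby
# Tags assumed for this sketch:
#   bk   - book (stand-in for the post's Book class)
#   opbk - open book
#   rg_  - list of
#   pnst - pair of strings: [first name, last name]
#   f    - boolean flag (hypothetical tag)

Bk = Struct.new(:pnst_owner, :title, :f_open) do
  def open?
    f_open
  end
end

# Hypothetical in-memory data; a real project would load this elsewhere.
RG_BK = [
  Bk.new(["Ada", "Lovelace"], "Notes", true),
  Bk.new(["Ada", "Lovelace"], "Letters", false),
  Bk.new(["Alan", "Turing"], "Computable Numbers", true)
].freeze

# Return tag first, argument tag last: maps a pnst to an rg_opbk.
def rg_opbk_from_pnst(pnst)
  RG_BK.select { |bk| bk.pnst_owner == pnst && bk.open? }
end

rg_opbk = rg_opbk_from_pnst(["Ada", "Lovelace"])
rg_opbk.each { |opbk| puts opbk.title }
```

The variable that receives the result takes its name straight from the front of the method name, so the assignment reads as "an rg_opbk from a pnst" on both sides.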

Final Thoughts

No doubt, the concept of Hungarian is a controversial one. It is plagued by three things: a bad reputation, unintuitiveness, and the fact it is 'all-or-nothing'. Hungarianized code is not code you can just pick up and start understanding on its own. You need the Hungarian dictionary. However, once you have learned the Hungarian tags, the code becomes many times richer and naming variables becomes an exercise devoid of creativity: much like when you type the keyword 'class' or use block syntax. Code becomes more consistent, more patterns emerge, and the amount of complexity your mind can handle goes up severalfold.

However, in my experience, you have to go all or nothing with Hungarian. If you use it occasionally, all of the benefits are sucked out, since you ultimately fall back on the lowest common denominator of naming styles in your code. The power comes with 100% consistency in using the conventions. Hungarian turns the opinion of a 'bad variable name' into an objective 'incorrect variable name'! With this comes power: it becomes possible for a code reviewer to find 'errors' in variable names. This is a good thing, since a given algorithm will have a smaller set of correct implementations, meaning there is less ambiguity for those who are reading the code for the first time.
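Once names can be objectively incorrect, part of that review can be automated. Below is a deliberately naive sketch, assuming the dictionary's tags are available as a list. KNOWN_TAGS, the name tag, and rg_name_incorrect are hypothetical, and the regular expression only catches simple "name = value" lines; it is not a real Ruby parser.

```ruby
# Hypothetical dictionary slice; a real one would be read from the text file.
KNOWN_TAGS = %w[rg i bk opbk fpth pth url mil].freeze

# Returns the assigned names in src whose leading tag is not in the
# dictionary. Only lines of the form "name = value" are considered.
def rg_name_incorrect(src)
  rg_name = src.scan(/^\s*([a-z_][a-z0-9_]*)\s*=(?!=)/).flatten
  rg_name.reject { |name| KNOWN_TAGS.include?(name.split("_").first) }
end

src = <<~RUBY
  i_first = 5
  last_book_index = 10
  rg_opbk = []
RUBY

puts rg_name_incorrect(src).inspect  # => ["last_book_index"]
```

A check like this only confirms that a name starts with a known tag. Whether the tag matches what the variable really holds is still the reviewer's call.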

Once you start using it, you'll find communication within your team about things to be much more precise. Sure, it will sound like Greek to an outsider when you say "Well, shouldn't the R-S-T be attached to the end of the R-G-C-S-T not the R-G-SID-C-S-T?" but everyone will know exactly what you mean, and you will get a real answer to your question instead of a question *about* your question in return! Truth be told, what's happened is through using Hungarian, you have naturally evolved a Domain Specific Language around your code. And like any worthwhile language, there is a learning curve.

If a Hungarian movement takes hold in a dynamic language community like Ruby, no doubt many canonical Hungarian tags will emerge. The more tags which are considered canonical, the better. The original Simonyi paper outlines many such tags, but they are heavily biased towards legacy languages like C. Ruby is a beautiful language full of many powerful conventions. By moving the power of convention into naming, there will be yet another huge boost in programmer productivity!

* - Often it's useful to have a 'pseudo-meaningful' tag. Once memorized, your brain will just think of it as a 'ccs', but when you are first learning the Hungarian dictionary, being able to bootstrap it in as "[C]an be [C]onverted to [S]tring" definitely helps to speed up memorization.

2 comments:

William said...

Hungarian notation I feel at best is a silly notation with little redeeming value. A name is important and shouldn't need to be adorned with extraneous things that may or may not change as the code matures. Your example about unfounded assumption about code does not disappear with the introduction of Hungarian notation but increases. By attaching "other" information to the name you assume that this will help clarify its purpose/type/role, but in fact doing so runs a greater risk of being wrong, or at least wrong in the future. Better by far is to limit the size of routines/methods/functions such that local bindings can be seen as such and global bindings, if any, are obvious. In the end a good program should be like good prose. Good prose is good because of clear well thought out names, events, and metaphors. Good code should be the same and need not use pig-latin and made up nomenclatures to tell its story.

Chris Moorhouse said...

index_of_first_book = 5
index_of_last_book = 10

index_of_first_book.upto(index_of_last_book) do |index|
  list_books_open[index].close
end

Okay, I'm with you so far. The haphazard way in which English cannibalizes other languages leads to terrible inconsistency, making reading more difficult.

i_first = 5
i_last = 10

i_first.upto(i_last) do |i|
  rg_opbk[i].close
end

Uhh... what? By taking some letters out of the words (or even substituting some other letters wholesale for other words), you have magically made things more clear? That's funny, because all I see are the exact same meanings to words I don't know, mapped one-to-one with words I DO know. What advantage does "rg_" have over "list"?

I submit that what you've done here is a two-step process: First, understand what information about a name is useless and discard it, and formalize the remaining useful information. Second, take any English words that result from step one, and change them into some other language, whose syntax and grammar are less widely understood than English.

Why the second step? What's the benefit? How does that second step add any information, or make anything more clear? How is "rg_" objectively any more clear or more useful than "list"? How is "opbk" better than "open_books?" In both cases, you are communicating the exact same thing.

My business brain reels at the massive expenditure in training that this second step adds, when the first step is entirely adequate for removing unwanted or useless information from names.

To be perfectly clear: if it really "doesn't matter what it stands for", why can't I call a list of anything a list? Why is it necessary to invent another word for something that already has a word?