Saturday, January 28, 2012

How meta can you go?

I'm staring at a configuration dialog. While playing with the connection string for the past hour or so, I've learned the subtleties of how the server expects the parameters to be set in order to finally, dear Lord, let me into the database when connected to the production VPN. The trick, I've found, is to set my connection password to be the same as my user account, but leave the authentication credentials section blank, because otherwise the operating system's access control will prevent the impersonation of my account with no possible intervention allowed on my behalf. It so happens if I leave it blank I may instead enter my credentials di-rect-ly, "doing it live", and things seem to just magically work. So it goes.

I'm staring at a command prompt. I've tried uploading the tgz's four times so far, realizing on the third that if I batch them up into groups of three I no longer get timeout errors or corruption. A few hours back I was getting errors because I misconfigured the SSL certificate on the server. (Who knew you had to paste the public key twice, verbatim, into where St. Apache would be looking for it?) A hoop jumped througheth, another wall hit. With 3AM fast approaching, salvation seems imminent if I simply rate limit the upload script. I'll just throw a sleep(rand(15)) into it at just the right place, making it more human than any AI, really, and this should go swimmingly enough for St. Nagios to pick up the slack forthwith, or at least punt the ball to the Operations Team where it becomes Not My Problem.

I'm staring at a web page. Clearly the boxes are not lined up properly (albeit blurrily as my coffee is wearing off and my eyes are weakening, as I approach my mid-thirties in the web-game.) I've found that if I change the width: parameter for this particular div in this particular CSS class on this particular page in this particular browser, it seems to render appropriately. But *minor chord* Internet Explorer 7 renders it "without layout" and Safari, just to call into question determinism itself, moves it to the bottom of the page, blinking. (But who cares about Safari anyway?) If I abandon the Holy Box Model and indulge myself in the old circa-1999 TABLE with CELLPADDING and CELLSPACING technique, voilĂ , my retina sighs as the box clicks snugly into place. But, TABLE is a faux pas, and my ego will not let me check in such blasphemy else I would further defer my coronation as a Post-iPhone Designer five years hence. So instead, I de-dust the page-inflated, hyper-designed glossy CSS books (written for other, right-brainers) and try to decode what the heck is going on, mentally carving off the coats of historical disease-paint that yield the behavior of the ancient browsers still being mis-double-clicked within The Enterprise. Defeated, I Google "browser hacks" and in five minutes have my Pyrrhic victory.

As software engineers, our minds are filled with enormous amounts of meaningless fact-bits about how to make computer things work. This is the tempered Hammer we use to bend the bits to our will. We fight these battles every day, earning our layers of scar tissue and the right to mock the "bartenders" of the Genius Bar. It is painful, and wrong. (The battles, not the mocking.) We learn the behavior and tricks necessary to get by, often skating past true understanding in order to level-up, ever closer to the elusive Product Launch. In other words, we want to get to the next rush, the next high. Sometimes, these decisions bite us in the ass, but this isn't nearly as common as it should be. (And for this we give thanks to our more disciplined forefathers who provided us sturdy digital shoulders to stand on.) As @dave_revell puts it, "The more I push code through static analysis, the more I'm amazed that computers boot at all." And so it goes. But this human tendency to unintentionally tighten the digital boots (we invented!) to stomp us in the face forevermore is not really the point of this post, since even when we do find CSS-zen, we may be even a bigger fool for spending the time doing so.

The greatest trick a software engineer ever played is convincing himself he was working on an important problem. When dancing with the tiger, fiddling with connection strings, we get an amazing, exhilarating rush when, thank God almighty, we finally Get The Damn Thing To Work. It is truly glorious, as we see our monstrosity come to life in the manner we originally intended only a few short hours (days? weeks?) ago. And, since our fortunes are decided at the speed of light, once the tide has turned, victory is swift and decisive indeed. (At least for now.) But then, like the blistering pinpricks of the morning sun after a night of drunken escapades, realization hits us on the other side asking: "what was I doing, again?"

The endorphins that flow from clearing a technical hurdle are opiates that impede truly critical thinking. These opiates are released anytime we run a program and see it do what we expect, refresh a web page and have it load correctly, see our UI tweak yield improvements in metrics, or watch a log and see stack traces vanish. I have found this observation to be sobering and perhaps even perscriptive in helping me decide on how go about my day. I'd like to think I'm a recovering addict who has, through sheer will and introspection, finally sobered up. For the rest of you, this is an intervention.

The title of this post is, "How meta can you go?" What I mean by this is, how many layers of abstraction can you reflect above what you are doing before you are looking into the void and can say nothing else that puts fire in your belly? If the number of layers is many, good for you -- you are on the right track. If it is shallow, perhaps you should re-evaluate what you are doing with your mind, yourself, and your life. Perhaps I am cynical in thinking that surely some layer will terminate in the void, with you realizing the pipeline you sit on terminates at /dev/null. But that is neither here nor there, the point remains that mentally moving through such abstractions can provide real clarity for you as a engineer and human and is a healthy mental exercise.

An example is in order. In all of my tales above, of which I hope you can relate to at least one, I am trying to decipher the internal magic that is occurring inside the stubborn metal box that, currently, is preventing me from becoming its master. What do I mean by "going meta?" It's simple: during such struggles, ask: "Why?"

Why are you fighting with connection strings? The answer can certainly be reasonable: "It is because I would like to get this database to talk to that computer program." Fine. It could also be unreasonable: "It is because I would like to take a look at the data myself." Maybe fine, but maybe not: is there any other way to accomplish your goal that will work more easily? Often there is, and it may be escaping you because you are unconsciously tightening the tourniquet to inject the opiate of Getting The Damn Thing To Work.

This concept on its own is not revolutionary, but lets keep unwinding the thread. Why are you trying to get that database to talk to that computer program? "Because I would like it to render this web page." Why are does it need to render the web page? "Because if it renders the web page then people will see the data correctly." Why does this matter? "Because people, having seen the data, will make appropriate decisions about what to do next." Why does this matter?

As you move up the chain, you will find yourself transitioning out of the realm of engineering and into the realm of mind, philosophy, and, for some of you, religion. Your own value system comes into play here. Here are some examples of how you might respond to your inner Socrates:

"It matters because if things work correctly I will be paid next month. That matters because it will allow me to keep my family healthy and safe. That matters because ... "

"Since I work in ecommerce, it matters because if decision makers make good decisions, their businesses will run more efficiently and they will have more income. This matters because I know the businesses we work with are honest, hard working people and would benefit greatly from such income. And that matters because ... "

"Since I work in the military, it matters because if decision makers make good decisions, people on the battlefield will be safer and our military objectives will be met with less loss of life. That matters because our military objectives are to bring peace to so-and-so, and that matters because ..."

Now, depending on your own philosophical views or religious views you may find yourself at a "peak" where it is no longer possible to explain why something matters. If you're an atheist, like I am, you may find that in fact you conclude that your actions don't "matter" above a certain level of abstraction, since, "in the long run, we are all dead." Or, you may feel that it does matter because of the rewards that await you in the afterlife. But, this is also not the point of the post. The point of this post is learning to regularly walk the path you take to this peak.

Along the climb, you may find yourself facing some uncomfortable truths. You may, for example, realize that the specific micro-task you are doing is not really providing progress towards a goal. You may be chasing the endorphins.

You may find that the micro-task is valid, but the higher level task you are working on today or this week is short-sighted or over-engineered, full of "interesting problems" without real purpose.

You might find that the task you are working on this week is truly interesting and provides real value to the project. But you might find that the project itself is not very meaningful. You might even be forced to realize what you are working on is unappealing, wrong, or maybe even, for a few of you, evil.

But let's say it is not evil and it is meaningful, you may realize that it will actually have no practical impact due to the particular strategy being taken. Or you may realize that it is flawed due to the people involved in the project. Or you may simply just realize that your time is better spent doing something else, despite the good intentions, since there is such a low likelihood of success.

And so you'll continue upwards you find yourself at a point where you must make the leap from "justifiable" to "unjustifiable" in your actions.

Of course, just because something is unjustifiable doesn't mean you shouldn't do it. History itself can arguably be described as "stories of people who did, according to their contemporaries or to us today, unjustifiable things" (for better or worse.) But, my view is that the farther you had to go to find your crossover into the unjustifiable, the less likely you're simply chasing those local endorphins that are poisons to clear thinking and happiness. If you manage to be content with your answers sans those eternal questions about the purpose of life itself, you should be sleeping well as your questions put you in good company.

Sadly, many people will not get this far: they may find they are caught up in the process of chasing opiates by clearing arbitrary hurdles: be they technical, financial, or cultural. There's nothing wrong with being a dreamer, a tinkerer, or a capitalist. We are all all of these things at some point or another. We make mistakes, we learn, and we understand the world better as we do. But often such mistakes are unnecessary since we are chasing the opiate of the Now at the cost of true happiness and accomplishment. Being aware of your awareness is a wonderful thing. Remember the chains you may have set yourself in can be easily broken out of once you see them as such.

Thursday, September 29, 2011

The iPad Information Diet

Like most programmers, I've struggled for years to pry myself away from distractions. As engineers, we're in the precarious situation where the tool we use to build things can also be used to spend countless hours consuming information. And boy, do we love our information, us nerds. The result? "That code is compiling, I'll just check Hacker News real quick." Oops, game over.

I've tried all sorts of things to curb random web surfing, but nothing ever stuck. The problem has always been the same as with any addiction: going cold turkey doesn't work. There are countless number of tools to turn off, block, ration, or measure web surfing to help you cope, but they sit on my machine installed but never used. As soon as you slip a little, you're fully back in again. I needed something that was high impact enough that it would make a difference, but low impact enough that it'd stick.

Introducing the iPad Information Diet. The goal: turn your computer back into a workstation, so you always are using it to Get Shit Done (TM). I've been doing this new technique for a week and it seems to be working. I'm getting more done for work, reading books (remember those?), going outside, and relaxing instead of surfing the web when I shouldn't be.

What you'll need:
  • Your computer
  • A text editor
  • An iPad (or tablet computer, I guess)
  • A little willpower

Step 1: Make your computer a computer again.


This is simple. You want to make it so your internet acts similar to how it did 10 years ago. Block everything you think that will distract you. How? I did it by dropping in the following hosts file into /etc/hosts and restarting my browsers:

https://gist.github.com/1252168

Boom. Welcome back to 1999, no social media, no cat videos, no news aggregators, just Google. Your poison may vary. Some sad souls might need to block Wikipedia, but I'm not quite to that point.

Step 2: Give yourself an out, you junkie.

"But Greg!" you exclaim, I need to go to Twitter and LinkedIn during my daily activities. Ok, fine. I'll do one better than that. You can surf all you want, even perezhilton.com, but you have to do it on the iPad. Plug that sucker into your main machine, sit it right next to your keyboard on its Smart Cover (you've got one of those, right?) It's your 10-inch portal to the modern internet. Use it whenever you want.

Like this:


Step 3: Start getting shit done.

Why does this work? I've found that I still get my fix on my iPad, but the time spent tapping around there is limited. I'm not quite sure why, it's probably for a few reasons:
  • It's a little cumbersome to get around on compared to the speed of my computer I'm used to. Typing sucks.
  • I *know* I am screwing around. There's a large context switch, and I am constantly aware of the fact I am in 'break' mode, not multitasking between 'work' and 'break' like it used to feel doing both on my main machine.
  • It's off to the side, so it feels like a secondary task. I'm not letting it take over my main task, which is on my big iMac.
That's it! I've been doing this for around a week, and it's provided a nice balance for me. I still have a pulse on all the sites I usually read, but their ability to suck me away from my work has been drastically limited. I've compartmentalized the distractions into a small secondary screen. I've had no urge to back out of this setup, since there's nothing onerous about it, it's just keeping work and play separated, and making sure I don't conflate the two.

Monday, May 3, 2010

The Social House of Cards

I believe that right now we are living in a bizarre era of the web that will be looked back on as a charming, fun, but mostly misguided period. Much like Altavista was a transitional technology that led to Google, Facebook (and it's newly acquainted sidekick, Zynga) is a product borne out of this unique era of the web's evolution, but will ultimately be seen as a transitional technology, not an everlasting empire. Unless of course the house of cards therein is rebuilt into something with a stronger foundation.

For the rest of this post, I ask the reader to remember Facebook's ultimate monetization path is assumed to be advertising. And of course, Zynga's monetization relies upon the mechanics of social gaming. The former, dubbed "Social Advertising" is discussed at length by Mark Maunder here.

Social advertising and social gaming are built upon several things which cannot be taken for granted, in the long term. Remove any one of these, and one or both of these revenue models evaporate.
  • People do not mind that their information be shared with third party sites to enhance their online experience.
  • People act as though virtual objects and virtual labor (mouse clicks) are real objects and real labor.
  • Social networking and gaming are the most appealing use of day-job free time.
I will go on to argue various scenarios which undermine these assumptions.

Scenario 1: Creepy Ads

In online advertising, there is a fine line between "relevant advertising" and "creepy advertising." Creepy advertising is like porn: hard to define, but you know it when you see it. If I were to try, creepy advertising can best be quantified by how much an ad makes you feel like someone is spying on you.

Remember: spying is often simply observing what people do in public. The creepiness stems not from the observing itself (we watch people doing things in public all the time), it's recording, aggregating, and acting upon those observations.

This type of aggregation and action lies at the heart of Facebook's monetization strategy. They want to aggregate and act on public, personal information in a way that is profitable but not creepy. They haven't succeeded yet, so this is why they are moving more and more towards creepy in their search for profitability.

The next step, as we all know, is to start sharing your information with third parties and letting them do the acting. This is opt-out, meaning it will start happening generally without people being aware of it. This will certainly result in more relevant advertising for some, but risks resulting in creepy advertising as well.

The hard problem for Facebook is it may only take one creepy impression in a pile of thousands to nudge a user towards turning off data sharing. And once turned off, that kills that user as a revenue source for Facebook. Not only that, but a user creeped out by Facebook-driven ads will probably loudly announce this to their friends, ironically, through Facebook. The network effects Facebook enables could very well turn into a self-destruct mechanism, a meme spread through the social graph in a few short days causing everyone to opt-out.

By this, Facebook is now trusting all third parties to act in a non-creepy manner. Viewed this way, Facebook's move to loosen data sharing rules seems at best naively reckless at worst a hail-mary pass to find a revenue stream.

Scenario 2: Privacy Disaster or Tragedy

The consolidation and availability of personal information and social connections opens up the possibility of it being used for much more devious things than creepy advertising. With services like Amazon Elastic Map Reduce and Elastic Compute Cloud, terabyte-scale data processing is accessible to your average consumer.

This algorithmic power, combined with the massive datasets available on the social graph, open up the possibility that seemingly impossible to answer questions about individuals can be answered, even if those folks are not even part of the graph itself.

So much information is within easy reach that it may enable some large-scale crime, disaster, or, god forbid, tragedy that goes beyond the typical identity thefts and security breaches we've grown used to reading about.

Such an event, reported by the media, that causes irreparable harm will pull the rug out. Afterwards, the public will be clamoring for, and the politicians eager to provide, regulation of the flow and analysis of aggregated social data. Such an event and the subsequently enacted laws will not only legally restrict the flow of information, but would cause users to think twice about what they share, similar to the scenario above.

Pending such an event, the information flow underpinning social advertising is one stroke of a pen away from disappearing.

Scenario 3: A Virtual Waste


Virtual objects & labor are valued by users in a way very similar to real objects and real labor. Farmville relies upon people valuing their "investment" in virtual labor to plant and harvest their crops. Virtual gifts on Facebook rely upon people valuing the experience of receiving those objects as if they were real.

It may always be possible to expect people to project real-world values onto virtual things, but there's a catch. No matter how real a virtual thing feels, a real thing is more real. For example, if I could buy you a virtual puppy or a real puppy, which one do you think you'd value more? If you spent an hour virtually harvesting your crops and an hour tending to your backyard garden, which do you think you'd feel more strongly invested in?

Ultimately, the valuation of virtual goods has a shelf life, until a real good can displace those virtual goods in whatever scenario they are being valued. Similarly, virtual labor towards virtual accomplishments is valuable only until actual labor towards actual accomplishments can be had with the same effort and time. (For more on this, check out Achievement Porn.) Time is the great equalizer in the case of labor, so virtual labor towards a virtual farm could be displaced by real labor towards something else altogether.

This observation doesn't tell you what these displacements are, only tells you that they exist. The day that an enterprising entrepreneur can provide a game that provides all the toil and benefit of the virtual labor on Farmville, but with tangible, physical results, is the day Farmville will seem quaint, silly, and a "waste of time."

Scenario 4: The Free Time Gap

Social gaming and social networking has apparently tapped into a vast reservoir of free time people have during their day jobs. What is so special about this time?
  • It is peppered with interruptions.
  • It is time in which people do not want to think too hard, since that energy is reserved for said interruptions. They'd rather spend the time being entertained.
  • It is time in which people have access to a computer with internet access.
  • It is time in which they likely need to be somewhat discreet about how they are using it.
As you can see, social gaming and social networking fits these constraints quite nicely. It can be context switched out of quickly, requires little thought, and lets them use the computer they already have Excel opened up in another window on.

You'll also notice that there's nothing that necessitates social networking & gaming as the best use of this time. Previously, this time was likely filled with mindless web surfing, shopping, or flash gaming. Social networking & gaming provide a more enjoyable use of this time (while not breaching the constraints), but there's nothing preventing another phenomenon to come and fill this gap.

Going back to the previous scenario, anything that supplants virtual goods and virtual accomplishments with real goods and accomplishments (while not sacrificing the above constraints) is a good candidate, since it will be valued higher by most users. Again, no solutions here, but the gap should be made clear: peoples' free time may be getting sunk into social networking & gaming, but this gap can be filled by something else, as it has been before.

Conclusions


I can't predict the future, but much of what I see around me on the social web screams "transitionary."

The social web devalues privacy and over-values virtual goods and accomplishments. On either of these counts (if not both), time will tell if they are sustainable.

Most major shifts in society are one of over-reaching due to new possibilities, followed by retraction as the consequences of that over-reaching are realized. (As noted by Auren Hoffman here.) I see this dynamic playing out with the personal information being posted to the web. I think web 3.0 will be less about new information flowing to new places and more about controlling (or stopping) the flow of the information that's already out there.

Beyond this, I see the rise of virtual goods & labor as a bizarre anomaly, similar to the CD player that played MP3s on CDR's or the projection TV, a solution that uncovers the problem, to which an appropriate solution will be applied. I think this solution is on the horizon.

Friday, May 22, 2009

Improving Wolfram|Alpha's User Experience

Wolfram|Alpha is a new type of tool: a computational knowledge engine (CKE.) Sadly, while the engine itself is brilliant, it is a struggle to use. Wolfram's team tried hard to make it feel just like a search engine. This has caused lots of confusion and ultimately limits the system's potential. By understanding a few fundamental ways in how W|A differs from a classical search engine (CSE), several UI changes fall out naturally that will greatly improve the experience of using W|A.

The W|A query model

A CSE like Google has a very simple query model. Users think of what they want, and type it into a box. They hit enter or click a button, and a new page loads showing them the results. They can page through these results, or do another search right there on the results page. It's simple, and obviously works.

As many have noticed, this model breaks down when misapplied to a CKE like W|A. A better UI can emerge by exploiting the differences between the two models, instead of ignoring them:
  • Invalid Queries - Of the entire set of possible free-form queries, a CKE is unable to produce a result for a vast majority of them. There are many reasons why a query may be invalid, but for now let's consider all invalid queries the same.
  • Intrinsic Query Value - CKE queries themselves are intrinsically valuable. Unlike CSE queries, they:
    - often compute new information that didn't exist before.
    - can be complex and challenging to construct.
    - can provide instructive guidance for new users.
    - provide a live taxonomy of system capabilities.
  • Compact Results - CKE result pages generally fit on one or two pages and have a compact UI to explore the computed results.
Alleviating Invalid Query Pain

The most common W|A complaint seen is the prevalence of the "Wolfram Alpha isn't sure what to do with your input" page. This has become the BSOD of W|A. Sadly, there is no reason for it to even exist.

Many free-form queries are going to be invalid, particularly for new users. The way to address this is to prevent users from submitting invalid queries. Here's what I'd do for the first two iterations of improving this:
  • When hitting enter, perform an AJAX request that checks if the query is valid. If not, show the "Wolfram doesn't know..." message under the search bar. Otherwise, go through to the results. This will eliminate a large part of the pain.
  • Next, provide a visual indicator icon that can be in three states: red, yellow, green. The indicator is yellow as the user is typing. It is red when the current text in the search box is an invalid query. It is green when the query is valid. AJAX can be used to continually parse the query on the server. When the user hits enter, if the indicator is green then results appear. Otherwise, we fall back on the behavior above.
Note that the AJAX request for query validation does not need to compute a full result - it simply needs to determine if a result will be computable. (This should alleveate performance concerns.)

Exploiting query value

Valid queries are valuable. They compute new information, can be challenging to construct, and can help guide new users. By exploiting the stream of valid queries, the W|A experience can be improved dramatically.

There are countless ways to do so. For example:
  • Above the search bar could be a series of high level categories such as Biology, Physics, and Mathematics. Mousing over these would provide a dense and dynamic view of a large number of queries relevant to that category. Unlike the current documentation, these queries should be chosen by the system based upon the live stream. Specifically, they could be chosen based upon popularity, relevance, and even Digg-like community voting.
  • A "Google-Suggest on steroids" approach. When typing a query, dynamically showing similar relevant queries and perhaps compacted results right below the search bar would be an effective way to quickly explore the engine.
  • Users should be able to name, parameterize, and share their queries. For example, if I create a query that computes the cost of a trip via gas mileage (this example was shared via Twitter a few days ago,) I should be able to share that query in a useful manner.
Exploiting result page structure

The W|A results page is often short and to the point. SEO issues aside, it makes you wonder, why bother with a results page at all?

I'd propose that once a the UI above is in place for exploring and using the engine, it would make sense to simply avoid transitioning between any pages at all. I envision a UI where the user quickly can find a query that works using the techniques above, and by hitting enter, the results are computed and appear on screen instantly via AJAX. Subsequent searches simply replace the previous results, and a query history is maintained to easily navigate between previous results.

Final points


I think if all these changes are implemented, using W|A will be a much smoother and satisfying experience. There would no longer be any place for the user to feel like they've made a mistake, and quickly digging into the capabilities of the system would be more natural and intuitive. The W|A engine is begging for a more intuitive, rich, and fun exploratory interface. The exciting part is that a lot of these things can be implemented by anyone once a search API is available, so I hope we'll be seeing improvements like this soon!

Thursday, April 24, 2008

SimpleDB Full Text Search, or How to Create a Search Engine in 24 Hours

SimpleDB Primer

I've started digging into Amazon SimpleDB, which is one of these newfangled "Hashtable in the cloud" systems that allows you to scale from 1MB to 100GB datasets with relative ease. It has a rather limited data model though, here are some of the highlights of things you can bump your head against:
  • The main paradigm is you store items in domains. As of right now, domains are limited to 10GB and you are limited to 100 domains.
  • Items contain one or more attribute-value pairs. Each attribute can have multiple values, though you are limited to 256 attribute-value pairs per item. Furthermore, a given value for an attribute can be at most 1024 bytes. Within a domain, the rumor is you can store at most 250 million attribute-value pairs, across all items. Ie, if the database were a spreadsheet, you can have at most 250 million cells filled in.
  • Queries let you perform range & equality operators (including 'starts-with') comparing attribute values to constant values. For example ['name' = 'greg'] will match all items with the name attribute equal to the string 'greg'. Queries can have multiple conditions on a given attribute, and can union and intersect sets of results across different attributes. Evaluation is from left to right, though, so you want to put your filter conditions that remove the most items first; you cannot induce precedence through parentheses.
  • Queries return paged result sets of item names. Item names can then be pulled in parallel from SimpleDB to get the full set of attribute-value pairs for each item.
  • All values are compared lexicographically, so you need to pad integer values with zeroes, etc.
  • SimpleDB grants you eventual consistency. This means you can insert an item and not see it for a few seconds. We'll return to this pleasant feature below.
  • You can't sort. Seriously.
Inverted Index on SimpleDB

I'm working on a new web application, and one of its centerpieces is going to be quick, painless text search through reams of data. I have spent the last day or so taking a stab at getting a inverted index working on SimpleDB, with some success. Here's what I want to be able to do at the end of the day:

  • Query for specific terms, with AND and OR boolean conditions. (So, 'this OR that' will match items with either 'this' or 'that' in the index, etc.) AND is implicit between terms.
  • Support prefix matching, so 'app*' will match anything that has a term that beings with "app"
  • Support exact phrase matching with quotes, so '"merry go round"' (with quotes) will only match those terms in that exact order.
  • Support fields, so I can index a document (similar to Lucene) with multiple fields like "content" and "tags" and search within fields. So 'tags:oranges OR tags:apples' will search for items with the field "tags" containing the term oranges or the term apples.
  • Support 'sharding' - I may have additional attributes I want to include in the index to help limit the search before searching for a term. For example, documents may belong to a folder, and I might want to perform a text search *within* a folder.
Terms & Analyzer

To avoid re-inventing the wheel, I use Ferret's DefaultAnalyzer to tokenize my input and filter my terms. I won't dig into it here, go pick up a book on Lucene if you want to learn how to do this.

Index Approaches Considered


Disclaimer: I am a relative newb when it comes to this stuff. I am posting this because I had yet to see anyone discuss this in detail, I am sure a search guru will pwn me, do it in the comments!

The overall goal is to build an inverted index of terms that will let us quickly look up which documents contain which terms. This isn't revolutionary, but the tricky part is coming up with an index format that fits into SimpleDB and gets me all my search features. Here's a few approaches I had tried or thought about initially:

Item per Document /w Splitting

So, we create an item per document. For each term in the document, we add an attribute-value pair with the attribute=term and the value=empty string. When we run out of attribute-value pairs to use for an item (remember, we can only have 256), split it into another item. ("Paging") So, a document may span multiple items, and each item has a bunch of blank values under each of its attributes, which are named after its terms.

To find an item with a set of terms, for example, 'apple', just do a query ['apple' starts-with ''], intersected for multiple term matches. On its surface this approach seems to work but it has some problems.

First, we cannot do prefix matching, since the 'starts-with' operator only applies to the attribute values, not the attribute names. We *could* inject attributes for every substring of each term (ie, "hello" would show up as "h_", "he_", "hel_", up to "hello") but we would run out of tuple space quickly.

To facilitate exact phrase matching, we can put a certain number of N following terms into the index at each attribute, instead of the empty string. This will allow us to match exact phrases up to N terms long. For example, if we pick N = 5, we will store the next 5 terms after each term under the attribute. Since we can have multiple values, we just index the N next terms for every instance of the term. (Splitting as necessary.) Performing an exact phrase match is as simple as doing a "starts with" condition on the attribute for the first term in the phrase.

It's important we add a space to the end of the query and the value in the index though, otherwise the exact phrase match "hello world" would also match "hello worldly". If we store "hello world " (with a space) in the index and do the match with the space, we can be sure we always match the end of the term correctly. Note that we inflate our index in this case, if it's N = 5, our index could be 5 times larger than our source text. This is less of a problem though since we're using SimpleDB, which we're assuming is cheap.

To facilitate fields we can inject the field name into the attribute name. So if the field "tags" has the term "apple" we can put in an attribute "tags__apple" instead of just "tags". It is getting a little scary.

To facilitate sharding, we can stick the extra attributes to shard into the document item(s), no brainer.

This technique suffers from "AND deficiency" though, a major problem, which is outlined later.

Overall score: 6/10 (prefix matching is impossible, field matching is a pain, AND deficiency)

Item per Term /w Paging

In this approach, we create an Item for each term. Again, we split into multiple items as we run out of room in the attribute-value count. The attributes are the fields, so we don't have a ton of them ("content", "tags".) The values of the attributes are the id's of the documents where the term appears for that field. We can get away with this, because remember, attributes can have multiple values.

A big pain with this approach is that when you index a document, you first need to check if an item already exists for each term. If it does, you simply add the document id to the appropriate attribute(s). If not, create a new one (or split, if you've maxed out the current one.) You also need to keep a flag around noting which item for a term is considered the latest.

Having an update or insert approach vs. an insert only approach burns you big time with SimpleDB because of the elephant in the room: eventual consistency. Since SimpleDB makes no guarantees about when new items are going to show up, if you are going to check for an existing item and update it (as is the case here,) you can get burned if that item is not being reflected in the database yet. The way around this is to perform the operations in bulk, using an in-memory cache that is periodically flushed out to SimpleDB. You still run the risk that partially filled items will end up showing up in your index, though, since once you free some cache space if you keep indexing you may not see the new items for a few seconds.

Fields are facilitated as mentioned (they are the attributes), and we can get suffix matching by simply doing the search using starts-with on the term for the item.

Exact phrase matching is not possible without a hack. You can stick a list of the next N terms into an attribute separate from the list of source document id's, but the problem is even if you match you have no idea which of the many documents actually contains the phrase. (You just know at least *one* of them does.) SimpleDB won't guarantee the ordering of the attribute values on an item, since, as a general rule, you can't rely upon sorting or insertion order. (Otherwise the Mth term phrase match would be the Mth document id.) If you hack the id's for the documents onto the end of the terms lexically with a delimiter, however, you can pull them out afterwards. (As I said, an ugly hack.)

Finally, sharding causes a problem. Since there are many documents to an item, to introduce sharding we would have to add our sharding attributes to the item and then only include documents in the item for that item's shard. So, for example, if we have a folders attribute we want to shard against, if we have K folders, we will have up to K items for a given term since we cannot put two documents from different folders into the same item. (Otherwise, we'd get incorrect document id's if we constrained against the folder attribute.) All this complexity spills up into the "update or insert" method, so we can see this is spiraling out of hand.

This approach is also "AND deficient", outlined below.

Overall Score: 4/10 (Insert/Update technique with buffers, sharding is painful)

Item per Term per Document


This approach ends up being the simplest, really. We do not have any multiple-valued attributes. Each term for each document gets an item, so we'll have a lot of items, but at least remain sane. The attribute name is the field we saw the term in. We store the next N terms as the value as above, to support exact phrase matching. We get prefix matching for free with 'starts-with'. Sharding works by just adding attributes to the item, since there is just one document per item we don't get into any of the mess noted above. This is also a "insert only" technique, so we don't have to worry about eventual consistency woes. Huzzah!

But alas, this approach also suffers from the "AND deficiency" problem. It's finally time to explain.

Overall Score: 8/10 (AND Deficiency still stings)

AND Deficiency


To support boolean expressions, the naive approach would seem to just map "union" and "intersection" operators onto the search query, and run with it. For "OR" queries, we can get away with just "union"ing everything together, and get the set of documents. This applies to any of the above index techniques. "AND", however, is a bigger pain.

It's clear that if we have an item per term per document (the last indexing technique listed), "intersection" doesn't perform AND, since by definition there's only one term per item. (Ie, if we wanted to find documents with "apple" and "oranges", ['content'='apple'] intersection ['content'='orange'] always yields the empty set.)

Regardless of how we index, however, as long as there can be more than one item per document, we cannot rely upon the intersection operation to perform AND. Note that in all our index techniques above, we can have more than one item per document, since in the worst case we run out of our 256 attribute-value pairs per item and have to "split" the document into multiple items. Due to the 256 attribute-value pair limitation, it is impossible for us to avoid the reality that documents must span multiple items, and hence, we cannot perform AND strictly from within a SimpleDB query.

The reason the final index approach ends up being simplest, is because it embraces this limitation and lives with it while optimizing the index around all of our other use cases. It was only after banging my head against the wall with these other approaches that I swallowed deeply and came to grips to the fact that AND was just not going to be possible server side.

So, we need to do some merging on our web application server to support AND. To support AND, we simply need to just perform a set based intersection on the server instead of SimpleDB. Of course, this can get hairy if we match a lot of items for each term, but it's something we have to live with. Since we cannot sort the items on SimpleDB either, we can't perform anything similar to a merge join in memory. So, for 'apples AND oranges', we're stuck, and need to load all the id's matching apples, and then page through the result for oranges and discard those who did not show up in both. With simpleDB, there are probably ways to parallelize this in-memory intersection, and you can also be smart in how you order the fetching (ie, if there are many more apples than oranges) but I leave these as an exercise to the reader!

Conclusions

In the end, I was able to get the item per term per document index working correctly. I wrote a crappy query parser that builds a mini-AST and then recursively generates and runs each query, merging results at each node in the tree. The major pending issue is that, as stated, AND nodes result in the merging of result sets in memory on the server. (This also is necessary for OR nodes that have any AND nodes under them.) Fortunately, for my particular use case, I don't see this being much of a problem, and most of the heavy lifting is still going to be taking place on SimpleDB. I don't expect worst case to be any more than a 10000x10000 intersection, which should be pretty speedy. The nice thing about text search is provided you have a good tokenizer you can usually prune down the data set pretty well after the first term.

At the end of the day (literally, this was done over the course of 24 hours), I have a theoretically limitless textual index running out in the magical cloud, complete with shards, fields, suffix matching, and exact phrase matching, with a, albeit crappy, boolean operator implementation that works. Sweet.

Thursday, January 17, 2008

The Pain of Cross-Domain

In my last post, I talked about S3 Caching of RESTful services. The sad part about this approach is that you inevitably will run into cross-domain issues if you build your client to hit S3. So, in this post, I'll run down what techniques you can do to deal with cross-domain issues.

Cross Domain Flash
This is the easiest one. Flash natively supports cross domain communication, but only via a whitelist published on the server hosting the data. So, for example, if your Flash file is hosted at mydomain.com, and it wants to talk to myservice.com, the myservice.com host must publish a crossdomain.xml file that whitelists mydomain.com. Once whitelisted, the communication can take place. Below is an example of a crossdomain.xml file you'd place at the root of myserver.com's server to whitelist Flash widgets loaded at mydomain.com.

<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy
SYSTEM "http://www.macromedia.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
<allow-access-from domain="mydomain.com" />
</cross-domain-policy>

For S3, just throw up a crossdomain.xml file with public read access for the domain you host your Flash from, and you're good to go.

Cross Domain AJAX
For Cross-Domain AJAX (communication via Javascript) you have a few different options. The one most people will point you to is to use a proxy. Specifically, you build a web service that retrieves and serves content from another domain, that you host in your own domain. I don't like this approach because it pushes more load onto your infrastructure, and the main goal behind S3 Caching is to take load off of your servers. (In addition to minimizing points of failure.)

We could dedicate a web server in our domain to just shuffle data between S3 and our clients as a proxy, but in that case, we might as well store our files locally on disk instead of using S3. S3 caching is meant to be used for scenarios where even adding a single machine to your enterprise is cost-prohibitive (for example, if you are bootstrapping), so we shouldn't consider this approach.

The solution I've settled upon is to use a library by a clever fellow named Jim Wilson called SWFHttpRequest. It provides an interface nearly identical to XMLHttpRequest, so it can be relatively easily dropped into existing Javascript frameworks. For example, to use SWFHttpRequest with Prototype, you simply add this code:
Ajax.getTransport = function() {
return Try.these(
function() {return new SWFHttpRequest()},
function() {return new XMLHttpRequest()},
function() {return new ActiveXObject('Msxml2.XMLHTTP')},
function() {return new ActiveXObject('Microsoft.XMLHTTP')}
) || false;
};
I use OpenLaszlo, and here's a bit of code that will get your DHTML Laszlo apps to use it:

if (typeof(LzHTTPLoader) != "undefined") {
// Laszlo
LzHTTPLoader.prototype.open = function (method, url, username, password) {
if (this.req) {
Debug.warn("pending request for id=%s", this.__loaderid);
}

{
try {
this.req = new SWFHttpRequest();
} catch (err) {
this.req = window.XMLHttpRequest? new XMLHttpRequest():
new ActiveXObject("Microsoft.XMLHTTP");
}
}
this.__abort = false;
this.__timeout = false;
this.requesturl = url;
this.requestmethod = method;
}
}

I basically just took the code from the Laszlo source and jacked in a try/catch that attempts to bind to an SWFHttpRequest if it is available.

The downsides? First, and foremost, this requires Flash 9. We're mostly working with rich clients here though, and Flash 9 has 95% penetration, so we'll deal. Additionally, code must be updated such that it will not begin AJAX requests until SWFHttpRequest has been loaded. Basically, you just need to hold off until the global SWFHttpRequest has been defined, and you're good to go. (Of course, you can omit this step if you don't require cross-domain AJAX on start-up.)

Cross-Domain IFrame Communication
The last (and arguably most painful) type of cross-domain communication is between frames. Tassl has several nested IFRAMEs which are used for drawing HTML over the OpenLaszlo Flash app. There are cases where I need to trigger events in the outer frame from the inner frames, which are generally cross-domain. (The inner frames will often have content loaded from S3.)

Well, the way you can do it is by using the fragment identifier. The general technique is outlined by a Dojo developer. In short: the inner frame sets the location of the outer frame to be the same as it was plus a fragment identifier which contains the data to send. The outer frame polls it's own location to check for changes in the fragment identifier. When there is a change, the outer frame assumes it came from the inner frame, and it takes the fragment out as the data.

The naive technique worked until IE7 came along, and then things got messy when you have an iframe within an iframe that need to talk to one another. Microsoft responded with an explanation. I'll re-hash the solution here.

The following set up works in IE7:

-- outer frame (hosted at facebook.com) (Frame Z)
-- app iframe (hosted at secure.tassl.com) (Frame A)
-- app content iframe (hosted at S3) (Frame B)
-- data pipe iframe (hosted at secure.tassl.com) (Frame C)

Note that this issue doesn't occur if there is no "Frame Z".

So, the innermost application iframe (Frame B) must include its own iframe for communication now. (Frame C) As opposed to before, where it could simply pass back a fragment identifier directly to frame A.

To send data from from B to Frame A, Frame B must set the location on Frame C in the same way it did before directly to Frame A. Note that it must do this via window.open(url, nameOfFrameC), instead of trying to read Frame C's location object. By using window.open, you can avoid angering IE's cross-domain security checks.

To receive the data from Frame C, Frame A must have a polling routine that does the following:
  • Checks to see if Frame C exists in the dom. If not, bail out. (Using window.document.frames (in IE) or window.frames (in Firefox/Safari/Opera))
  • Acquire a reference to Frame C using another window.open trick:
    • var iframeC = window.open("", nameOfFrameC);
  • Read and decode iframeC.document.location.hash in the same way as the pre-IE7 technique.
So, by loading Frame C in the same domain as Frame A, and using the two window.open techniques, you are able to communicate from Frame B to Frame A. It's hairy, but it works. Make sure you integration test!

Friday, December 21, 2007

Caching RESTful Services on S3

One of the best ways to speed up a web application is to introduce some form of caching. Fragment caching on app servers, object caching using memcached, and asset caching on S3 are all good tricks to take some of the load off your servers. In this post I'm going to talk about how to cache web services on amazon S3 dynamically, creating an intermediate cache between your web browser and the app servers.

Client-First Development: Revisited

The main point of the previous post was to show how you can do Behavior Driven Development with a RIA (in AJAX, Flex, or OpenLaszlo) by stubbing out a REST service on your filesystem. This works great with Rails, because your files can be served as static fixtures you can quickly mold your REST service's resources to meet the demands for your client. You can write the client before the server, which results in less programming since you "get it right" on the server on the first try.

Well, there's another little trick we can do once you've seen this in action.

S3 as a RESTful Cache

Amazon S3 allows you to throw files up into "buckets" (using a RESTful service, no less) which can then be retrieved at a URL. A file named "houses/123/location.xml" in the bucket "cache.codingthriller.com" can be retrieved at http://cache.codingthriller.com/houses/123/location.xml, if you just configure your DNS to point cache.codingthriller.com to s3.amazonaws.com.

Are you thinking what I'm thinking? That's right, we can "stub" out our RESTful service on S3 just like we did on the file system, since the pattern is exactly the same. The difference is this time instead of it being fixtures for testing, S3 will actually store a cached version of our data. This can be a *huge* win, since all requests to S3 will skip over our server farm altogether, going to the "infinitely" scalable Amazon data center!

How the RESTful Cache Works

Having a rich client in AJAX or Flash is key for this to work smoothly. If you did client-first development, you already have code in place that makes it possible for you to "point" your client to a REST service. When building it, you pointed it to the static fixture files. When the server is running, you point it to the server to get real, dynamic data.

Since you've abstracted it, it's not too much of a leap to just point it to your S3 cache. If you designed your resources correctly, there shouldn't be any major problems with getting the same data from S3 vs. the application servers.

So, for every request to the REST service in your RIA, you should break it up into two steps. Say we are looking for "/houses/123/location.xml":
  • Check http://cache.codingthriller.com/houses/123/location.xml. If we get it, great.
  • If we 404, go to http://codingthriller.com/houses/123/location.xml. (This is our real server.)
That first request is generally pretty snappy, since it's hitting Amazon, and if we managed to get that data, we've avoided a full request to our Rails app. Money!

Pushing it out

Of course, this seems great and all, but how does the data get to S3? It's probably not the best idea to block the request until the data is pushed, so I've built an off-line service that goes through and pushes the data using a simple queue.

I created a model class called S3Push that stores the uri and the data. When a request comes in that I want to cache, I have an after_filter that pulls the body from the response and stores it into the database. Then I have a simple service that just goes through these rows and pushes them over to S3 into the right bucket.

There's a lot of details involved, but the main point is that once a request comes in, it will eventually propagate to a identical URL on S3!

Invalidation

Of course, it's important to invalidate this cache -- you don't want to serve up stale data. There are two types of invalidation: direct invalidation and timeout invalidation.

On the server side, if a request is made that affects resources in the S3 cache, you can just submit a DELETE request to S3 to remove the data. For example, if the location of a house changes via an HTTP PUT to the resource, you just can DELETE the resource from S3. Once another request comes in for that resource, it will enqueue it to be re-cached.

Timeout invalidation is a bit trickier.

Timeout Invalidation

If you are able to always invalidate the cache correctly, then you are done. However, sometimes you can't always be sure when something is invalidated. (For example, data you pulled from an external resource may be changed without you being notified.)

One way of taking care of this problem is to have the server periodically remove data from S3 that it thinks might be stale, using a cron job or background service. This is a perfectly legitimate way to do things.

However, this could introduce a 'single point of failure' in your web server farm. If you have one machine periodically cleaning out the S3 cache, if it dies, stale data could be "stuck" indefinitely. This is a different problem than having your "push to S3" service die, since in that case you simply lose the performance benefits of the cache. Painful for your servers, yes, but probably not a show-stopper.

Client-Based Invalidation

So, the approach I took was a client-centric one. While the server still has the final say when something is DELETE'd from S3, I try to take advantage of the rich clients I have running within all my users' browsers.

For this to work, the algorithm has to change a bit. (This change becomes useful later, when we introduce security!) For resource "/houses/123/location.xml", we now:
  • Check for "http://cache.codingthriller.com/houses/123/location.xml.cinf"
    • This is our Cache Metadata, stored on S3
  • If we found it, check the metadata for expiration:
    • If the cache entry has expired
      • Send DELETE => http://www.codingthriller.com/cinf/houses/123/location.xml.cinf
        • Causes a DELETE from S3 if it's really expired!
      • Do GET => http://www.codingthriller.com/houses/123/location.xml
        • Causes a push to the cache!
    • If not, look in the metadata for the URI of the cached data, (say, ba48f927.xml)
      • Do GET => http://cache.codingthriller.com/cache/07-02-2008/ba48f927.xml
        • Is a cached hit! We never talked to our real servers.
  • If there is no cached metadata:
    • Do GET => http://www.codingthriller.com/houses/123/location.xml
      • Causes a push to the cache!
So, the algorithm has gotten a bit more complex on the client side. We now have an intermediate "cinf file" that stores cache metadata: the time the data expires, as well as a SHA hash key.

If the cache has expired we submit a DELETE to the real server under the /cinf resource, which will then perform an S3 DELETE if the item is truly expired. Note that now, we can invalidate the cached item just by DELETE'ing the .cinf from S3, since clients will cause a re-push if there is no metadata. We then do the GET to the real server. If it didn't expire we use the SHA hash key and go to a URI under /cache to grab the data at that key.

So in the end, you implement this increased complexity on the client side and also need to add a cinf_controller that will delete the item from S3. What's nice is in AJAX or Flash you can perform the DELETE asynchronously, so your client does not feel any of the pain of waiting for the server to do the DELETE on S3.

Also, the "push to S3" server needs to be updated to generate the .cinf file and push it to S3 in addition to pushing the data. In that .cinf file you just include a unique hash and a timestamp. It can be useful to store the cached data in folders named by day, so you can quickly wipe out old data, as well. (As seen above in /cache/07-02-2008/ba48f927.xml)

Secure Resources

It's often the case that certain resource are only accessible by certain users. Your controllers might return a 'Access Denied' response, for example, for "/houses/123/location.xml" unless the requestor has logged in as the owner of the house. We can keep this security enforcement in our S3 cache, as well.

As noted above, the cached data now resides in a SHA hash keyed resource. This SHA hash can effectively serve as a "password" for the resource. A decent way to generate this hash is to salt the URI of the resource being cached. So, we'd hash "/houses/123/location.xml" + some random string, and that is where the data would get stored. This hash is included in the .cinf metadata file, so the client knows where to go get it.

But, we can do something better. If we split data accessibility into "security roles", and assign a unique key to each role, we can secure these cached resources. When the client starts up, you pass in information for the security roles of the logged in user. For example, an administrator role might have the key "abc123", which will be given to all clients logged in as administrators. It's important that these keys be transmitted over https, and not be persisted in cookies!

Now, when it comes time to push the data to S3, instead of pushing it to the hash, we push multiple copies of it, one for each security role. For example, if administrators can see this resource, we take the original hash we were going to store the data at, and salt it with the key for the administrator role. It now becomes impossible for a client which does not know this key to find the data on S3!

So, our metadata now includes three pieces of information*:
  • The expiration time
  • The SHA hash key for the data
  • The names of the roles which can see the data
And, when the client starts up, it receives (over HTTPS) all the security roles the user is a member of and their corresponding keys. Once the client sees a role it recognized in the metadata, it can then salt the SHA hash key and re-SHA it with that role key, and it is guaranteed to find the data at that key. It's also important that security role keys are regularly expired.

I'm no cryptologist, so I'd love to hear any feedback on how this technique can be exploited!

Conclusion

First we started with a simple S3 cache that pushed our RESTful service data out to S3 in the background. The client was updated to check the cache first, and then fall back on the server. Then, for invalidation, we introduced a metadata .cinf resource that the client checks first to ensure the data is not stale (and, importantly, tells the server when it sees expired data.) Finally, by storing the data at a salted hash referred in the .cinf file, and re-salting with security keys, we introduced role based security to our S3 cache that made it possible to cache privileged resources.

In the end, once implemented this caching technique can be almost entirely transparent. In my controllers, I simply pepper my methods with the correct invalidation calls, and can mark certain actions as being S3 cacheable. The back-end implementation and client take care of the rest. It's always important to integration test things like this to make sure your invalidation calls actually work!

I have yet to scale up with this solution, but my initial tests show that many, many RESTful service calls for my own application will be routed to S3 instead of my EC2 instances, a big win!

*I'd imagine there are slightly better ways to implement the cache metadata using HTTP headers -- unfortunately I cannot access HTTP headers in OpenLaszlo, so I went with a full HTTP body based approach here.

Thursday, December 6, 2007

Client-First Development with REST

AJAX = Client-Server
AJAX is all the rage these days, but web applications written in AJAX are really nothing more than a version of the client/server pattern. Unlike traditional ("Web 1.0") webapp development, developers now have to be more conscious about the distinct roles of the AJAX client and the HTTP web server. Web servers now often take on the role of just serving static content and raw data for the rich client to push onto the user interface.

So, developers constantly switch back and forth between client and server code. These two separate codebases are often written in different languages and have their own engineering constraints. Both require extensive testing, documentation, and refactoring.

YAGNI
A core principle of agile development is that "You aren't Gonna Need It" (YAGNI). That is, writing software that has no immediate purpose but "may" be useful soon is not worth writing. Code should only be written for a clear and imminent need.

This is a powerful principal, and can be applied to many aspects of development beyond coding. Instead of switching tools or platforms, for example, it's important to ask yourself if there is a clear immediate need!

YAGNI and Mock Objects
A increasingly popular technique for doing Test-Driven Development involves Mock Objects. Summarized, by developing software from the "top-down", mocking out the innards of yet-to-be-written classes, you often will write working code with less bugs. Really, this is yet another consequence of remembering YAGNI! By mocking out dependent parts of the system that have not yet been implemented, you end up with just what you *actually* need instead of extra things you *think* you may have needed.

This is a really condensed explanation, I'd urge you to read more about this via the Endo-Testing Paper and the BDD wiki.

YAGNI meets AJAX
When you apply the same thinking of YAGNI and Mock Objects to the client-server model of AJAX, you realize there's a very big thing that might become a Mock Object: the HTTP server!

"You're not going to need the server?" you say? Well, that's not totally true, but we shouldn't be forced to write nearly as much server code as we do.

An AJAX (or Flex, Silverlight, or OpenLaszlo) app needs to have a server running behind the scenes to pull data from. This can muck up iterative development; for large features we have to work from the bottom-up: modify the database schema, update the model tier, update the controller tier, update the views. This is a lot of time spent just to get to the point where we can try to "see" the new feature in the client.

Much like Mock Object testing lets us mock away objects that don't yet exist, what if we could mock the non-existent server while we write the client? We could then iteratively build and test client features without having to dive into server code. This is the same motivation for Mock Objects, and follows from YAGNI -- why bother writing code for the server before we are sure we need it?

Making a Mock of the Server
So, how do you mock a web server? You could, for example, write a little ruby program that responds to HTTP requests with fixture data. But, how do you translate the calls in your client to go to the fixture data? What if there are complex query parameters? Complexity creeps in; building a mock web server at first glance seems to be not worth the effort.

Well, we've got one more trick up our sleeve: REST.

What is REST?
REST is a style of building web services. I'm not going to explain it in depth here (read this book!) but I will mention what aspects of it are important to understand here.

The first element of a RESTful service that is important is that URLs are truly stateless. The service can not rely upon cookies, server state, or even authentication information to determine the content of a URL. Each URL corresponds to one or more representations of a resource. A resource could be "the list of names invited to a party" and the representation could be "the XML format needed for import into Outlook."

In REST, no matter who looks at the URL or what they loaded beforehand, the data at a URL doesn't change until the resource does. For example, /parties/123/invites.xml would have the same content regardless of who or when it is accessed, until the invitations themselves change.

Query parameter variables become rare as you build a RESTful service. URLs become richer and structured around the resources your service is responsible for.

Secondly, REST limits the number of operations you can perform on a resource to GET/PUT/POST/DELETE (and occasionally HEAD.) Instead of introducing custom operations on a resource beyond these, the service should instead evolve to expose new resources. For example, if we want a way to invite a user, a non-RESTFUL design would have a HTTP POST include a variable with "command=invite". A RESTful approach would expose an "invitations" resource, which someone could POST to in order to create an invitation for a user.

There are lots of reasons you should design RESTfully, but I am going to now finally explain how it can help building client-server web applications.

Client-First Development
Now for the main observation of this post:

If you commit to designing a web service RESTfully, you can stub out the service directly on the file system. Your client can then access these static files exactly the same as the real service, until it is built.

Web servers, out of the box, will expose the file system as RESTful URLs. Now, ignoring POST/PUT/DELETE, it should be clear that you can stub out data for a truly RESTful service directly on the file system.

When you put this all together, you get what I am calling Client-First Development. By committing to REST, and building a RIA, your development process can change into the following cycle:
  • Decide upon a new feature/bug fix.
  • Add or update fixture files for the REST service on disk.
  • Update and unit test the client to use the new features of the REST service.
  • Iterate, refining fixture data (the mocked representation and resources) and the client.
  • Once happy with the client, write the server code & unit tests.
  • Write an integration test with the working server and client.
For most servers you won't need to do anything to switch from static files to dynamic data. In Rails, for example, if a controller method does not exist, it will fall back on the static file system. Once you implement the controller, it will override your static files.

In practice, it's helpful to have two local hostnames or a subdirectory called "static" that will always serve up the static data so you can freely use the client with both. When the client starts up it simply needs to be told where to get the data, the URL might point to a 'real' web service or simply the root of a static directory on the server.

Tips and Tricks

I've found this to be an incredibly useful way to develop RIAs in a YAGNI fashion. By mocking out parts of my RESTful service on the file system as I build the client I am able to prototype the client without writing server code. I write the server code once I am happy with my client code and fixture data's representation of my resources.

One final question is how do you test non-GET operations using this technique? I simply log these operations for manual review for correctness. (Don't forget, integration testing is still necessary.) The side effects of a POST/PUT/DELETE are determined by the server, so the client can usually be built and tested without relying upon them. This works for me, but YMMV.

In conclusion, I think this synergy between REST and RIA is one that can greatly decrease the amount of cruft added to web services, by forcing you to build only what you need: if the client doesn't need it, then YAGNI!

Sunday, November 18, 2007

Rediscovering Hungarian Notation

Hungarian notation is probably the most undervalued, misunderstood concept in modern programming history. It's time to forget everything you know about Hungarian and seriously consider using it in your Ruby code (or any dynamic language, for that matter.)

When you write code, you're systematically writing a structured representation of an executable process. Loops, methods, classes, all these things have a relatively strict syntax that you must fit your problem into. By forcing you into a firm set of syntax and semantics (namely, a programming language), you are able to unambiguously communicate to three parties effectively: the computer, other programmers, and even yourself, a few months from now.

In rails, there is much talk of "convention over configuration." All this is saying is that having assumed syntax and semantics is more desirable than having to be explicit in describing these semantics each time you use them. For example, table names are pluralized versions of model class names. This translates onto programming languages themselves: what if you had to describe to the Ruby interpreter what a class "was" every time you declared one? It's in fact much nicer to simply assume the "convention" of having the keyword 'class' come packaged with a set of syntax and semantics. Everyone runs with this convention and everyone wins.

However, in nearly all languages there is one element of their syntax and semantics that falls short of providing you with the power of structure and "convention": naming. Nearly everyone has an opinion on naming. This includes naming local variables, instance variables, methods, classes, files, tables, and so on.

Why Human Readable Names Are Evil

The common viewpoint on naming, largely inspired as a backlash from Hungarian in my opinion, is that names should be "human readable." For example, you might have a class called Book. A book has a collection of Pages. And so on.

There are many, many problems with this approach. First and foremost, having these nice friendly names leads to one of the biggest poisons in programming: unfounded assumptions. How many times have you spent hours fighting a bug, finally squashing it by realizing one of your many assumptions turned out to be blatantly false? These can range anywhere from the low level ("x is never null") to the ridiculously and sometimes shockingly counterintuitive ("it doesn't matter if the laptop is on AC power or battery power.")

A class name like "Book" plays to our many experiences with anything called "Book" -- from the real tangible thing you find in the library to other abstractions in code for Books. Most of the assumptions you'll have about class Book will be wrong, and the few things you might get right (for example, this class Book might have a collection of Pages) almost never outweighs the confusion, bugs, and mental pain you receive from your many more false assumptions.

Nevermind talking out loud about these things. If you have a class named "Email", I dare you to try to get into a conversation with 5 other software engineers about the project that isn't riddled with qualifiers to say which type of Email you're talking about. (The "Email" class, an "Email" in your Outlook, an "Email" sent to the server, an "Email" row in the database, an "Email" address, etc.)

The sum of all these assumptions from 'human readable' names is that programmers grossly overestimate their understanding of bodies of code. You check out the trunk of a SVN project from Rubyforge, skim through it with all its nice names, and somehow you feel like you understand it. You know why? Not because you understand it, but because it was 'human readable.' Your familiarity with the names used led your mind to peaceful tranquility. I dare you try to fix a bug or add a feature, and you will find yourself in a quagmire of undoing all your assumptions you had about what the code was doing in the first place.

There is another problem with "human readable" names beyond injecting false assumptions into your brain. You run out of steam quickly when you try to tack on more and more English names into programming abstractions. You might soon find that you have 12 different types of books, each of which can be in 3 different types of lists, some of which are in memory and some on disk. What is the end result of this mismatch between English and programming? A variable named ListOfBooksFromPreferredLibraryInDatabaseOnLocalMachine! I don't care who you are, names like this are not "human readable" even though they are readable by a human.

Enter Hungarian Notation

Ok, so clearly we have a problem here. Programmers are never taught a real way to name things correctly, there is no systematic approach, and the 'common knowledge' for naming results in train-wrecks and false assumptions. This problem is compounded 100 fold when you introduce a dynamic language like Ruby, where you don't even have a compiler or smart IDE to help you along the way to decode what a name is *really* pointing to. How many times do you find yourself, when reading *someone else's* Ruby code, asking "so, is this variable storing a count of things in a list, or the list itself, or a map of keys to a set of lists?"

Let's be engineers for a second, and come up with a list of goals for the ultimate way to name abstractions. I'm going to take it straight from the horse's mouth, feel free to disagree in the comments:

When confronted with the need for a new name in a program, a good programmer will generally consider the following factors to reach a decision:
  1. Mnemonic value—so that the programmer can remember the name.
  2. Suggestive value—so that others can read the code.
  3. "Consistency"—this is often viewed as an aesthetic idea, yet it also has to do with the information efficiency of the program text. Roughly speaking, we want similar names for similar quantities.
  4. Speed of the decision—we cannot spend too much time pondering the name of a single quantity, nor is there time for typing and editing extremely long variable names.
Despite what you probably have read before, reaching the best-case-scenario for these goals is the point of Hungarian notation. Ruby already starts down this path. For example, methods which return a boolean generally end in ?, and instance variables start with an @. These are consistent, suggestive idioms which are quick to think of, quick to type, quick to read, and quick to grok. Let's take it to the next level though, shall we?

What Is Hungarian Notation? (Or, how to name a duck.)

You can read the original paper about Hungarian if you want a lot of the details, but I am going to describe it here in the simplest way possible. I'm also going to put my own little spin on it to best suit Rubyists.

Now, we've all heard of Duck Typing. You know, "Looks like a duck, walks like a duck" and so on. But really, what ducks are we talking about? If you call a method .to_s on a instance, that instance can be any object that responds_to? :to_s. This is our Duck, the "responds to :to_s" duck. That's usually where the discussion ends, though. Let's start naming our ducks!

Let's not fall into the trap of naming it something long and 'human readable' though, like DuckThatCanBecomeAString. Let's just give it a 3 or 4 letter 'tag'. How about ccs? Why ccs? Well, it's doesn't matter really. "But Greg, what the hell does ccs mean?!" I hear you screaming. Well, I already told you, it means "this instance responds to the method to_s!" I know, I know, you don't want to know what it means, you want to know what it stands for. Well, does it really matter? It might stand for something, but if you know the tag and what it is encoding for, it doesn't matter what it stands for.* Put succinctly, Hungarian tags are terse means of encoding programmer intent in a name, in the case of Ruby, it names our duck.

Every time you find yourself naming some variable, ask yourself, "what duck is this?" If it's a new kind of duck, come up with a new tag. If it's not, reuse one of your existing tags. It's encouraged to chain tags together! This is where the power of Hungarian starts kicking in: with short, terse, but meaningful tags, you are able to encode many, many times more information in your variable names. Once your brain is wired this way, you will quickly realize that the variable "mp_sid_rg_ccs" is a hash which maps stringified database id's to a list of objects which are known to respond to :to_s. I'd hate to even guess what you would have named this variable beforehand!

Kind vs. Type, and the Hungarian Dictionary

One of the biggest impediments to Hungarian in the past has been the disconnect in programmers mind between "kind" and "type." It's ironic that the breaking down of English semantics (one of the very reasons Hungarian was invented) led to its demise. Hungarian tags are not meant to encode "type", which in many languages like C# and Java is a first level construct that has a very specific meaning. Joel has a long explanation about this, if you're interested. No matter. In languages like Ruby, we don't really rely upon types so much as we rely upon Ducks, and at their core the concepts you already understand about Duck typing apply directly to the reason Hungarian tags were conceived.

In addition to your code and your tests/specs, your 'Hungarian dictionary' becomes a first class code artifact. This is usually a simple text file which lists all your tags and the meaning behind them. Here's a snippet from one of my own:

rg_ - List of
trg - Table-of
sid_X - School specified id for entity
fbid - Facebook id
lcode - Log code, used to easily grep log files
url - Stringified URL
uri - Real URI object
pth - Path to directory
fpth - Path to file
fn - lambda/function pointer
mil - Time in milliseconds
linf - list of info regarding something, ex, linfLrg could be [sid_cor, sid_sked, sid_lrg]
minf - map of info regarding something, ex, minf_lrg could be {:sid_cor => sid_cor}
col - column name

Armed with the Hungarian dictionary, every name in your code becomes a rich meaningful entity, instead of something a programmer pulled out of thin air. Often times, with a good enough set of Hungarian tags, naming a variable is a mindless exercise requiring almost no thought. Have a list of paths to files? rg_fpth.

Hungarian also allows for the inclusion of a "qualifier", which is a descriptive name that is usually a one off way to differentiate between variables with common tags. Specifically, say you have two lists of files. Both are going to be named rg_fpth. One list, however is of closed files, so for that you may name one rg_fpth_closed. "But wait, doesn't this then fall back on 'human readable' naming?", you say. Indeed, it does, but as stated above, these qualifiers should be used as one-offs. Often times you will find yourself re-using qualifiers, in which case you probably want to then refactor the qualifiers into their own unique tag. So, for paths to closed files, we might introduce fcpth, and then our variable above becomes rg_fcpth instead of rg_fpth_closed. Either way, we have more consistent and easy to grok names.

More than just Ducks

Ok, I lied a little bit. Hungarian tags can encode more than just Ducks. They can encode whatever the heck you want, as long as it's something that you will re-use and something that can be clearly documented in the Hungarian dictionary.

Ducks are limited: they are based upon what methods an instance responds to. We have many different uses for something that looks and acts like a string. You might have strings that contain HTML, strings that are newline terminated, strings that are the header of a file, and strings that contain comma seperated values. Same duck, different intent. If these concepts are important in your algorithm, and are unambigous, by all means, encode this intent as a Hungarian tag. The sheer power you get from adding a Hungarian tag called html that encodes "this is a string that contains html" is eye-opening, since you can quickly see in your code which strings have markup and which strings do not.

Why Prefixes?

Most of this discussion so far has been about naming variables. Before getting to the other stuff, I'd like to talk briefly about why it's called Hungarian notation.

Well, of course, Charles Simonyi is Hungarian, but that's not really why. You'll notice that the names in Hungarian rely upon a intent coding tag prefix. The key is the fact it's a prefix. Hungarian, like many other spoken languages, puts the noun before the adjective. Unlike English, it's "book blue" instead of "blue book".

I think you'll find that in programming, thinking about things this way makes a whole lot more sense. It also allows much easier navigation in your text editor. There is a nice sense of symmetry (or, if you're a new age Rubyist, joy and beauty) when you start names based upon what they are. Let's take a non-Hungarianized piece of code and just re-write it to use prefix based naming:

first_book_index = 5
last_book_index = 10

first_book_index.upto(last_book_index).each do |my_current_index|
open_books_list[my_current_index].close
end

It might be hard for the unconverted to see, but my Hungarianized brain has a hard time parsing these names because I am used to only having to scan the beginning of the name to know what kind of item the name points to. Really we have two kinds of ducks here, we have indexes, and we have list of open books. Reading the words "first, last, my_current, open_books" is just too confusing:

index_of_first_book = 5
index_of_last_book = 10

index_of_first_book.upto(index_of_last_book).each do |index|
list_books_open[index].close
end

This reads much nicer, especially if you've spoken a language other than English where you can naturally parse out the 'kind' information at the beginning of the name. Now, let's introduce some tags: 'i' will mean index into a list, 'rg' will be list, 'bk' will be our new name for class Book, and 'opbk' will be an open book.

i_first = 5
i_last = 10

i_first.upto(i_last).each do |i|
rg_opbk[i].close
end
Yes, it might look like garbage to a person reading the code for the first time. But once their brain naturally operates on the Hungarian wavelength, this code is extraordinarily easy to read and as a bonus will be the 'one true way' to write this short algorithm, since the names are canonical.

Naming Other Abstractions

Naming classes is easy: just make each class a Hungarian tag. Naming methods is a little different. The returned value of a method should have it's Hungarian tag at the start of the method. Often times methods simply convert one set of objects to another, all of which can be described as a Hungarian tag. (You'll notice this is really the case 90% of the time, once you've Hungarianized your code.) What used to be OpenBooksForGuy, or maybe GetListOfOpenBooksFromFirstNameAndLastName becomes simply rg_obk_from_pnst (where rg, obk are defined as above, and pnst is considered to be the pair of strings first name and last name.)

By putting the return kind as the start of the method name, you can quickly grok the gist of what your methods do by the names: usually they are just mapping a set of tags to another tag. Really, this is the essence of a function anyway, isn't it?

Final Thoughts

No doubt, the concept of Hungarian is a controversial one. It is plagued by three things:
a bad reputation, unintuitiveness, and the fact it is 'all-or-nothing'. Hungarianized code is not code you can just pick up and start understanding on its own. You need the Hungarian dictionary. However, once you have learned the Hungarian tags, the code becomes many times richer and naming variables becomes an exercise devoid of creativity: much like when you type the keyword 'class' or use block syntax. Code becomes more consistent, more patterns emerge, and the amount of complexity your mind can handle goes up severalfold.

However, in my experience, you have to go all or nothing with Hungarian. If you use it occasionally, all of the benefits are sucked out, since you ultimately fall back on the least common denominator of naming styles in your code. The power comes with 100% consistency using the conventions. Hungarian turns the opinion of a 'bad variable name' into an objective 'incorrect variable name'! With this comes power: it becomes possible for a code reviewer to find 'errors' in variable names. This is a good thing, since a certain algorithm will have a smaller set of correct implementations, meaning there is less ambiguity for those who are reading the code for the first time.

Once you start using it, you'll find communication within your team about things to be much more precise. Sure, it will sound like Greek to an outsider when you say "Well, shouldn't the R-S-T be attached to the end of the R-G-C-S-T not the R-G-SID-C-S-T?" but everyone will know exactly what you mean, and you will get a real answer to your question instead of a question *about* your question in return! Truth be told, what's happened is through using Hungarian, you have naturally evolved a Domain Specific Language around your code. And like any worthwhile language, there is a learning curve.

If a Hungarian movement takes hold in a dynamic language community like Ruby, no doubt many canonical Hungarian tags will take hold. The more tags which are considered canonical, the better. In the original Simonyi paper there are many such tags outlined, however they are heavily biased towards legacy languages like C. Ruby is a beautiful language full of many powerful conventions, by moving the power of convention into naming, there will be yet another huge boost in programmer productivity!

* - Often it's useful to have a 'pseudo-meaningful' tag. Once memorized, your brain will just think of it as a 'ccs', but when you are first learning the Hungarian dictionary, being able to bootstrap it in as "[C]an be [C]onverted to [S]tring" definitely helps to speed up memorization.