Friday, January 20, 2012

To Be (HA) or Not to Be

Or, what does it really mean to be highly available in the cloud


Good IT practice tries to maximize SLA conformance, especially around availability. Lessons learned from a disk failure in the Exchange server, the mail outage that followed, and the inevitable fire drill are deeply embedded in our minds. REDUNDANCY EVERYWHERE. Power supplies, network connections, disks - if you can put 2 of them suckers in there, you do. Just to keep that machine running. That machine should never fail.

The web has mitigated things somewhat. Rather than relying on hardware redundancy (where you don't use half your equipment), deployment strategies have evolved. A large pool of web servers can sustain its SLAs even when some servers fail, by using load balancers to direct traffic only to live web servers. This scheme brings with it worries about the availability of session state and other shared information (e.g. the database), but nonetheless it's progress. Since hardware is now allowed to fail, software developers came up with schemes to work around the failures. Distributed clustered session stores, MySQL clusters or just replicas gained lots of traction (circa 2000). Shared Nothing became a new mantra.
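The idea of directing traffic only to live servers can be sketched in a few lines of Python. This is a minimal illustration, not how any particular load balancer works: the `LoadBalancer` class, its method names and the server names are all made up for the example, and a real balancer would typically discover failures through periodic health checks rather than explicit `mark_down` calls.

```python
class LoadBalancer:
    """Round-robin balancer that skips servers marked as down (hypothetical sketch)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)  # every server starts out live
        self._next = 0                    # index of the next server to try

    def mark_down(self, server):
        """Stop sending traffic to a failed server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Resume sending traffic to a recovered server."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next live server in rotation, skipping dead ones."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no live servers left in the pool")


if __name__ == "__main__":
    lb = LoadBalancer(["web1", "web2", "web3"])
    lb.mark_down("web2")                  # simulate a failed machine
    print([lb.pick() for _ in range(4)])  # web2 never appears
```

Note what the sketch does not solve: if a user's session lived only on `web2`, it is gone, which is exactly the session-state worry mentioned above.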

The Shared Nothing revolution got into full swing and was formalized in various best-practice architectures that span the whole application stack, not just the web-server front end. These architectures rely on distributing both load and risk of failure; rather than a single big, expensive server, many small, cheap, coordinated ones are used. If more capacity is required, more (small & cheap) servers are added to match the load. If one machine fails, the load is redistributed among the survivors. If data is persisted, it's never on just one node; it's replicated to a redundant one.
These principles obviously add various complexities (e.g. the CAP Theorem, which succinctly captures the available trade-offs: Consistency, Availability or Partition tolerance - you can have any 2, but not all 3, in any solution). But they provide benefits too, as described below.

Enter the cloud.
If your application has followed the architecture evolution curve, the cloud is your friend. You can scale out as load increases and, obviously, pay for just the capacity you need. Amazon goes so far as to provide guides (pdf) on how to optimize both your architecture and your cost.

But what if your application is still in the stone age? What if your application is designed to run on a single server, but you still want to use the cloud?

  • If you need more capacity, you need to resize your server to the next size. Based on published pricing, every step up is pretty painful: $0.50, $1.00, $2.00 per hour, and so on. If your app scaled out instead, you'd go from $1.00 to $1.50 rather than $2.00.
  • If your provider decides to reboot your instance, you'll be scrambling to stand up another server somewhere instances aren't being rebooted (and you probably didn't really build deployment automation, did you?) and then take care of the plumbing (move IPs or update DNS and all that fun). With an evolved architecture, you'd care only that the instances serving the same function aren't all restarted at the same time. Your auto-scaling infrastructure could potentially just make magic happen.
  • That availability figure (99.95% for Amazon) could actually be put into practice, and you hit that 0.05% chance. Those 22 minutes a month, or roughly four and a half hours a year, hit, and your server goes poof... together with your app. The refrain is probably familiar by now, so I won't repeat it a third time.
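The arithmetic behind the bullets above can be sketched in Python. This is a rough back-of-the-envelope calculation under stated assumptions: the function names are made up for the example, the month is an average 730-hour month, each larger instance size is assumed to cost double the previous one, and scaling out is assumed to add one $0.50 instance at a time.

```python
HOURS_PER_YEAR = 365 * 24               # 8760, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730, an average month


def allowed_downtime_hours(availability_pct, period_hours):
    """Hours of downtime that still satisfy the availability figure."""
    return (100.0 - availability_pct) / 100.0 * period_hours


def scale_up_rate(current_rate):
    """Assumption: each larger instance size costs double the previous one."""
    return current_rate * 2


def scale_out_rate(current_rate, small_instance_rate=0.50):
    """Assumption: extra capacity is added one small instance at a time."""
    return current_rate + small_instance_rate


if __name__ == "__main__":
    per_year = allowed_downtime_hours(99.95, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(99.95, HOURS_PER_MONTH)
    print(f"99.95% allows about {per_year:.2f} hours/year, "
          f"or {per_month * 60:.0f} minutes/month")
    print(f"scale up:  $1.00 -> ${scale_up_rate(1.00):.2f} per hour")
    print(f"scale out: $1.00 -> ${scale_out_rate(1.00):.2f} per hour")
```

On a single server, every one of those allowed minutes is an outage for the whole app; with a pool of instances, it is only an outage if they all go down together.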
While these are obviously risks present in your own data center, not just in the cloud, they're out of your control in the cloud.
The takeaway is probably pretty clear, but I like to be explicit. To be happy and prosperous in the cloud, you have to evolve, and forget about your traditional notions of HA.