Saturday, March 16, 2013

Your Customer's pain is not always yours

Or, the one and the many

The inspiration for this post was a discussion with Q&A folks about how Crowbar should behave when failures are encountered while configuring the storage subsystem on a node. Well, that and a binge of reading about and listening to folks talking about Lean Startups and the importance of solving real customer issues.

The Q&A engineer was adamant that on a server with 24 drives, Crowbar should simply ignore a single failed drive and use the other 23. For the use case he was trying to solve, this might make sense. He had limited resources (only a handful of servers), and needed to quickly turn up a cluster. The fact that Crowbar flagged the server with the bad disk as having a problem, and refused to use it, was nothing but an annoyance to him.

Crowbar was designed to enable DevOps operations at very large scale. In a recent customer install (more about it in another post, I hope), the customer purchased 5 racks of servers, rather than 5 servers, among them 20 servers dedicated to storage. Each of those servers has 40 separate disks attached to it (it's kinda cool hardware, check out the C8000XD while the link works).
The calculus that applies to 5 servers does not apply to 5 racks of servers.

Just imagine this scenario - you just spent the last 2 hours turning this mass of bare-metal servers into a functioning OpenStack Swift cluster (yeah, you can do that in about 2 hours with Crowbar). Then you go and inspect the cluster, and discover that rather than having the 20x40=800 disks... you're missing 2 or 3. Now go find them, and figure out what the heck happened. That is a real pain.
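To make the arithmetic concrete, here is a minimal sketch of the hunt for the missing disks. It is not Crowbar code: the node names, the inventory dictionary, and the `nodes_missing_disks` helper are all hypothetical, and the only numbers taken from the scenario above are 20 storage servers with 40 disks each.

```python
from typing import Dict

EXPECTED_DISKS_PER_NODE = 40  # from the scenario: 40 disks per storage server


def nodes_missing_disks(inventory: Dict[str, int],
                        expected: int = EXPECTED_DISKS_PER_NODE) -> Dict[str, int]:
    """Return {node name: number of missing disks} for nodes short of `expected`."""
    return {node: expected - found
            for node, found in inventory.items()
            if found < expected}


# Hypothetical inventory: 20 storage nodes, two of them came up short.
inventory = {f"storage-{i:02d}": 40 for i in range(1, 21)}
inventory["storage-07"] = 39
inventory["storage-15"] = 38

total_expected = len(inventory) * EXPECTED_DISKS_PER_NODE
total_found = sum(inventory.values())
short_nodes = nodes_missing_disks(inventory)

print(f"expected {total_expected} disks, found {total_found}")
print(short_nodes)
```

With a per-node inventory in hand this is trivial. The pain comes when all you have is the cluster-wide total, and the nodes that silently dropped a disk look just as healthy as the rest.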

The pain that customers experience should not be materially different from the pain the Q&A "customer" experiences in his scenario.

The design of Crowbar is intended to address the real customer pain.

In large installations, what matters most is delivering the desired performance at the desired TCO (O=operation, not necessarily Ownership, but more on that some other day). In an environment with 10s or 100s or 1000s of nodes, a partial node failure that ends up impacting the performance of the overall system is not acceptable. Throw the rotten apple out, and save the time and cost of trying to salvage bits of good flesh that might still be in there. The overall system will react intelligently and recover, rather than hiccup inexplicably and blow through SLAs.
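The "throw the whole apple out" policy can be sketched in a few lines. This is an illustration only, not how Crowbar is implemented: the `Node` class, its `disk_ok` field, and `partition_nodes` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str
    disk_ok: List[bool]  # one health flag per attached disk (hypothetical model)


def partition_nodes(nodes: List[Node]) -> Tuple[List[Node], List[Node]]:
    """Split nodes into (usable, rejected) under a strict policy:
    a single failed disk rejects the whole node."""
    usable: List[Node] = []
    rejected: List[Node] = []
    for node in nodes:
        (usable if all(node.disk_ok) else rejected).append(node)
    return usable, rejected


nodes = [
    Node("storage-01", [True] * 40),
    Node("storage-02", [True] * 39 + [False]),  # one bad disk out of 40
    Node("storage-03", [True] * 40),
]
usable, rejected = partition_nodes(nodes)
print("usable:  ", [n.name for n in usable])
print("rejected:", [n.name for n in rejected])
```

The strict rule gives up 39 good disks on the rejected node, which hurts when you only have a handful of servers. At rack scale, the trade is a cluster whose capacity is exactly what the node count says it is, plus an explicit list of nodes to repair.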

The post is getting long, so time for some parting thoughts:

  • Q&A is an important function - if done right they're your friendliest customers, they'll patiently enter very informative problem reports, and give you access to their environment. However, make sure that the enhancements they seek actually reflect the pain that a real customer will have.
  • As deployments, both physical and in the cloud, grow in size, the operations calculus changes dramatically. It's easier/faster/cheaper to throw out the bad apple, rather than analyze what got it sick.
Align your stars and do the math before you take wasteful action.