Thursday, July 30, 2015

Dude, I'm (Not) Getting Fired

Or how to make the conversation be about $200 mistakes rather than $20,000 mistakes.

It appears that the common wisdom about the cloud has finally caught up: the main benefit of the cloud is agility, and other factors (e.g. cost) are secondary. The ability to go to market quickly, with new prototypes or actual solutions, is critical for competitiveness. The evidence is most visible in the movement of large enterprise organizations into the cloud, and in the growing ecosystem of MSPs and supporting businesses.

However, agility that ignores costs is sometimes risky and... pricey. Here are some horror stories I have heard (and committed) while enjoying the benefits of agility in the cloud:

  • Volumes of Couch Potatoes: To support overnight backend processing economically and leverage the dynamic nature of the cloud, we set up an Auto Scaling Group - we automatically provisioned instances & storage to process 100s of GBs of data overnight. Once processing was done, all the instances were terminated. However, our bootstrap code for the new instances neglected to delete the 400GB volumes that were attached to each of the 50 instances in the ASG. We were on pace to provision 20TB of storage nightly, 600TB monthly - volumes hanging around and wasting money. Had this gone on for a whole month, ka-ching - $60,000 out the window.

  • Don’t Cry for Me, RDS: A customer I’ve worked with had a nightly process to help his engineering velocity - loading an anonymized production database into RDS to use for testing and verification in a safe environment. The process was working flawlessly and saving lots of time for engineers, who no longer had to waste time on data dumps. Alas, while the automation was good at spinning up new environments, the cleanup was neglected. Every night a new RDS instance, along with its attached resources, was left running with no one paying it any attention.

  • The Price of Redesigning: After a large redesign of an application, a prospect was very happy with the new availability and resiliency of their deployment. The old architecture relied on Elastic IP addresses - if a server ever failed, its EIPs would be reallocated to a new instance, which was provisioned automatically. While functional, this design made failures visible to consumers of the services. The new design switched to an on-use service discovery model, which guaranteed seamless transitions in the face of failures.

    As they soon discovered to the tune of $4,000, AWS has applied its social engineering via pricing policies to EIPs. You can provision IP addresses elastically. However, since IP addresses are a relatively scarce resource, you had better use them - you only pay for unused EIPs! The redesign left the old EIPs allocated but unused, and the bill followed.

  • Last year’s girlfriend: One of the attributes of the cloud is that things fail. That’s just a reality. However, you as an architect have plenty of options to safeguard your availability in the face of these failures. For EBS volumes (which hardly ever fail nowadays, but… still), snapshots allow you to store an offline incremental backup of your data, among other things. If you have ever had to recover from a failed volume or recover data deleted accidentally, you just love your snapshots. They’re a lifesaver. So we’ve been taking snapshots, often. Every few hours, for the last couple of years. At some point we really stopped loving our old snapshots… can you really justify spending 1000s on “pictures” that are a pale representation (incremental backup) of bygone times?
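
The couch potato arithmetic from the first story can be checked with a quick back-of-the-envelope sketch. The volume size, instance count, and 30-night month come from the story; the price of 10 cents per GB-month is an assumption chosen because it reproduces the $60,000 figure, not a quoted AWS rate, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the orphaned-volume story.
VOLUME_GB = 400            # size of each leftover volume (from the story)
INSTANCES = 50             # instances launched by the ASG each night (from the story)
NIGHTS_PER_MONTH = 30

# ASSUMPTION: 10 cents per GB-month, kept in integer cents to avoid float noise.
ASSUMED_PRICE_CENTS_PER_GB_MONTH = 10


def orphaned_gb(nights: int) -> int:
    """Total GB of unattached volumes left behind after the given number of nights."""
    return VOLUME_GB * INSTANCES * nights


def monthly_cost_usd(gb: int) -> float:
    """Cost in USD of keeping `gb` of storage around for one full month."""
    return gb * ASSUMED_PRICE_CENTS_PER_GB_MONTH / 100


nightly = orphaned_gb(1)                   # 20,000 GB, i.e. 20 TB (decimal)
monthly = orphaned_gb(NIGHTS_PER_MONTH)    # 600,000 GB, i.e. 600 TB
print(nightly, monthly, monthly_cost_usd(monthly))
```

Note that the $60,000 is what the accumulated 600TB costs for each month it is kept around; the bill during the first month, while the pile is still growing, would be lower.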

A common thread among the war stories above (all real, with identities slightly modified to avoid embarrassment, and only minor literary latitude taken) is that everything might appear to be working just fine, at least from a functional point of view, while thousands of dollars are flying out the window, and no one is even aware the window is open, let alone rushing to shut it.

The expensive couch potatoes resulted in a conversation about a $300 mistake. Our CFO, who uses our own product, had noticed a worrying trend lasting 3 days - the EBS cost was growing at a similar rate to our unattached volumes. On the 4th day, when the alert fired, he was certain that something was wrong - engineering had agreed that any increase of more than 15% in unattached storage over 3 days is beyond our normal application of agility in the production environment.

What ensued was a conversation that went like this:
CFO: Hey guys… I saw this change in the health check - what happened?
DevOps: Oops... that doesn’t look right, let me look.
DevOps (30 min later): Oh... damn. This ASG <the CFO is retelling the story> mumbly frutz this and that other thing…
DevOps: We just spent $500 over the last few days on unused volumes.
CFO: OK... Stop it.
The conversation could have had a less cordial tone had it occurred too late… and been about $3,000 or $30,000 rather than $300 (if you were worried, no, my pay didn’t get docked).
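
The rule that fired in this story (more than a 15% increase in unattached storage over 3 days) can be sketched roughly as follows. This is a minimal illustration, not how our product implements it: the function names and the shape of the input (one total of unattached GB per day, oldest first) are assumptions.

```python
from typing import Optional, Sequence


def growth_pct(daily_totals: Sequence[float], window_days: int = 3) -> Optional[float]:
    """Percent change between the latest sample and the one `window_days` earlier.

    Returns None when there is not enough history to compare.
    """
    if len(daily_totals) <= window_days:
        return None
    baseline = daily_totals[-1 - window_days]
    latest = daily_totals[-1]
    if baseline == 0:
        # Growing from nothing: treat any unattached storage as unbounded growth.
        return float("inf") if latest > 0 else 0.0
    return (latest - baseline) * 100 / baseline


def should_alert(daily_totals: Sequence[float],
                 threshold_pct: float = 15.0,
                 window_days: int = 3) -> bool:
    """True when unattached storage grew by more than `threshold_pct` over the window."""
    growth = growth_pct(daily_totals, window_days)
    return growth is not None and growth > threshold_pct


# Hypothetical daily totals of unattached GB: a quiet week, then the runaway ASG.
quiet = [1000, 1020, 1040, 1050]
runaway = [1000, 21000, 41000, 61000]
print(should_alert(quiet), should_alert(runaway))
```

Comparing against a baseline a few days back, rather than against yesterday, keeps normal day-to-day churn from firing the alert while still catching a steady leak within days rather than at the end of the billing month.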

Also common to the stories above is the value of a “watchful eye” that makes sure the conversation occurs while the oopsie is still below $1,000, as opposed to over $100,000. It might make you feel almost like there’s a little Jiminy Cricket around, riding shotgun and alerting you to troubles down the road, only a bit less mythical.

The Cricket on your shoulder should help you with:
  • Complete visibility into changes in your cloud environment, hour by hour, day by day. When a substantial change occurs, you should have at your fingertips the answers to who made the change, when it occurred, and what its cost and security implications are.
  • Defining policies to proactively monitor your cloud. When managing environments with 100s or 1000s of systems and 10s or 100s of people operating them, manual inspection is not an option.
  • Putting actions at the click of a mouse. The cloud is all about automation, and so should your monitoring activities be. When policies are violated, remedial action should be a click away.
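
As a rough sketch of what "defining policies" could look like for the stories above, the checks can be written as small functions evaluated over an inventory of resources. Everything here is hypothetical: the `Resource` record, the policy names, and the 90-day snapshot age are illustrative assumptions, not any AWS API and not our product's rule set. In practice the inventory would come from your cloud provider's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple


@dataclass
class Resource:
    kind: str                    # "volume", "eip", or "snapshot"
    resource_id: str
    attached_to: Optional[str]   # instance ID, or None when nothing uses it
    created: datetime


# A policy returns a violation message, or None when the resource is fine.
Policy = Callable[[Resource, datetime], Optional[str]]


def unattached_volume(r: Resource, now: datetime) -> Optional[str]:
    if r.kind == "volume" and r.attached_to is None:
        return "volume is not attached to any instance"
    return None


def idle_eip(r: Resource, now: datetime) -> Optional[str]:
    if r.kind == "eip" and r.attached_to is None:
        return "Elastic IP is allocated but not associated"
    return None


def stale_snapshot(max_age_days: int = 90) -> Policy:
    """Build a policy flagging snapshots older than `max_age_days` (assumed cutoff)."""
    def check(r: Resource, now: datetime) -> Optional[str]:
        if r.kind == "snapshot" and now - r.created > timedelta(days=max_age_days):
            return f"snapshot is older than {max_age_days} days"
        return None
    return check


def evaluate(resources: List[Resource], policies: List[Policy],
             now: datetime) -> List[Tuple[str, str]]:
    """Run every policy against every resource; return (resource_id, message) pairs."""
    violations = []
    for r in resources:
        for policy in policies:
            message = policy(r, now)
            if message is not None:
                violations.append((r.resource_id, message))
    return violations
```

Each violation is a natural place to hang the one-click remedial action: delete the volume, release the EIP, or prune the snapshot.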

Go conquer the cloud, but carry a Cricket with you to keep you safe.

Originally posted to my company's engineering blog,