Thursday, July 30, 2015

Dude I'm (not) Getting fired

Or how to make the conversation be about $200 mistakes rather than $20,000 mistakes.

It appears that the common wisdom about cloud has finally caught up - the main benefit of leveraging the cloud is agility, and other factors (e.g. cost) are secondary. The ability to go to market quickly, with new prototypes or actual solutions, is critical for competitiveness. The evidence supporting these statements is most visible in the movement of large enterprise organizations into the cloud, and in the growing ecosystem of MSPs and supporting businesses.

However, agility while ignoring costs is sometimes risky and... pricey. Here are some horror stories I have heard (and committed) while enjoying the benefits of agility in the cloud:

  • Volumes of Couch Potatoes: To support overnight backend processing economically and leverage the dynamic nature of the cloud, we set up an Auto Scaling Group - we automatically provisioned instances and storage to process hundreds of GBs of data overnight, and once processing was done, all the instances were terminated. However, the bootstrap code on the new instances neglected to delete the 400GB volumes attached to each of the 50 instances in the ASG. We were on pace to provision 20TB of volumes nightly, 600TB monthly - all just hanging around and wasting money. Had this gone on for a whole month, ka-ching - $60,000 out the window. 

  • Don’t Cry for Me, RDS: A customer I’ve worked with had a nightly process to boost their engineering velocity - loading an anonymized production database into RDS, to use for testing and verification in a safe environment. The process worked flawlessly and saved lots of time for engineers, who no longer had to waste time on data dumps. Alas, while the automation was good at spinning up new environments, the cleanup was neglected. Every night a new RDS instance, with its attached resources, would be left behind, lacking any attention. 

  • The Price of Redesigning: After a large redesign of an application, a prospect was very happy with the new availability and resiliency of their deployment. The old architecture relied on Elastic IP addresses - if a server ever failed, its EIPs would be reallocated to a new instance that was provisioned automatically. While functional, this design made failures visible to consumers of the services. The new design switched to an on-use service discovery model, which guaranteed seamless transitions in the face of failures.

    As they soon discovered, to the tune of $4,000, AWS applies a bit of social engineering via its pricing policies for EIPs. You can provision IP addresses elastically. However, since IP addresses are a relatively scarce resource, you had better use them - you only pay for unused EIPs! 

  • Last year’s girlfriend: One of the attributes of the cloud is that things fail. That’s just a reality. However, as an architect you have plenty of options to safeguard your availability in the face of these failures. For EBS volumes (which hardly ever fail nowadays, but... still), snapshots let you store an offline incremental backup of your data, among other things. If you ever have to recover from a failed volume, or restore data deleted accidentally, you'll just love your snapshots. They're a lifesaver. So we had been taking snapshots, often. Every few hours, for the last couple of years. At some point we really stopped loving our old snapshots... can you really justify spending thousands on "pictures" that are a pale representation (incremental backup) of bygone times? 
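For the couch-potato story, a periodic check would have caught the leak early. Below is a minimal sketch of such a check: it works on volume records shaped like the EC2 DescribeVolumes response, and the $0.10/GB-month rate is an assumed standard-volume price for illustration, not a quoted one.

```python
GB_MONTH_RATE = 0.10  # assumed EBS standard-volume rate, USD per GB-month

def unattached_volume_cost(volumes):
    """Return (count, monthly_cost) for volumes with no attachments."""
    orphans = [v for v in volumes if not v.get("Attachments")]
    monthly_cost = sum(v["Size"] for v in orphans) * GB_MONTH_RATE
    return len(orphans), monthly_cost

# 50 leftover 400GB volumes, as in the story above:
volumes = [{"VolumeId": f"vol-{i:04d}", "Size": 400, "Attachments": []}
           for i in range(50)]
count, cost = unattached_volume_cost(volumes)
print(count, cost)  # 50 2000.0 -- one night's leftovers, $2,000/month
```

In a real deployment the volume records would come from an API call (e.g. boto3's describe_volumes) and the result would feed an alert, not a print.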

A common thread among the war stories above (all real, with identities slightly modified to avoid embarrassment, and only minor literary latitude exercised) is that everything might appear to be working just fine, at least from a functional point of view, while thousands of dollars were flying out the window - and no one was even aware the window was open, let alone rushing to shut it.

The expensive couch potatoes resulted in a conversation about a $300 mistake. Our CFO, who uses our own product, had noticed a worrying trend lasting 3 days - the EBS cost had grown at the same rate as our unattached volumes. On the 4th day, when the alert fired, he was certain something was wrong, since engineering had agreed that anything over a 15% increase in unattached storage within 3 days is beyond our normal application of agility in the production environment.

What ensued was a conversation that went something like this:
CFO: Hey guys... I saw this change in the health check - what happened?
DevOps: Oops... that doesn't look right, let me look.
DevOps (30 min later): Oh... damn. This ASG <since the CFO retold the story> mumbly frutz this and that other thing...
DevOps: We just spent $500 over the last few days on unused volumes.
CFO: OK... stop it.
The conversation could have had a less cordial tone had it occurred too late... and been about not $300 but rather $3,000 or $30,000 (if you were worried: no, my pay didn't get docked).
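The health check that triggered this conversation boils down to a simple threshold rule. A sketch, using the 15%-in-3-days policy from the story and made-up storage figures:

```python
def unattached_growth_alert(baseline_gb, current_gb, threshold_pct=15):
    """Flag unattached-storage growth beyond the agreed threshold."""
    growth_pct = (current_gb - baseline_gb) / baseline_gb * 100
    return growth_pct > threshold_pct, growth_pct

# Unattached storage three days ago vs. today (hypothetical figures):
fired, growth = unattached_growth_alert(baseline_gb=1000, current_gb=1400)
print(fired, growth)  # True 40.0 -- time for that conversation
```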

Also common to the stories above is the value of a "watchful eye" that makes sure the conversations occur while the oopsie is still below $1,000, as opposed to over $100,000. It might make you feel almost like there's a little Jiminy Cricket around, riding shotgun and alerting you to troubles down the road, only a bit less mythical.

The Cricket on your shoulder should help you with:
  • Complete visibility into changes in your cloud environment, hour by hour, day by day. When a substantial change occurs, you should have at your fingertips the answers to who made the change, when it occurred, and what the cost and security implications are. 
  • Defining policies to proactively monitor your cloud. When managing environments with hundreds or thousands of systems and tens or hundreds of people operating them, manual inspection is not an option. 
  • Putting actions at the click of a mouse. The cloud is all about automation, and so should be your monitoring. When policies are violated, remedial action should be a click away. 
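Tying the war stories back to these points, a toy "cricket" scan might walk the inventory and flag the usual suspects - unattached volumes, idle EIPs, ancient snapshots. The record shapes below mirror the EC2 describe-* responses, and the 90-day snapshot threshold is an illustrative assumption:

```python
from datetime import datetime, timedelta, timezone

def cricket_scan(volumes, addresses, snapshots, now=None, snap_max_days=90):
    """Return (finding, resource-id) pairs worth a conversation."""
    now = now or datetime.now(timezone.utc)
    findings = []
    findings += [("unattached-volume", v["VolumeId"])
                 for v in volumes if not v.get("Attachments")]
    findings += [("idle-eip", a["PublicIp"])
                 for a in addresses if "AssociationId" not in a]
    cutoff = now - timedelta(days=snap_max_days)
    findings += [("stale-snapshot", s["SnapshotId"])
                 for s in snapshots if s["StartTime"] < cutoff]
    return findings

now = datetime(2015, 7, 30, tzinfo=timezone.utc)
report = cricket_scan(
    volumes=[{"VolumeId": "vol-1", "Attachments": []}],
    addresses=[{"PublicIp": "203.0.113.7"}],
    snapshots=[{"SnapshotId": "snap-1",
                "StartTime": now - timedelta(days=400)}],
    now=now,
)
print(report)
```

In practice the three inputs would come from API calls (e.g. boto3's describe_volumes, describe_addresses, describe_snapshots), and each finding would trigger the one-click remedial action above.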

Go conquer the world cloud, but carry a Cricket with you, to keep you safe.

Originally posted to my company's engineering blog.

Friday, May 29, 2015

Why is this blog so UGLY

AND, hard to read to boot.

The short answer: intentionally.
The rumor: because I can't create a decent UX even if my life depended on it. (Don't believe it.)

So why so ugly and hard to read? Stats and Tracking, and selective user targeting.

Lots of people will read sites with catchy headlines and picture-rich, attractive-looking pages. I do that while waiting in the checkout line, looking for something to pass the time with. The marketing industry has a name for it - ClickBait.

I, however, am not looking for clicks. I'm looking to find which ideas resonate with people. I'm looking to see which entries get passed from hand to hand and gain an escalated readership.

So, I keep it ugly intentionally.

If you tend to judge books by their cover, please move on. Ugly cover here. Please move off this page in under 5 seconds so as not to skew my stats.

If on the other hand, you find the ideas intriguing, by all means, drop me a note, sing me a song or just enjoy and come back when I finally do write another post.

Friday, May 8, 2015

SaaSy Cloudy SSD's

Or, Should you abandon old wisdom

In the world of the Public Cloud little is stable, especially Common Wisdom. Following accepted Common Wisdom blindly leads to lost opportunities to capitalize on enhanced new offerings. Case in point - databases on EC2.

Databases are demanding beasts, which presents a few challenges:

  • Databases tend to be mission critical.
  • Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) are very stringent

These demands are somewhat in conflict with the fickle nature of public cloud - your servers might disappear or fail with little notice.

In the datacenter this implied highly redundant hardware and expensive, scale-up architectures. You have more data? You get a forklift to deliver a bunch more disks for your SAN (or whatever your storage solution is), and a few more blades/chassis to increase the number of cores in your Oracle RAC cluster.

The first generation mapping of the old datacenter architecture into the public cloud had these general guidelines:

  • Store your data on highly available storage - EBS volumes. To achieve the required storage capacity and IOPS performance, RAID as many volumes as you can manage.
  • To get better performance, shell out extra for Provisioned IOPS. In many cases the cost of provisioned IOPS dominates, exceeding the actual storage costs.
  • EBS, considered the most reliable online storage available, helps ensure RPO. In case of volume failures (which used to be much more frequent), recovery from volume snapshots + binlogs meets the RPO.
  • RTO is achieved by one of the options below (ordered from most expensive to least):
    • a fully replicated hot standby, effectively running 2x the server capacity
    • a warm standby, for sub-minute RTO 
    • no standby, but automated launching of a new instance and rebuilding of the RAID set, for < 10 min RTO. 
These strategies became the common wisdom, and any change to this blueprint was considered taboo.

This blueprint has a few shortcomings:

  • While EBS is very cost-effective, the blueprint requires provisioning large amounts of storage upfront, negating the benefit of consumption-based pricing
  • Scaling up requires changing instance types, and is inherently limited by the available cloud provider offerings

For those willing to upend common wisdom, there are better options enabled by new storage offerings from AWS (SSD, Dense Storage, GP2 Volumes).

The superior blueprint has these characteristics:

  • Store the database on instance storage - leveraging SSD's or dense magnetic storage. 
  • Scale out rather than up - this is partly required because of smaller capacities available on instance storage vs EBS.
  • Leverage other storage options for achieving RTO and RPO objectives.

Conventional wisdom didn't consider instance storage suitable for a database because of its ephemeral nature - its contents are lost if the instance is destroyed. The need for resiliency outweighed performance and cost considerations.
This approach forgoes the very cost-effective performance benefits the new storage offerings enable - up to 120K IOPS. Achieving this performance on EBS (to the extent it is achievable at all) would be much costlier.

How is cost-effective resilience achieved when using instance store, then? Simple: have a backup strategy that:

  • Snapshots the database frequently (e.g. every 4 - 24 hours)
  • Stores sufficient binlogs on EBS to cover 2 full snapshot periods (at least 2 days' worth)
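The retention rule above can be checked mechanically: binlogs on EBS must span at least two full snapshot periods, so a restore can always replay forward from the previous snapshot even if the latest one turns out bad. A trivial sketch of that invariant:

```python
def binlog_retention_ok(snapshot_period_hours, binlog_retention_hours):
    """True if binlog retention covers two full snapshot periods."""
    return binlog_retention_hours >= 2 * snapshot_period_hours

print(binlog_retention_ok(12, 48))  # True: 48h of binlogs, 12h snapshot cadence
print(binlog_retention_ok(24, 36))  # False: retention under two 24h periods
```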

In case of a database instance failure, recovery involves loading the last snapshot and applying the binlogs from the resilient EBS volume.

Another consideration when switching to instance storage, rather than EBS, is scaling the storage capacity. Instance storage options range from 40GB (c3.large) to 2TB (on d2.xl). While with EBS you can use larger volumes, or multiple volumes, this option is not available for instance store.

Databases larger than the available storage require a sharding strategy - whereby different database instances are deployed to house fragments of the whole dataset, partitioned on logical boundaries.
In the world of SaaS offerings, these boundaries are often apparent - a user, a tenant (in a multi-tenant environment), etc.
If your current application is not set up to work with shards, there are options to avoid changing your code, e.g. Tesora's Database Virtualization Engine.
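A minimal sketch of routing by such a logical boundary - hashing a tenant id to a shard. The shard names and count are made up for illustration; note the use of a stable hash (Python's built-in hash() is randomized per process), and that hash-based placement makes resharding harder than a lookup directory would:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(tenant_id):
    """Map a tenant id to its database shard, deterministically."""
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("acme-corp"))  # same tenant always lands on the same shard
```

Partitioning on the tenant boundary means every query for a tenant touches exactly one database instance, which is what keeps scale-out linear.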

As opposed to the "traditional" scaling strategy, this approach has the following benefits:

  • Storage (and compute) capacity is provisioned as the need arises, leveraging consumption based pricing.
  • As the data set grows, more compute resources are deployed together with storage
  • There are no bounds for scale.

Comparing performance: when using EBS, an instance is limited to 48k IOPS, even with EBS Optimization enabled. To realize this IO performance, at least 3 volumes would need to be attached (because of the 20k IOPS limit per volume).
Compare this to instance store - up to 315k IOPS provided by SSD instance storage on i2 instances. For EBS, on top of the charges for storage ($0.10/GB-month), expect to add up to $1,300 per volume for provisioned IOPS and up to $30 in EBS-Optimized charges for the instance, to realize the throughput to the EBS backend.

For an instance store based solution, all the costs are part of the hourly usage charge!
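Rough arithmetic behind that comparison, using the figures above ($0.10/GB-month for storage, up to ~$1,300 per volume per month for PIOPS, ~$30 for EBS-Optimized). Exact rates vary by region and over time, so treat these as illustrative:

```python
def ebs_monthly_cost(size_gb, volumes, piops_per_volume=1300.0,
                     ebs_optimized=30.0, gb_month_rate=0.10):
    """Monthly EBS bill for a striped, PIOPS-backed volume set."""
    return (size_gb * gb_month_rate
            + volumes * piops_per_volume
            + ebs_optimized)

# Three 1TB PIOPS volumes striped to approach the 48k IOPS instance limit:
print(ebs_monthly_cost(size_gb=3000, volumes=3))  # 4230.0 per month
```

With instance store, the equivalent storage and IOPS ride along in the hourly instance price, which is what tilts the comparison.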

We have been successfully running with this setup ever since AWS released SSD based instance store on the C3 family, and have not lost a single bit of data.

As a side note, unfortunately, RDS does not yet give you the option of leveraging instance storage - you are forced to attach EBS storage of some type. If you want the price-effective performance, you have to roll your own DB. That said, RDS has a good history of catching up with new practices.

New realities in any realm require re-evaluating historic dogmas, and in the world of cloud, reality changes often, so take a step back, and evaluate if you're squeezing all the performance that's available to you.