Thursday, January 31, 2013

Democratizing Storage

Or, you control your bits

Traditional storage solutions gravitated towards some central bank of disks: SAN, NAS, Fibre Channel, take your pick. They share a few traits that are not very democratic:

  • They cost a lot, and a large part of the cost is intellectual property embedded in the solution (i.e. the markup on the underlying hardware is huge)
  • The OEM makes lots of trade-off decisions for you - e.g. ratio of controllers to disks, replication policies and lots of others (most OEMs graciously expose some options that the user can control, but those are just a tiny fraction)
  • They typically require 'forklift upgrades' - if you use up your capacity, call the forklift to install the next increment, which typically requires a forklift's worth of equipment
On the plus side, these types of systems generally provide a reliable, performant storage solution (the more $$$ you spend, the more of either quality you get).

But, in the world of large scale deployments based on open source software, the traditional storage solutions are an anachronism. 

There are now a slew of distributed storage solutions that solve the pain points of the old solutions and democratize storage. 

The different solutions differ along a few axes:
  • Type of access they provide - Block (e.g. iSCSI), File system (e.g. similar to ext3 or other POSIX filesystems), Object (e.g. Amazon S3, Swift) or tailored API (e.g. Hadoop)
  • Whether it is a generic storage solution or tailored for a higher purpose -
    • Ceph offers lots of different access methods, and can be used for pretty much any type of storage (e.g. Block, file system, and object)
    • Hadoop FS is tailor made for .... you've guessed it - Hadoop workloads
    • Swift and S3 only offer Object storage semantics
    • Sheepdog is tailor-made for virtualization-based workloads.
  • The complexity of their metadata handling - or in simpler terms, if you have a blob of bits, how complex is the name you can give it? And if you have lots of these blobs, how smart is the technology in handling all these names?
    • Swift chose a very simple naming scheme - you have Accounts which contain Containers which contain Objects. That's it: a two-level-deep naming scheme. This simplicity allows Swift to be very smart about replicating this information, providing high availability and performance.
    • Hadoop provides a full directory structure, similar to traditional filesystems (e.g. DOS/FAT or Linux ext3). But it's a bit dismal about replicating it (better in Hadoop 2.0). It relies on another Apache project (ZooKeeper) to maintain and synchronize the little beasts.
    • Ceph takes a mixed approach - the underlying object store library has Pools and Objects, each having a simple name (pools also have policies attached). But it also provides rich and capable additional metadata services
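To make the Swift naming scheme a bit more concrete, here is a minimal Python sketch of how an Account / Container / Object triple maps onto a single flat path of the form /v1/<account>/<container>/<object>. The helper names (swift_object_path, split_swift_path) and the sample account "AUTH_demo" are made up for illustration and are not part of any Swift client library; the sketch only builds and parses strings and does not talk to a real cluster.

```python
from typing import Tuple
from urllib.parse import quote


def swift_object_path(account: str, container: str, obj: str) -> str:
    """Build a Swift-style path: /v1/<account>/<container>/<object> (illustrative helper)."""
    # Assumption for this sketch: account and container names never contain '/'
    if "/" in account or "/" in container:
        raise ValueError("account and container names cannot contain '/'")
    # Slashes inside the object name are kept, but they are just part of
    # one flat name; there is no real directory behind them.
    return "/v1/" + "/".join([quote(account), quote(container), quote(obj, safe="/")])


def split_swift_path(path: str) -> Tuple[str, str, str]:
    """Split a path built above back into (account, container, object)."""
    # Split at most 3 times so the object name keeps any embedded slashes
    _version, account, container, obj = path.lstrip("/").split("/", 3)
    return account, container, obj


path = swift_object_path("AUTH_demo", "photos", "2013/01/cat.jpg")
print(path)
print(split_swift_path(path))
```

Note that "2013/01/cat.jpg" looks like a directory tree, but as far as the naming scheme goes it is a single object name inside the "photos" container. That is the contrast with Hadoop's full directory structure described above.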
So, you have lots of options that are much more cost effective and capable. But you haven't found the panacea of storage quite yet. These solutions have their dark shadows too:
  • In many cases you get a Swiss army knife with lots of blades to play with, and to get hurt by. The trade-offs that OEMs make for you in the old solutions are now yours to evaluate and select (or you hire consultants to do it)
  • The solutions above provide the software, but as famously said - the cloud doesn't run on water vapor - you still need to pick and buy the hardware (or buy a packaged solution)
  • It's all on you... no vendor to call up and nag with support calls (unless you pay a support vendor / solution provider).

Are the shadows scary? A bit. Can they be tackled? With a bit of work and research, absolutely! And it's worth it!

In a follow-up post, I'm planning to describe some of the hardware choices and considerations that go into deploying a petabyte-scale hardware platform for a distributed storage deployment, based on a recent project.