Thursday, January 31, 2013

Democratizing Storage

Or, you control your bits

Traditional storage solutions gravitated towards some central bank of disks - SAN, NAS, Fibre Channel, take your pick - and they share a few traits that are not very democratic:

  • They cost a lot, and a large part of that cost is the Intellectual Property embedded in the solution (i.e. the markup on the underlying hardware is huge)
  • The OEM makes lots of trade-off decisions for you - e.g. the ratio of controllers to disks, replication policies and many others (most OEMs graciously expose some options that the user can control, but those are just a tiny fraction)
  • They typically require 'forklift upgrades' - when you use up your capacity, call the forklift to install the next increment, which typically arrives as a forklift's worth of equipment
On the plus side, those types of systems generally provide you with a reliable, performant storage solution (the more $$$ you spend, the more of either quality you get).

But, in the world of large scale deployments based on open source software, the traditional storage solutions are an anachronism. 

There are now a slew of distributed storage solutions that solve the pain points of the old solutions and democratize storage. 

The different solutions differ along a few axes:
  • Type of access they provide - Block (e.g. iSCSI), File system (e.g. similar to ext3 or other POSIX filesystems), Object (e.g. Amazon S3, Swift) or a tailored API (e.g. Hadoop)
  • Is it a generic storage solution, or tailored for a higher purpose -
    • Ceph offers lots of different access methods, and can be used for pretty much any type of storage (e.g. Block, file system, and object)
    • Hadoop FS is tailor-made for... you've guessed it - Hadoop workloads
    • Swift and S3 only offer Object storage semantics
    • SheepDog is tailor-made for virtualization-based workloads.
  • The complexity of their metadata handling - or in simpler terms, if you have a blob of bits, how complex a name can you give it? And if you have lots of these blobs, how smart is the technology in handling all these names?
    • Swift chose a very simple naming scheme - you have Accounts, which contain Containers, which contain Objects. That's it - a two-level-deep naming scheme. This simplicity allows Swift to be very smart about replicating this information, providing high availability and performance.
    • Hadoop provides a full directory structure, similar to traditional filesystems (e.g. DOS/FAT or Linux ext3), but it's a bit dismal about replicating it (better in Hadoop 2.0). It relies on another Apache project (ZooKeeper) to maintain and synchronize the little beasts.
    • Ceph takes a mixed approach - the underlying object store library has Pools and Objects, each with a simple name (pools also have policies attached), but it also provides rich and capable additional metadata services
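To make the naming discussion concrete, here's a minimal Python sketch of a two-level Account/Container/Object namespace like Swift's. This is an illustration of the idea, not Swift's actual implementation, and the account, container and object names are made up:

```python
class ObjectStore:
    """Toy two-level namespace: accounts -> containers -> objects.

    Illustrative only - real Swift distributes this mapping across
    replicated account/container databases and an object ring.
    """

    def __init__(self):
        # account name -> container name -> {object name: bytes}
        self.accounts = {}

    def put(self, account, container, name, data):
        # Containers are flat: there are no nested directories,
        # just object names within a container.
        self.accounts.setdefault(account, {}) \
                     .setdefault(container, {})[name] = data

    def get(self, account, container, name):
        return self.accounts[account][container][name]

    def list_objects(self, account, container):
        return sorted(self.accounts[account][container])


store = ObjectStore()
store.put("acme", "photos", "2013/jan/cat.jpg", b"...bits...")
store.put("acme", "photos", "2013/jan/dog.jpg", b"...bits...")

# "2013/jan/" looks like a directory, but it's just part of a flat
# object name - there's no directory tree to replicate, which is what
# keeps the metadata handling simple.
print(store.list_objects("acme", "photos"))
```

Contrast this with Hadoop's full directory tree: there, renaming or replicating a directory means coordinating metadata for everything underneath it, which is exactly the complexity the flat scheme avoids.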
So, you have lots of options that are much more cost-effective and capable. But you haven't found the panacea of storage quite yet. These solutions have their dark shadows too:
  • In many cases you get a Swiss army knife with lots of blades to play with, and to get hurt by. Those trade-offs that OEMs made for you in the old solutions... you now have to evaluate and select yourself (or hire consultants)
  • The solutions above provide the software, but as famously said, the cloud doesn't run on water vapor - you still need to pick and buy the hardware (or buy a packaged solution)
  • It's all on you... no vendor to call up and nag with support calls (unless you pay a support vendor / solution provider).

Are the shadows scary? A bit. Can they be tackled? With a bit of work and research, absolutely! And it's worth it!

In a follow-up post, I'm planning to describe some of the hardware choices and considerations that go into deploying a petabyte-scale hardware platform for a distributed storage deployment, based on a recent project.

Tuesday, January 22, 2013

Openstack 'secret sauce'

Or, some less than obvious reasons why refactoring is "A Good Thing"

At a meetup tonight, someone challenged me to explain what's really good about Openstack. This was in the context of an Openstack-Boston / Chef-Boston discussion about Openstack, the effort around community deployment cookbooks, and an approach that uses Pull From Source (which I'll post about at a later date).

While I could have spent lots of time describing the CI testing infrastructure and the great work done by Monty and his team, frankly that's not unique to Openstack. It's an enabler for lots of other things.

To me, one of the primary sources of excellence in Openstack is the courage to refactor.

Not too long ago, there were only 2 services - Nova for Compute, and Swift for Object storage. In Grizzly, through large efforts, there are dedicated services, each with a clear focus and a dedicated team passionate about its technology area.

One of the first refactors was Keystone. Both Nova and Swift had their own approach to providing authorization, authentication and separation between tenants. During the Diablo release, the Keystone service was carved off to provide a centralized function for these capabilities.

While the immediate end-user benefit is clear - a single sign-on system - what the discussion tonight helped me put into words is the benefit to the community and the overall vibrancy of Openstack. I'll keep the suspense and provide another example first.

The Cinder block storage service in the upcoming Grizzly release started its life deep inside Nova, as nova-volume. In that location, it shared some code, but mostly a project lead (PTL) and developers, with Nova. As a standalone project it has a separate (though somewhat overlapping) sub-community dedicated to storage technology.
(I'd be remiss if I didn't mention Quantum, the Software Defined Networking service, which started its life as nova-network and followed a similar path during the Essex release)

Is the picture emerging?

As technology areas are identified within their current "home", they're spawned into their own project under the Openstack umbrella. This allows a community of enthusiasts to form around the project and drive its development.

Going back to Cinder as a poster child of success - now that a focused block-storage community is forming around it, vendors are getting engaged. More than 11 vendors have contributed at least their 'drivers' (a driver allows Cinder to "talk" the unique protocol of a particular back-end storage platform). In the process, Cinder itself is becoming better.

Would the storage vendors have had the incentive to contribute to nova-volume? Maybe. Is Openstack stronger for creating a focused set of PTLs, core code reviewers and engaged contributors who care only about storage? I think so.
(Again, not to neglect Quantum... exactly the same result! And Keystone too)

OpenStack's willingness to refactor encourages deep experts to join the project because they get to take ownership of code.  That 'secret sauce' drives excellence and community growth.