In my previous blog post, “Hadoop and Disparate Data Stores”, I introduced a project Greenplum is working on that abstracts an organization’s various storage options under a unified layer called the Unified Storage System (USS). One advantage of USS is that it enables storage tiering, a concept that has been around for some time but is still unfamiliar to many. The basic idea is that newer datasets are more relevant than older ones, and every dataset’s importance eventually diminishes over time.
Typically, high-demand data is termed “Hot” and low-demand data “Cold”; this is the idea behind the industry jargon “hotness of data.” Studies have found that 90% of all datasets drop to 10% of their usage within the first 18 days, a nearly exponential decline.
What is Hot/Cold data?
Every company has a different meaning for what Cold or Hot means to them: Company A might deem any dataset used less than twice a day as Cold, over ten times a day as Hot, and everything in between as Warm. In comparison, Company B might deem a dataset used less than ten times an hour as Cold, while everything else is considered Hot.
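Using Company A’s example thresholds, the classification rule is simple enough to express in a few lines. This is a hypothetical sketch (the function name and signature are illustrative, not part of USS), assuming access rate per day is the only signal:

```python
def classify(accesses_per_day: float) -> str:
    """Classify a dataset's temperature using Company A's example
    thresholds: fewer than 2 accesses/day is Cold, more than 10 is Hot,
    everything in between is Warm."""
    if accesses_per_day < 2:
        return "Cold"
    if accesses_per_day > 10:
        return "Hot"
    return "Warm"
```

Company B would simply swap in its own thresholds and time window, which is exactly why any tiering system has to treat these boundaries as configuration rather than constants.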
Cold data is typically moved to cold storage to make space for Hot datasets. Not long ago, cold storage lived offline on magnetic tapes. That model no longer holds up: a few years ago, for example, an individual would have to pay $5 just to get a year-old statement from their bank. To pull a customer’s statement out, the bank had to undertake the painful and expensive process of moving data from tape storage onto spinning disks so that it could be read. In today’s world, such a cost would drive a customer away for good.
The same principle applies to how data is stored for analytics. Historical storage models that lack flexibility over time simply aren’t acceptable anymore. Aggregation is one technique for getting past this limitation, but not all analytics work fits that model. We need our data science teams to be able to access as much data as possible, while still keeping overall costs under control.
Most companies agree that a good tiering strategy should enable switching datasets between Hot and Cold states, without taking any data offline.
Can I have my cake and eat it too?
The Hadoop ecosystem currently has no notion of storage tiering: all data that needs to be analyzed must live on HDFS. At the same time, most Hadoop clusters around the world are storage-bound, routinely requiring infrastructure teams to add machines for the sole purpose of providing more HDFS space. Even when the cluster’s computing power is under-utilized, more machines are needed just to keep up with growing storage demand.
Consider this real-life example:
The clickstream data of a large web 2.0 company arrives at ~10 TB a day and needs to stay online for 180 days. With HDFS’s 3X replication rule, the total footprint is 5.4 PB of storage occupied at all times. Even though a dataset’s usage typically drops to 10% after 30 days, it never reaches zero, so this data cannot be deleted or “copied out” of HDFS: jobs will continue to be submitted against it. This data would typically be considered “Cold”, yet it still occupies prime space that could be used for Hot data. If the cluster is nearly full, any future request to add 1 PB of storage implies adding ~40 machines with 24 TB of drives each.
However, if we could move the Cold data to a potentially slower, but still reliable, storage location holding a single copy of the data, we would free up an immense amount of space. Doing the math for 60 days’ worth of data: 10 TB × 60 days × 3 copies = 1.8 PB of HDFS space, or roughly the storage capacity of 70 machines.
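The arithmetic in this example can be checked with a few lines. All figures come from the scenario above; the 24 TB-per-machine capacity is the one stated in the example:

```python
# Figures from the clickstream example.
TB_PER_DAY = 10        # daily clickstream volume
RETENTION_DAYS = 180   # online retention requirement
REPLICATION = 3        # HDFS 3X replication
MACHINE_TB = 24        # drive capacity per machine

# Total HDFS footprint occupied at all times.
total_pb = TB_PER_DAY * RETENTION_DAYS * REPLICATION / 1000  # 5.4 PB

# Space freed by moving 60 days of Cold data off HDFS,
# where the cold tier keeps only a single copy.
freed_tb = TB_PER_DAY * 60 * REPLICATION   # 1,800 TB = 1.8 PB
machines_freed = freed_tb / MACHINE_TB     # 75, roughly 70 machines
```

The same script also explains the ~40-machine figure: 1 PB of storage divided by 24 TB per machine is about 42 machines.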
This is exactly what USS enables for an HDFS cluster.
USS and Tiered Storage
USS maintains mount points for datasets in a central location and monitors their use over time. It keeps complete knowledge of the state of the datasets on HDFS, and it uses that knowledge to move data back and forth between Cold Storage and HDFS without requiring user intervention. Either policy-based or statistical models can drive the transparent data movement and mount-point resetting. USS also maintains a complete history of each dataset throughout its lifecycle.
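To make the policy-based case concrete, here is a minimal sketch of the kind of demote/promote decision USS automates. All names, fields, and thresholds here are illustrative assumptions, not USS’s actual API:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """Illustrative record of what a tiering engine tracks per dataset."""
    name: str
    age_days: int
    accesses_per_day: float
    tier: str = "hdfs"  # "hdfs" (Hot) or "cold"

def apply_policy(datasets, max_age=30, min_hot_accesses=2.0):
    """Demote old, rarely used datasets to cold storage; promote cold
    datasets that have become busy again. Thresholds are hypothetical."""
    for ds in datasets:
        if ds.tier == "hdfs" and ds.age_days > max_age \
                and ds.accesses_per_day < min_hot_accesses:
            ds.tier = "cold"
        elif ds.tier == "cold" and ds.accesses_per_day >= min_hot_accesses:
            ds.tier = "hdfs"
    return datasets
```

In USS the equivalent decision also triggers the physical data movement and resets the mount point, so jobs keep addressing the dataset by the same name regardless of which tier it lives on.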
A great example of a remote storage system that can be used to complement HDFS is EMC’s scale-out NAS system, Isilon. Isilon provides storage efficiency through the use of erasure coding, only requiring 1.25X space on disk, rather than the 3X replication required by HDFS.
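The overhead difference is easy to quantify. Using the 1.25X and 3X factors from the paragraph above, an illustrative 1 PB dataset works out as follows:

```python
data_tb = 1000                     # 1 PB of user data (illustrative)
hdfs_tb = data_tb * 3.0            # footprint under 3X HDFS replication
isilon_tb = data_tb * 1.25         # footprint under erasure coding
savings = 1 - isilon_tb / hdfs_tb  # fraction of raw capacity saved
```

That is a reduction of well over half the raw capacity for the same protected data, which is what makes such a system attractive as the Cold tier.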
USS also provides the capability to set up a multi-level hierarchy that addresses Warm data sources as well. In such a scheme, the slowest and cheapest remote storage holds Cold data, HDFS holds Hot data, and a remote NFS/S3 instance stores Warm data. Another advantage of USS is that it enables consistency and locking for datasets, but I will cover that concept in the next blog post.
USS is just one of the exciting Hadoop projects we’re currently working on in our development of Greenplum HD. So stay tuned, there is a great deal of breakthrough technology we’ll be announcing soon…