Diverging views on Big Data density
One of Greenplum's advisors is UC Berkeley Professor Joe Hellerstein. In a recent post on his blog, Data Beta, he shares some thoughts about big data deployments on Hadoop and Greenplum, comparing and contrasting the two approaches.
A few choice extracts:
- Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disks, before reading it right back in. (I describe this to my undergrads as the “regurgitation approach” to fault tolerance.) By contrast, classic MPP database approaches (like Graefe’s famous Exchange operator) are wildly optimistic and pipeline everything, requiring restarts of deep dataflow pipelines in the case of even a single fault.
- The Google MapReduce pessimistic fault model requires way more machines, but the more machines you have, the more likely you are to see a fault, which will make you pessimistic….
- It sounds wise to only play the Google regurgitation game when the cost of staging to disk is worth the expected benefit of enabling restart. Can’t this be predicted reasonably well, so that the choice of pipelining or snapshotting is done judiciously?
That last point hits the nail on the head. If a query would run for 1 minute without ‘regurgitation’ and 40 minutes with it (or require 40x the hardware), you’d probably be better off just running it straight and letting the query restart automatically if it fails. For longer-running queries, a very selective amount of mid-query checkpointing (i.e. not full regurgitation at every step) could start to make sense, but finding the right balance is really an optimization problem based on the expected runtime and characteristics of the query. If only we had a smart query optimizer that we could use to make those kinds of decisions… :)
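To make the trade-off concrete, here is a rough sketch of the kind of cost estimate such an optimizer might run. It is not how Greenplum or Hadoop actually decide anything. It assumes failures arrive as a Poisson process at a constant rate, that a failure sends the query back to its last checkpoint with no extra restart overhead, and that every checkpoint costs the same fixed time. The function names and parameters (`attempt_time`, `expected_runtime`, `best_segments`, `checkpoint_cost`) are made up for illustration. Under those assumptions, the expected time to get through an uninterrupted stretch of length t at failure rate λ is (e^(λt) - 1) / λ.

```python
import math


def attempt_time(length, failure_rate):
    """Expected wall-clock time to complete one uninterrupted stretch of
    `length`, restarting the stretch from scratch after every failure.
    Assumes exponentially distributed time between failures."""
    if failure_rate == 0:
        return length
    return math.expm1(failure_rate * length) / failure_rate


def expected_runtime(work, failure_rate, segments=1, checkpoint_cost=0.0):
    """Expected time to finish `work` when it is split into `segments`
    equal pieces, with a checkpoint written after every piece but the last.
    segments=1 means pure pipelining with restart on failure."""
    piece = work / segments
    checkpointed = (segments - 1) * attempt_time(piece + checkpoint_cost, failure_rate)
    final = attempt_time(piece, failure_rate)
    return checkpointed + final


def best_segments(work, failure_rate, checkpoint_cost, max_segments=100):
    """Pick the segment count with the lowest expected runtime (hypothetical
    brute-force search; a real optimizer would be far more selective)."""
    return min(
        range(1, max_segments + 1),
        key=lambda k: expected_runtime(work, failure_rate, k, checkpoint_cost),
    )


if __name__ == "__main__":
    # All times in minutes; one failure per 1000 minutes on average.
    short = best_segments(work=1, failure_rate=0.001, checkpoint_cost=0.5)
    # A ten-hour query on a flakier cluster: one failure per 100 minutes.
    long_ = best_segments(work=600, failure_rate=0.01, checkpoint_cost=1.0)
    print("segments for short query:", short)
    print("segments for long query:", long_)
```

In this toy model the 1-minute query is best left unsegmented, while the long query benefits from a handful of checkpoints, which is the "selective" middle ground between full pipelining and checkpointing after every stage.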