04.14.2009 :: Ben Werther
Category:: April
Product Perspective: SQL and MapReduce. The choice is yours.
Computerworld has an interesting article about a new research paper comparing the performance of MapReduce and MPP databases. The paper, by DeWitt, Stonebraker, et al., is entitled "A Comparison of Approaches to Large-Scale Data Analysis" and will be presented at SIGMOD.
The paper pits Hadoop against Stonebraker’s Vertica and an unnamed ‘DBMS-X’ in a set of tests on a 100-node cluster. Their tests show that both databases are significantly faster than Hadoop: on a set of 5 tasks, DBMS-X was 3.2 times faster than Hadoop, and Vertica was 2.3 times faster than DBMS-X. Leaving aside the not-so-subtle Vertica pitch here, their conclusion is that the MapReduce model is inherently wasteful and less efficient than an MPP database architecture.
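To put those two ratios side by side, here is a quick back-of-the-envelope sketch. It assumes the quoted figures are simple averages that can be multiplied together, which the paper's per-task numbers may not strictly support, so treat the result as a rough indication only. The variable names are just illustrative.

```python
# Ratios quoted above, treated as simple averages (an assumption).
dbmsx_vs_hadoop = 3.2    # DBMS-X speed relative to Hadoop
vertica_vs_dbmsx = 2.3   # Vertica speed relative to DBMS-X

# Chaining the two ratios gives an implied Vertica-to-Hadoop ratio.
vertica_vs_hadoop = dbmsx_vs_hadoop * vertica_vs_dbmsx

print(f"Implied Vertica vs. Hadoop: about {vertica_vs_hadoop:.2f}x")
```

Under that assumption, the implied gap between Vertica and Hadoop is a bit over 7x.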
To some extent they have a point. Hadoop’s MapReduce implementation is anything but speedy; it is rumored to be an order of magnitude slower than Google’s internal MapReduce implementation. Users of Hadoop need to be willing to throw as much as 10 times the hardware at a problem to match any of the better MPP database implementations. That means buying 1000 Hadoop servers to keep pace with 100 Greenplum servers. That’s an enormous cost in terms of power, capital expenditure, datacenter space, and more.
Setting aside performance questions for a moment, there are good reasons why many programmers prefer to express their problems in MapReduce rather than SQL, and likewise why DBAs and analysts generally prefer SQL to getting their hands dirty writing code. Each group is trained to approach problems in a certain way, and each prefers the mode of expression that best fits its skills and experience.
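To make the difference in style concrete, here is a minimal sketch of one small aggregation written both ways. It is plain Python with an in-memory SQLite database standing in for a real engine; it is not Greenplum's or Hadoop's API. The sample data, the `visits` table, and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (source_ip, ad_revenue) pairs.
visits = [("10.0.0.1", 1.5), ("10.0.0.2", 0.25), ("10.0.0.1", 2.0)]


# --- The programmer's view: a map function and a reduce function ---
def map_visit(record):
    """Emit one (key, value) pair per input record."""
    source_ip, revenue = record
    yield source_ip, revenue


def reduce_revenue(source_ip, revenues):
    """Combine all values that share a key."""
    return source_ip, sum(revenues)


def run_mapreduce(records):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_visit(record):
            groups[key].append(value)
    return dict(reduce_revenue(key, values) for key, values in groups.items())


# --- The analyst's view: the same question as a SQL query ---
def run_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    ).fetchall()
    conn.close()
    return dict(rows)


mr_result = run_mapreduce(visits)
sql_result = run_sql(visits)
print(mr_result == sql_result)
```

Both versions answer the same question. The MapReduce form spells out the procedure step by step, while the SQL form states the desired result and leaves the execution plan to the engine.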
The good news is that Hadoop isn’t synonymous with MapReduce from a performance perspective. Here at Greenplum we’ve implemented MapReduce natively on our parallel dataflow engine, using the same building blocks used to execute SQL at high performance and massive scale. That means that users get the best of both worlds: the ability to analyze their data using SQL, MapReduce, or both together in the same program, with industry-leading performance in either case. The choice is yours.


Add A Comment
Dave Menninger, Vertica
Hi Ben,
If you read the conclusion of the paper, it states that "there is a lot to learn from both kinds of systems." I would think you would endorse that kind of comment about MapReduce. Also, it would be interesting if Greenplum ran both sides of the benchmark using its own product, both as a point of comparison with the published results and as a comparison of the two techniques within the same environment.