03.16.2009 :: Ben Werther
Category:: March

Product Perspective: Greenplum Reinvents Data Loading

What do you do when your nightly load is taking longer and longer — and starts creeping into the daylight hours? How do you keep up with real-time streams that come in more quickly than your database can consume them? How can you keep up with exploding data volumes that overwhelm your ability to load and process them?

For Greenplum customers, these concerns are only a memory. Today we’re announcing the technology that allows customers like Fox Interactive, iCrossing, NYSE, Nasdaq, Reliance Communications and Bakrie Telecom to load data at the fastest rates in the industry, whether doing nightly bulk loads or continuous micro-batching throughout the day. Fox Interactive Media (the parent company of MySpace, responsible for ad monetization for the site) is loading consistently at 4 terabytes an hour. And this isn’t some controlled benchmark: this is the real-world rate for complex ELT loading into one of the largest data warehouses in the world.

This is an order of magnitude faster than most Oracle warehouses, 8x faster than the 500GB/hr that Netezza touts, and well ahead of anyone else in the industry. At 4TB/hr, loading 1TB of data takes 15 minutes (vs. 5 hours at a more conventional rate of 200GB/hr).
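For readers who want to check the arithmetic or plug in their own numbers, here is a minimal sketch in Python. The function name `load_time_hours` is ours, purely for illustration, and it assumes a decimal terabyte (1TB = 1000GB) and a sustained load rate.

```python
def load_time_hours(data_gb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to load data_gb at a sustained rate (illustrative helper)."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return data_gb / rate_gb_per_hour


ONE_TB_GB = 1000  # assumption: decimal terabyte

# 1TB at the 4TB/hr rate quoted above, expressed in minutes.
fast_minutes = load_time_hours(ONE_TB_GB, 4000) * 60

# 1TB at the "more conventional" 200GB/hr rate, in hours.
slow_hours = load_time_hours(ONE_TB_GB, 200)

# Ratio between 4TB/hr and the 500GB/hr figure quoted above.
speedup_vs_500 = 4000 / 500
```

Swap in your own data volume and measured rate to see how long your nightly window really needs to be.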

How do we do it? The heart of the Greenplum Database is a state-of-the-art parallel processing engine, which we call our ‘Parallel Dataflow Engine’. When users run SQL queries and MapReduce programs, they execute in this engine, which coordinates processing and data movement between tens or hundreds of servers to achieve the best performance possible. We eliminate every sequential bottleneck to make sure that everything runs in parallel as efficiently as possible.

We took the same approach to reinvent loading. We extract full parallelism from multiple sources (files, databases, applications, ETL/DI systems, etc.) and stream them to all nodes of a Greenplum Database cluster in parallel. We call this technology ‘MPP Scatter/Gather Streaming’ (aka SG Streaming). The secret sauce is the ability to split sources into pieces that we can ‘scatter’ across hundreds or thousands of streams that flow directly to all database nodes. As the data arrives, we apply any in-flight transformation and cleansing before writing it to disk (in parallel, automatically partitioned across nodes, with optional on-disk compression). There are no sequential bottlenecks, and performance increases linearly with the number of nodes in the cluster (with no theoretical maximum). You can find more details here.
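To make the scatter idea concrete, here is a toy sketch in Python. It is not Greenplum's implementation: the names (`split_source`, `clean`, `segment_for`, `scatter_gather_load`), the four-segment cluster, the pipe-delimited row format and the CRC32 hash are all assumptions made for illustration. It splits one source into several streams, cleans each row in flight, and routes it to a segment by a hash of its key.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size


def split_source(rows, num_streams):
    """Scatter: cut one source into num_streams interleaved pieces."""
    return [rows[i::num_streams] for i in range(num_streams)]


def clean(row):
    """In-flight transformation: trim whitespace and lowercase the key."""
    key, value = row.split("|")
    return key.strip().lower(), value.strip()


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a target segment from a stable hash of the key."""
    return crc32(key.encode("utf-8")) % num_segments


def process_stream(piece):
    """One stream: clean each row and bucket it by target segment."""
    buckets = {s: [] for s in range(NUM_SEGMENTS)}
    for row in piece:
        key, value = clean(row)
        buckets[segment_for(key)].append((key, value))
    return buckets


def scatter_gather_load(rows, num_streams=8):
    """Run all streams concurrently and collect rows per segment."""
    pieces = split_source(rows, num_streams)
    segments = {s: [] for s in range(NUM_SEGMENTS)}
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        for buckets in pool.map(process_stream, pieces):
            for seg, items in buckets.items():
                segments[seg].extend(items)
    return segments


# Hypothetical input: 100 untidy pipe-delimited rows.
rows = [f" User{i} | {i * 10} " for i in range(100)]
segments = scatter_gather_load(rows)
```

Note that this sketch merges the per-stream buckets in a single thread for simplicity, which is exactly the kind of sequential step the real system is described as avoiding by having streams flow directly to the nodes.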

This technology comes built in to the Greenplum product at no extra charge. It powers our ‘external table’ interface for loading and streaming external data sources, as well as our ‘gpload’ command line loading utility. Not bad — as easy to use as it is breathtakingly fast.
