Parallel Dataflow Engine

At the heart of the Greenplum Database is the Parallel Dataflow Engine. This is where the real work of processing and analyzing data is done. The Parallel Dataflow Engine is an optimized parallel processing infrastructure that is designed to process data as it flows from disk, from external files or applications, or from other segments over the gNet interconnect. The engine is inherently parallel – its spans all segments of a Greenplum Database cluster and can scale effectively to 1000s of commodity processing cores.

The engine was designed based on supercomputing principles, with the idea that large volumes of data have ‘weight’ (i.e. aren’t easily moved around) and so processing should be pushed as close as possible to the data. In the Greenplum architecture this coupling is extremely efficient, with massive I/O bandwidth directly to and from the engine on each segment. The result is that a wide variety of complex processing can be pushed down as close as possible to the data for maximum processing efficiency and incredible expressiveness.

Greenplum’s Parallel Dataflow Engine is highly optimized at executing both SQL and MapReduce, and does so in a massively parallel manner. It has the ability to directly execute all necessary SQL building blocks, including performance-critical operations such as hash-join, multi-stage hash-aggregration, SQL 2003 windowing (which is part of the SQL 2003 OLAP extensions that are implemented by Greenplum), and arbitrary MapReduce programs.