Parallel Query Optimizer

Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan. It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.

Unlike a traditional query optimizer, Greenplum's optimizer takes a global view of execution across the cluster, and factors in the cost of moving data between nodes in any candidate plan. The benefit of this 'global' query planning approach is that it can use global knowledge and statistical estimates to build an optimal plan once and ensure all nodes execute it in a fully coordinated fashion. This leads to far more predictable results than the alternative approach of 'SQL pushing' snippets that must be replanned at each node.

The resulting query plans contain traditional physical operations - scans, joins, sorts, aggregations, etc - as well as parallel 'motion' operations that describe when and how data should be transferred between nodes during query execution.

Greenplum Database has three kinds of 'motion' operations that may be found in a query plan:

  • Broadcast Motion (N:N) - Every segment sends the target data to all other segments
  • Redistribute Motion (N:N) - Every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment
  • Gather Motion (N:1) - Every segment sends the target data to a single node (usually the master)

Here is an example SQL statement, and the resulting physical execution plan containing 'motion' operations: