Greenplum Technologies

Technology Overview

Greenplum Unified Analytics Platform (UAP)

Greenplum Unified Analytics Platform (UAP) is a single unified platform enabling agile Big Data Analytics by fusing the co-processing of structured and unstructured data with a productivity engine to empower data science teams. Greenplum UAP delivers a simple entry point for organizations seeking business agility through high-performance data analytics, enabling them to become data savvy, to develop a data-centric culture and to evolve into data-driven companies.

The Greenplum UAP solution includes Greenplum Database, Greenplum HD, Greenplum Chorus, Greenplum Command Center, and Greenplum DCA, as well as the talent and services of the Greenplum Data Scientist team. The diagram below highlights all Greenplum products and explains how the parts work together to enable Big Data Analytics.

Greenplum Database

Greenplum Database™

Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture that has been designed from the ground up for BI and analytical processing using commodity hardware. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each 'segment' owns and manages a distinct portion of the overall data. All communication is via a network interconnect -- there is no disk-level sharing or contention to be concerned with (i.e. it is a 'shared-nothing' architecture).
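As a rough illustration of the partitioning described above, the following Python sketch hashes a distribution key to pick the owning segment. It is not Greenplum's actual distribution function: the names `NUM_SEGMENTS`, `segment_for`, and `distribute`, and the use of CRC32, are assumptions made for this example only.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for the example


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution key to a segment using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments=NUM_SEGMENTS):
    """Partition rows so that each segment owns a distinct subset of the data."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_column]), num_segments)].append(row)
    return segments


rows = [{"customer_id": i, "amount": 10 * i} for i in range(1, 13)]
segments = distribute(rows, "customer_id")
for seg_id in sorted(segments):
    print(seg_id, [r["customer_id"] for r in segments[seg_id]])
```

Because every row lands on exactly one segment, each segment can scan its own portion without coordinating disk access with the others.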

Most of today’s general-purpose relational database management systems (e.g. Oracle, Microsoft SQL Server) were originally designed for Online Transaction Processing (OLTP) applications. These databases utilize 'shared-disk' or 'shared-everything' architectures that are optimized for high transaction rates at the expense of individual query performance and parallelism.

Greenplum Database’s shared-nothing MPP architecture provides every segment with a dedicated, independent high-bandwidth channel to its disk. The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate. The degree of parallelism and overall scalability that this allows far exceeds that of general-purpose database systems.

Greenplum Database's architecture is the result of many years of design and implementation by a team of some of the world's leading database and scalable systems experts.

gNet™ Software Interconnect

In shared-nothing MPP database systems, data often needs to be moved whenever a join or an aggregation requires the data to be repartitioned across the segments. As a result, the interconnect is one of the most critical components within Greenplum Database. Greenplum’s gNet interconnect optimizes the flow of data to allow continuous pipelining of processing without blocking on all nodes of the system. The gNet interconnect is tuned and optimized to scale to tens of thousands of processors and leverages commodity Gigabit Ethernet and 10GigE switch technology.

At its core, the gNet software interconnect is a supercomputing-based ‘soft-switch’ that is responsible for efficiently pumping streams of data between motion nodes during query-plan execution. It delivers messages, moves data, collects results, and coordinates work among the segments in the system. It is the infrastructure underpinning the execution of the motion nodes that occur within parallel query plans on the Greenplum system.

Within the execution of each node in the query plan, multiple relational operations are processed by pipelining. For example, while a table scan is taking place, rows selected can be pipelined into a join process. Pipelining is the ability to begin a task before its predecessor task has completed, and this ability is key to increasing basic query parallelism. Greenplum Database utilizes pipelining whenever possible to ensure the highest-possible performance.
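The idea of pipelining a scan into a join can be sketched with Python generators, where the join starts consuming rows before the scan has finished. This is only an analogy for the behavior described above, not Greenplum's executor; `table_scan`, `hash_join`, and the sample data are hypothetical.

```python
def table_scan(table, predicate):
    """Yield matching rows one at a time instead of building a full list."""
    for row in table:
        if predicate(row):
            yield row


def hash_join(probe_rows, build_rows, key):
    """Build a hash table on one input, then stream the probe input through it."""
    lookup = {}
    for row in build_rows:
        lookup.setdefault(row[key], []).append(row)
    for probe in probe_rows:  # pulls rows from the scan lazily, one by one
        for match in lookup.get(probe[key], []):
            yield {**match, **probe}


orders = [
    {"cust": 1, "total": 50},
    {"cust": 2, "total": 5},
    {"cust": 1, "total": 75},
]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

scan = table_scan(orders, lambda r: r["total"] > 10)
joined = list(hash_join(scan, customers, "cust"))
print(joined)
```

No intermediate list of scanned rows is ever materialized: each selected row flows straight from the scan into the join.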

Greenplum MapReduce

The Power of Parallel Computing for Large-Scale Data Warehousing and Analytics

MapReduce has been proven as a technique for high-scale data analysis by Internet leaders such as Google and Yahoo. Greenplum gives enterprises the best of both worlds – MapReduce for programmers and SQL for DBAs – and will execute both MapReduce and SQL directly within Greenplum’s parallel dataflow engine, which is at the heart of the Greenplum Database.

Greenplum MapReduce enables programmers to run analytics against petabyte-scale datasets stored in and outside of the Greenplum Database. Greenplum MapReduce brings the benefits of a growing standard programming model to the reliability and familiarity of the relational database. The new capability expands the Greenplum Database to support MapReduce programs.

“Greenplum has been mastering the use of a parallel dataflow engine as the heart of our core product, Greenplum Database. With several years of experience developing and deploying Greenplum Database at companies large and small, we know how to build highly scalable data solutions. Adding MapReduce means our customers will be able to use a leading-edge new technology on a stable, reliable foundation.”
Luke Lonergan, CTO, Greenplum

Key Driver: Need For Petabyte-Scale Data Analytics

Greenplum customers have been involved in an early-access program utilizing Greenplum MapReduce for advanced analytics. For example, LinkedIn is using Greenplum Database for new innovative social networking features such as “People You May Know” and is evaluating Greenplum MapReduce as a way to develop compelling analytics products faster. A primary benefit of the new capability is that customers can combine SQL queries and MapReduce programs into unified tasks that are executed in parallel across hundreds or thousands of cores.

“The integration of MapReduce into Greenplum Database creates new ways to manage our text analysis efforts. What previously would require us to take data out of the database or write complex SQL queries can now be simplified into a few lines of code.”
Roger Magoulas, Research Director, O’Reilly Media

"The most exciting aspect of MapReduce is the excitement it is generating. It's attracting talented programmers -- many of whom don't want to buy or use SQL databases -- and enabling them to wrangle enormous data sets without leaving their familiar programming paradigms. Any movement that brings that much compute power to a larger talent base has the potential to produce game-changing results."
Joe Hellerstein, Professor, UC Berkeley

Key Features of Greenplum MapReduce
  • Combine SQL & MapReduce
  • Process any type of data wherever it lives
  • Enterprise-level integration and support

Parallel Query Optimizer

Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan. It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.

Unlike a traditional query optimizer, Greenplum's optimizer takes a global view of execution across the cluster, and factors in the cost of moving data between nodes in any candidate plan. The benefit of this 'global' query planning approach is that it can use global knowledge and statistical estimates to build an optimal plan once and ensure all nodes execute it in a fully coordinated fashion. This leads to far more predictable results than the alternative approach of 'SQL pushing' snippets that must be re-planned at each node.
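To make the role of data-movement cost concrete, here is a deliberately simplified Python sketch of choosing among candidate join plans by the number of rows each would send over the interconnect. The cost formulas and the names `candidate_plans` and `choose_plan` are assumptions for illustration; the real optimizer weighs many more factors.

```python
def candidate_plans(rows_left, rows_right, num_segments):
    """Estimate rows moved over the interconnect for each candidate join plan.

    The formulas are simplifying assumptions for illustration only.
    """
    return {
        "broadcast_left": rows_left * num_segments,
        "broadcast_right": rows_right * num_segments,
        "redistribute_both": rows_left + rows_right,
    }


def choose_plan(rows_left, rows_right, num_segments):
    """Pick the candidate plan with the lowest estimated data movement."""
    plans = candidate_plans(rows_left, rows_right, num_segments)
    return min(plans, key=plans.get)


print(choose_plan(1_000_000, 100, 8))
print(choose_plan(1_000_000, 900_000, 8))
```

Under these assumptions, a tiny table is cheaper to broadcast, while two large tables are cheaper to redistribute on the join column.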

The resulting query plans contain traditional physical operations - scans, joins, sorts, aggregations, etc. - as well as parallel 'motion' operations that describe when and how data should be transferred between nodes during query execution.

Greenplum Database has three kinds of 'motion' operations that may be found in a query plan:

  • Broadcast Motion (N:N) - Every segment sends the target data to all other segments
  • Redistribute Motion (N:N) - Every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment
  • Gather Motion (N:1) - Every segment sends the target data to a single node (usually the master)
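The three motion types listed above can be modeled in a few lines of Python, treating each segment as a list of rows. This is a toy model under stated assumptions (CRC32 as the hash, in-memory lists as segments), not the gNet implementation; all function names are hypothetical.

```python
import zlib


def broadcast_motion(segments):
    """N:N - every segment ends up holding the rows of all segments."""
    everything = [row for rows in segments for row in rows]
    return [list(everything) for _ in segments]


def redistribute_motion(segments, key):
    """N:N - rehash each row on `key` and send it to the owning segment."""
    n = len(segments)
    out = [[] for _ in range(n)]
    for rows in segments:
        for row in rows:
            target = zlib.crc32(str(row[key]).encode("utf-8")) % n
            out[target].append(row)
    return out


def gather_motion(segments):
    """N:1 - every segment sends its rows to a single node."""
    return [row for rows in segments for row in rows]


segments = [
    [{"id": 1, "dept": "a"}, {"id": 2, "dept": "b"}],
    [{"id": 3, "dept": "a"}],
    [{"id": 4, "dept": "c"}, {"id": 5, "dept": "b"}],
]
by_dept = redistribute_motion(segments, "dept")
print(by_dept)
```

After the redistribute, all rows that share a `dept` value sit on the same segment, which is what allows a join or aggregation on that column to proceed locally.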

Here is an example SQL statement, and the resulting physical execution plan containing 'motion' operations:

Multi-Level Fault Tolerance

Greenplum Database is architected to have no single point of failure. Internally the system utilizes log shipping and segment-level replication to achieve redundancy, and provides automated failover.

The system provides multiple levels of redundancy and integrity checking. At the lowest level, Greenplum Database utilizes RAID-0+1 or RAID-5 storage to detect and mask disk failures. At the system level, Greenplum continuously replicates all segment and master data to other nodes within the system to ensure that the loss of a machine will not impact the overall database availability. Greenplum also utilizes redundant network interfaces on all systems, and specifies redundant switches in all reference configurations.

Figure: Segment Configurations – 4 Servers, 3 Active Segments per Server

Figure: 1 Server Failure – 3 Servers, 4 Active Segments per Server
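The two configurations in the figure captions above (4 servers with 3 active segments each, then 3 servers with 4 active segments each) can be reproduced with a small Python sketch. The mirror placement rule used here is an assumption chosen so that a failed server's load spreads evenly; the source does not specify the actual placement algorithm.

```python
from collections import Counter

NUM_SERVERS = 4
PRIMARIES_PER_SERVER = 3


def build_layout():
    """Place each primary on its home server and its mirror on another server.

    Spread placement (an assumption for this sketch): the j-th primary of
    server s is mirrored on server (s + j + 1) % NUM_SERVERS.
    """
    layout = {}
    for s in range(NUM_SERVERS):
        for j in range(PRIMARIES_PER_SERVER):
            layout[(s, j)] = {"primary": s, "mirror": (s + j + 1) % NUM_SERVERS}
    return layout


def active_segments(layout, failed_server=None):
    """Count active segments per surviving server, promoting mirrors as needed."""
    counts = Counter()
    for placement in layout.values():
        host = placement["primary"]
        if host == failed_server:
            host = placement["mirror"]  # fail over to the mirror copy
        counts[host] += 1
    return dict(counts)


layout = build_layout()
print(active_segments(layout))
print(active_segments(layout, failed_server=2))
```

With this placement, losing any one server leaves every segment available and raises the load on each surviving server by exactly one segment.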

The result is a system that meets the reliability requirements of some of the most mission-critical operations in the world.

Polymorphic Data Storage™

Traditionally, relational data has been stored in rows, i.e. as a sequence of tuples in which all the columns of each tuple are stored together on disk. This has a long heritage back to early OLTP systems that introduced the 'slotted page' layout that is still in common use today. However, analytical databases tend to have different access patterns than OLTP systems. Instead of seeing many single-row reads and writes, analytical databases must process larger, more complex queries that touch much larger volumes of data, i.e. read-mostly workloads with big scanning reads and infrequent batch appends of data.

Vendors have taken a number of different approaches to meeting these needs. Some have optimized their disk layouts to eliminate the OLTP fat and do smarter disk scans. Others have turned their storage sideways (literally) with column-stores - i.e. the decades old idea of 'vertical decomposition' demonstrated to good success by Sybase IQ, and now reimagined by a raft of newer vendors. Each of these approaches has proven to have sweet-spots where they shine, and others where they do a less admirable job.

Rather than advocating for one approach or the other, we've built in the flexibility so that customers can choose the right strategy for the job at hand. We call this Polymorphic Data Storage™. For each table (or partition of a table), the DBA can select the storage, execution and compression settings that suit the way that table will be accessed. With Polymorphic Data Storage™, the database transparently abstracts the details of any table or partition, allowing a wide variety of underlying models:

  • Read/Write Optimized — Traditional 'slotted page' row-oriented table (based on PostgreSQL's native table type), optimized for fine-grained CRUD operations.
  • Row-Oriented / Read-Mostly Optimized -- Optimized for read-mostly scans and bulk append loads. DDL allows optional compression ranging from fast/light to deep/archival.
  • Column-Oriented / Read-Mostly Optimized — Added as a feature in Greenplum's latest 3.3.4 release, providing a true column-store just by specifying 'WITH (orientation=column)' on a table. Data is vertically partitioned, and each column is stored in a series of large densely-packed blocks that can be efficiently compressed from fast/light to deep/archival (and tend to see notably higher compression ratios than row-oriented tables). Performance is excellent for those workloads suited to column-store — Greenplum's implementation only scans those columns required by the query, doesn't have the overhead of per-tuple IDs, and does efficient early materialization using an optimized 'columnar append' operator.
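The difference between the row-oriented and column-oriented layouts above can be sketched in Python. This is a conceptual model only, using in-memory lists and dictionaries; the helper names are hypothetical and it says nothing about Greenplum's on-disk block format.

```python
rows = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 80.0},
    {"id": 3, "region": "east", "amount": 45.5},
]


def to_columns(row_table):
    """Vertically partition a row-oriented table into one list per column."""
    columns = {}
    for row in row_table:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(row_table, column):
    """A row store visits every full tuple, even to read a single column."""
    return sum(row[column] for row in row_table)


def sum_column_store(columns, column):
    """A column store reads only the column the query needs."""
    return sum(columns[column])


columns = to_columns(rows)
print(sum_row_store(rows, "amount"), sum_column_store(columns, "amount"))
```

Both layouts return the same answer; the column layout simply avoids touching `id` and `region`, and its runs of same-typed values are what tend to compress well.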

Greenplum's Polymorphic Data Storage™ really shines when combined with Greenplum's multi-level table partitioning. With Polymorphic Data Storage™, customers can tune the storage types and compression settings of different partitions within the same table. For example, a single partitioned table could have older data stored as 'column-oriented with deep/archival compression', more recent data as 'column-oriented with fast/light compression', and the most recent data as 'read/write optimized' to support fast updates and deletes.

Parallel Dataflow Engine

At the heart of the Greenplum Database is the Parallel Dataflow Engine. This is where the real work of processing and analyzing data is done. The Parallel Dataflow Engine is an optimized parallel processing infrastructure that is designed to process data as it flows from disk, from external files or applications, or from other segments over the gNet interconnect. The engine is inherently parallel – it spans all segments of a Greenplum Database cluster and can scale effectively to thousands of commodity processing cores.

The engine was designed based on supercomputing principles, with the idea that large volumes of data have ‘weight’ (i.e. aren’t easily moved around) and so processing should be pushed as close as possible to the data. In the Greenplum architecture this coupling is extremely efficient, with massive I/O bandwidth directly to and from the engine on each segment. The result is that a wide variety of complex processing can be pushed down as close as possible to the data for maximum processing efficiency and incredible expressiveness.

Greenplum’s Parallel Dataflow Engine is highly optimized at executing both SQL and MapReduce, and does so in a massively parallel manner. It has the ability to directly execute all necessary SQL building blocks, including performance-critical operations such as hash-join, multi-stage hash-aggregation, SQL 2003 windowing (which is part of the SQL 2003 OLAP extensions that are implemented by Greenplum), and arbitrary MapReduce programs.
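Multi-stage hash aggregation, one of the operations named above, can be sketched as a local aggregation on each segment followed by a merge of the partial results. The Python below is a minimal model under that assumption; `local_aggregate`, `final_aggregate`, and the sample data are hypothetical.

```python
from collections import Counter


def local_aggregate(segment_rows, key, value):
    """Stage 1: each segment builds partial sums over its own rows."""
    partial = Counter()
    for row in segment_rows:
        partial[row[key]] += row[value]
    return partial


def final_aggregate(partials):
    """Stage 2: merge the per-segment partial sums into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


segments = [
    [{"region": "east", "sales": 10}, {"region": "west", "sales": 5}],
    [{"region": "east", "sales": 7}],
    [{"region": "west", "sales": 3}, {"region": "north", "sales": 4}],
]
result = final_aggregate(local_aggregate(s, "region", "sales") for s in segments)
print(result)
```

Only one partial row per group leaves each segment, so far less data crosses the interconnect than if raw rows were shipped to a single node.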

MPP Scatter/Gather Streaming™ Technology

Greenplum's new MPP Scatter/Gather Streaming™ (SG Streaming™) technology eliminates the bottlenecks associated with other approaches to data loading, enabling lightning-fast flow of data into the Greenplum Database for large-scale analytics and data warehousing. Greenplum customers are achieving production-loading speeds of over four terabytes per hour with negligible impact on concurrent database operations.

Scatter/Gather Streaming:

  • Manages the flow of data into all nodes of the database
  • Does not require additional software or systems
  • Takes advantage of the same Parallel Dataflow Engine nodes in Greenplum Database

Greenplum utilizes a 'parallel-everywhere' approach to loading in which data flows from one or more source systems to every node of the database without any sequential choke points. This differs from the traditional “bulk loading” technologies used by most mainstream database and MPP appliance vendors, which push data from a single source, often over a single channel or a small number of parallel channels, and result in fundamental bottlenecks and ever-increasing load times. Greenplum's approach also avoids the need for a 'loader' tier of servers, as required by some other MPP database vendors, which can add significant complexity and cost while effectively bottlenecking the bandwidth and parallelism of communication into the database.

Greenplum’s SG Streaming™ technology ensures parallelism by 'scattering' data from all source systems across 100s or 1000s of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations. Data can be transformed and processed in-flight, utilizing all nodes of the database in parallel, for extremely high-performance ELT (extract-load-transform) and ETLT (extract-transform-load-transform) loading pipelines. Final 'gathering' and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed to the DBA via a flexible and programmable "external table" interface and a traditional command-line loading interface.
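The scatter, in-flight transform, and gather steps described above can be approximated with a short Python sketch using a thread pool. This models the data flow only, not the external table interface or the command-line loader; the function names, the `key,value` line format, and the CRC32 routing are all assumptions for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 3  # hypothetical number of database segments


def scatter(source_lines, num_streams):
    """Split one source into `num_streams` interleaved streams."""
    return [source_lines[i::num_streams] for i in range(num_streams)]


def transform(stream):
    """In-flight transformation: parse 'key,value' lines into typed tuples."""
    return [(k, int(v)) for k, v in (line.split(",") for line in stream)]


def gather(parsed_streams, num_segments=NUM_SEGMENTS):
    """Route every row to the segment that owns its key."""
    segments = [[] for _ in range(num_segments)]
    for stream in parsed_streams:
        for key, value in stream:
            target = zlib.crc32(key.encode("utf-8")) % num_segments
            segments[target].append((key, value))
    return segments


sources = [
    ["a,1", "b,2", "c,3", "d,4"],
    ["e,5", "f,6"],
]
streams = [s for src in sources for s in scatter(src, num_streams=2)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parsed = list(pool.map(transform, streams))  # streams are parsed concurrently
loaded = gather(parsed)
print(loaded)
```

Every source contributes several streams and no single stream carries all the data, which is the property that removes the sequential choke point.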

Greenplum HD

Greenplum HD is a 100-percent open-source certified and supported version of the Apache Hadoop stack. It includes Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, HBase, and Zookeeper. Greenplum HD’s packaged Hadoop distribution removes the pain associated with building out a Hadoop cluster from scratch, which is required with other distributions. Greenplum has also incorporated a pluggable storage layer into Hadoop, enabling customers to leverage best-of-breed storage options without requiring changes to existing applications.


Figure: Greenplum HD Software Architecture

Hadoop Components


Hadoop Distributed File System (HDFS)

HDFS is a Java-based file system that provides scalable and reliable data storage. With industry installations spanning thousands of nodes, HDFS has proven to be a solid foundation for any Hadoop deployment. The current HDFS version is 0.20.205.

MapReduce

MapReduce is a Hadoop framework for easily writing applications that process large amounts of unstructured and structured data in parallel, in a reliable and fault-tolerant manner. The framework is resilient to hardware failures, handling them transparently from user applications. The current MapReduce version is 0.20.205.
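The MapReduce programming model can be illustrated with a word count written in plain Python. This sketch runs in a single process and does not use the Hadoop API; it only shows the map, shuffle, and reduce phases that the framework would distribute across a cluster. The function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data analytics", "Big Data platform", "data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call sees one record and each reduce call sees one key, the framework is free to run them on different machines and to rerun any that fail.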


Figure: HDFS and MapReduce in Greenplum HD

Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. This SQL-like interface gives users a row-based storage capability, which, along with compression, results in an improved compression ratio for storing data. The current Hive version is 0.7.1.

Pig

Pig is a procedural language for processing large, semi-structured data sets using the Hadoop MapReduce platform. It enables developers to more easily write MapReduce jobs by providing an alternative programming language to Java. The current Pig version is 0.9.1.

HBase

HBase is a distributed, versioned, column-oriented storage platform that provides random real-time read/write access to Big Data for user applications. The current HBase version is 0.90.4.

Zookeeper

Zookeeper is a highly available system for coordinating distributed processes. Distributed applications use Zookeeper to store and mediate updates to key configuration information. The current Zookeeper version is 3.3.3.

High-Performance gNet for Hadoop

Greenplum Database enables high performance parallel import and export of compressed and uncompressed data from Hadoop clusters using gNet for Hadoop, a parallel communications transport with the industry's first direct query interoperability between Greenplum Database nodes and corresponding Hadoop nodes. To further streamline resource consumption during load times, custom-format data (binary, Pig, Hive, etc.) in Hadoop can be converted to GPDB Format via MapReduce, and then imported into Greenplum Database. This is a high-speed direct integration option that provides an efficient and flexible data exchange between Greenplum Database and Hadoop. gNet for Hadoop is available for both Greenplum HD and Greenplum MR.

Pluggable Storage

Greenplum HD is a 100% open source certified version of Apache Hadoop based on the latest stable release from the Apache Software Foundation. Greenplum HD brings a pluggable storage layer to Hadoop, allowing enterprises to leverage best-of-breed storage options with no changes to existing Hadoop applications. In version 1.1, Greenplum HD adds support for Isilon OneFS storage in addition to the Hadoop Distributed File System (HDFS). Backed by the world's largest Hadoop support organization and tested at scale in Greenplum's 1,000 node Analytics Workbench, Greenplum HD provides a complete platform, including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution. The platform is designed to allow enterprises to focus more on analyzing their business rather than struggling with the technical complexities of configuring and managing a Hadoop cluster.


Figure: Greenplum HD DCA Software Architecture

Greenplum Chorus

The Analytics Cycle

The effectiveness of Big Data Analytics lies not only in the speeds and feeds of the data warehouse or the analytical tools, but also in the agility of the corporate business process, the ability to iterate the analysis quickly on a full set of data, and the ability to collaborate with the entire data science team.

The legacy analytics process requires the analyst to weed through the data warehouse to find the data, obtain access to the data from all the different data owners, schedule time with data owners to learn about the data structure, request sandbox access from corporate IT, move data to the sandbox, analyze the data, and finally operationalize the model. The process is extremely time-intensive and often yields stale and incomplete data sets, which in turn jeopardize the accuracy of the model. Additionally, the legacy process is siloed within the analyst’s environment, with little room for collaborative efforts.

Figure: Legacy analytics process

Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from both within the organization and external sources. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data.

Figure: Analytics cycle with Greenplum Chorus

Using this interface, data scientists can save time and increase their productivity as they navigate the processes involved in big data analytics.

  1. Finding data. Because Greenplum Chorus automatically indexes data, data scientists and analysts can now more easily explore data stores across the enterprise. They can quickly identify datasets regardless of location, browse schemas and Hadoop files, and preview the data for instant understanding.

    Greenplum Chorus’ comprehensive search function allows analysts to search all data within Chorus—including comments, SQL, Hadoop, and Greenplum Database. They can probe the data warehouse and come up with all of the insight that other analysts have generated off of it, producing a list of imports and datasets complete with the comments and questions associated with each.

  2. Getting access to data. In the past, data scientists who wanted access to data needed to deploy and operate a new Massively Parallel Processing (MPP) data warehouse—a time-consuming process. But agile big data practices need to move at a much faster pace.

    Using Greenplum Chorus, IT staff and database administrators can establish pools of commodity servers and storage ahead of demand, allowing them to create new database instances and sandboxes in minutes with just a few clicks. There is no longer a need for analysts and data scientists to file a ticket with IT and wait days or even weeks for a resolution.

  3. Learning about data. As they work to understand datasets, most data scientists must export data to a local desktop, import it into R or a similar tool, and then create a visualization. Greenplum Chorus accelerates insight and understanding by providing data science teams with a rapid visual representation of information.

    Greenplum Chorus supports histograms, frequency, heat-map, time series, and box plot charts for on-demand visualizations. As a precursor to more sophisticated tools, this functionality enables the simple, ad hoc reporting that often serves as a starting point for innovation.

  4. Moving data to a sandbox. Providing data analysts with their own databases and sandboxes is important. But it’s essential that analysts be able to move their data fluidly into a place where they can work on it and combine it with other data of interest.

    With Greenplum Chorus’ self-service provisioning, users can provision new sandboxes as a standalone Greenplum Database on virtual infrastructure provided by VMware vFabric. Alternatively, they can instantiate new sandbox schemas on the fly within existing Greenplum Database instances.

    Once a sandbox goes live, Greenplum Chorus lets analysts browse for interesting datasets among all those that they have permission to access. Regardless of where the data “lives,” they can request that all or a slice of it be populated into their Chorus sandbox. This applies not only to data in Greenplum Database instances within Chorus, but also sources such as Oracle, Microsoft SQL Server, and Teradata, as well as file sources and web services.

    Throughout these processes, Greenplum Chorus will schedule and manage the flow of data, track dependencies, and update each data scientist’s copy whenever the source database gets updated with new data. At the conclusion of a project, IT can return computing resources to the pool for use in future projects.

  5. Analyzing data. Once data scientists have imported data into their Greenplum Chorus workspaces, they can analyze it and share the results as a new dataset that other analysts can discover and access. They can also invite colleagues to the project and solicit their input.

    The data science teams of today increasingly include business users as well as data scientists. Greenplum Chorus provides a data wizard that lets business users analyze datasets without the need to remember complex syntax and write SQL queries.

    As the data science team works, anyone associated with the workspace—including executives and managers—can monitor progress, rather than simply waiting for results at the end of the project. Meanwhile, database administrators will be able to see a list of sandboxes that are using any instance of data, giving them important context as to the business value of any dataset.

  6. Publishing results. After hours of hard work, too many analysts publish their valuable insights only to a limited audience via email. These insights will remain invisible to searches, and will be difficult to leverage in future analysis projects.

    Greenplum Chorus provides rich social network features that facilitate participation and collaboration among data analysts, IT, DBAs, executives, and other stakeholders. As a result, the entire data science team can bring their datasets to the table and collaboratively discover, share, and discuss insights that have a meaningful impact on the business. For example, the Create Insight feature makes it easy to publish an insight across Chorus. An analyst can share any insight directly with team members, who can then make comments, brainstorm about overlooked possibilities—and generate new questions of their own.

Insights Sharing

Greenplum Chorus breaks down the silos across the enterprise by replacing the backlog of email with a single interface for all your organization’s data, together with virtual databases for exploration and innovation and social collaboration for insight and analysis. Greenplum Chorus provides rich social network features that revolve around datasets, insights and other key Chorus components, allowing data analysts, IT, DBAs, executives and other stakeholders to all participate and collaborate in the same environment. The result is that the data science team can bring ideas to the table and collaboratively discover, share and discuss insights that have a meaningful impact on the business.

User/Access Management and Data Instance

Your Data Science group should not have to worry about where data is stored. Regardless of whether data is structured or unstructured, it should be easy to preview, move, and analyze.

In Chorus, a data instance is either a data source or a data destination. Data from Greenplum Database or Greenplum HD’s HDFS (Hadoop Distributed File System) are connected natively to Chorus through the instance setup. Chorus can also access data in existing non-Greenplum systems through the use of Greenplum Database external tables.

User and Access Management

Chorus supports direct integration with corporate LDAP or Microsoft Active Directory for user and password management. This eliminates the need to manage users and passwords separately within Chorus.

When it comes to data access control, Chorus automatically takes into account database permissions. Users only see the data that they have access to and only perform the functions that administrators have enabled.

Data Instances

Registering Greenplum Database or Hadoop instances in Chorus is straightforward. Users can work with data easily whether it is stored in the database or in Hadoop. This means that Chorus provides a collaborative platform for the entire analytics workflow. For example, using Chorus, data engineers and analysts can share the code for MapReduce jobs and view the results of those jobs as structured files within Hadoop, or import the results into the database as structured data for further analysis.

Figure: Instance details and usage

Chorus is deeply integrated with the Greenplum Database. It understands the structure of tables, columns, and other database objects. It exposes useful information to analysts who need to understand the nature of the data as they explore the data cloud.

The Greenplum Database instance registration is done using:

  • Shared-account – Multiple Chorus users can share one account to gain connectivity and data access. Manual role-based access control can also be configured through multiple shared-account instance setups, with appropriate data grants done in the data source. Example: The Sales-Ops Database Instance can be registered with a database account with access to only sales-ops information. At the same time, the Marketing Database Instance can be registered with an account with access to only marketing information, while both instances point to the same database.
  • Non-shared account – If an instance is connected to the data source through a non-shared account, the Chorus user will be prompted for data source account/password upon connection.

Additionally, Chorus can provision and create a new standalone Greenplum Database on a VMware virtualized hardware cloud, allowing the Chorus Administrator to specify the hardware parameters (CPU, memory, etc.).

Figure: Registration or creation of data instance

Data Exploration, Search, and Data Dictionary

Chorus provides an easy way to navigate and search all data in your environment (including the Greenplum Databases, Greenplum HD, indexes, codes, comments, users, etc.) so that data analysts have access to data in a collaborative environment. Chorus encourages users to explore and visualize the data, to create new datasets, and to add documents and comments. This ensures that the sources of data are enriched as part of the everyday work of your data science teams. Chorus then automatically indexes all of this enriched metadata so that users can easily search for anything within your Enterprise Data Cloud.

Data Exploration

With Chorus, analysts can browse and explore datasets across the enterprise. Once data is imported into the workspace, Chorus manages the flow of data, tracks dependencies, and updates the analyst's copy when the source is updated with new data. Analysts can operate on the data, and can share the results of their analysis as a new dataset that can be discovered and accessed by other analysts.

Data Search

Greenplum Chorus’ global search function allows analysts to search all data within Chorus—including comments, SQL, Hadoop, and Greenplum Database. They can probe the data warehouse and come up with all of the insight that other analysts have generated off of it, producing a list of imports and datasets complete with the comments and questions associated with each.

Figure: Results from a global search

Data Dictionary

Gone are the days of hunting through email or shared drives for that comment or piece of code. Instead, everything is tracked and available through the Greenplum Chorus interface. Chorus delivers federated search across data assets anywhere in the enterprise. Chorus indexes all metadata, comments, SQL and data assets to create a living data dictionary available in the form of a search prompt.

Workspaces

A workspace is the place where Chorus users collaborate on an analytics project. Users request a workspace from the Chorus administrator, and it can be either public (all users have read access) or private (only members have access).

Each workspace can have a workspace summary tab, a sandbox tab, a data sources tab, and a work files tab.

Summary

This is a brief description of the workspace, including workspace-specific insights. It also includes the recent activity stream pertaining to this workspace.

Sandbox

Each workspace should have only one sandbox. This is where Chorus users import or work with the data. For more details, refer to the Sandbox section.

Data Sources

This is where Chorus users can search, browse, visualize, and create data sets, which are then imported and used in the sandbox.

Figure: Sample data from the data source

Work Files

All content used in the workspace can be shared and version controlled through the Work Files tab. This includes documents, presentations, and special file types (e.g. SQL). A SQL work file allows users to write SQL against the data in the sandbox, so they can do analytics directly in the database or use R in the database. Additionally, users can see what a data set looks like and create a data set using the same SQL.

Sandbox

Each workspace contains one sandbox, which is the equivalent of a schema in Greenplum Database. A sandbox can contain data from more than one source (Greenplum Database, HDFS, and external sources): users import the data and perform joins to give it structure.

Self-Service Provisioning of Sandbox

With Chorus, the data science team can create new sandboxes with a few simple clicks. There is no longer a need to file a ticket with IT and wait hours or weeks for the solution. With self-service provisioning, users can provision new sandboxes as a standalone Greenplum Database on virtual infrastructure provided by VMware’s vFabric. Alternatively, users can instantiate new sandbox schemas on the fly within existing Greenplum Database instances.

Data Import from External Data Sources

Chorus dramatically simplifies the process of importing data into the analyst sandbox. Users can upload a delimited file (e.g. CSV or tab-separated) directly from their desktops, and Chorus will automatically suggest a table structure for the data in the file. Users can then refine the structure and share the resulting dataset with other Chorus users. Chorus also integrates with existing systems through a REST API. Data imports can be scheduled to run automatically at an interval or refreshed manually.
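To make the idea of suggesting a table structure concrete, the following Python sketch infers a column type for each field of a delimited file. It is an assumption about how such a suggestion could work, not a description of Chorus internals; the function names and the three candidate types are illustrative only.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Pick the narrowest candidate SQL type that fits every non-empty value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "text"
    for sql_type, parser in (("bigint", int), ("double precision", float)):
        try:
            for v in non_empty:
                parser(v)  # raises ValueError if the value does not fit
            return sql_type
        except ValueError:
            continue
    return "text"


def suggest_structure(delimited_text: str, delimiter: str = ",") -> Dict[str, str]:
    """Map each header name to a suggested column type."""
    reader = csv.reader(io.StringIO(delimited_text), delimiter=delimiter)
    header = next(reader)
    rows = list(reader)
    return {
        name: infer_type([row[i] for row in rows if i < len(row)])
        for i, name in enumerate(header)
    }


sample = "id,amount,region\n1,9.5,west\n2,12,east\n"
suggested = suggest_structure(sample)
```

A real importer would also need to handle dates, quoting rules, and sampling of large files; the sketch leaves those out, which is one reason a user would still want to refine the suggested structure.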

On-Demand Visualization

Chorus provides visualization options for data preview to speed insight into and understanding of data. Chorus supports histogram, frequency, heat map, time series, and box plot charts for on-demand visualization. Because the data science team gets a rapid visual representation of information, there is no longer a need to export the data to a local desktop, import it into R or another analytics tool, and then create a visualization.

Greenplum DCA


Greenplum Data Computing Appliance

The EMC® Greenplum® Data Computing Appliance (DCA) offers the power of a massively parallel processing (MPP) architecture, while delivering the fastest data-loading rate and the best price/performance ratio in the industry—without the complexity and constraints of proprietary hardware. It is a unified Big Data analytics appliance — a modular solution for structured data, unstructured data, and Greenplum partner applications such as business intelligence (BI), and extract, transform and load (ETL).

Enterprises can grow their DCAs as their demand for processing capacity grows or as their analytics requirements evolve. It is easy to start with a single primary rack, which includes a Greenplum Database Module (Standard or High Capacity), and expand the appliance in quarter-rack increments using the Greenplum Database Standard Module, Greenplum Database High Capacity Module, Greenplum HD (Hadoop) Module, or Greenplum Data Integration Accelerator Module in any order and amount, up to 12 racks total. All modules are linked via a high-speed, low-latency interconnect.


Greenplum DCA Modules

Greenplum Database Standard Module

The Greenplum Database Standard Module is a purpose-built, highly scalable data-analytics appliance module that architecturally integrates database, computing, storage, and network into an enterprise-class, easy-to-implement system. This module is the industry leader in price and performance.

Greenplum Database High Capacity Module

The Greenplum Database High Capacity Module is designed to host multiple petabytes of data without taking up additional space, sharply raising power consumption, or increasing costs. For businesses that require detailed analysis of extremely large amounts of data or those looking for a longer-term archive, this module offers the lowest cost-per-unit data warehouse.

Greenplum Database Modules Hardware Configurations
Module Type                       Greenplum Database Standard Module   Greenplum Database High Capacity Module
Number of Servers                 4                                    4
Total Number of Cores             48 cores                             48 cores
Total Memory                      192 GB                               192 GB
Storage Type                      600 GB                               2 TB
Total Number of Storage Drives    48                                   48
Usable Capacity (uncompressed)    9 TB                                 31 TB
Usable Capacity (compressed)      36 TB                                124 TB
Scan Rate                         24 GB/sec                            14 GB/sec
Data Load Rate                    10 TB/hour                           10 TB/hour

Table: Hardware configurations of the Greenplum Database Modules

Greenplum HD Module

The Greenplum HD Module is the world’s first high-performance data co-processing Hadoop appliance module. The DCA fuses Hadoop with the Greenplum Database, allowing the co-processing of both structured and unstructured data within a single, seamless solution.

Greenplum HD Modules Hardware Configurations
                                        Greenplum HD Module  Greenplum HD DCA 1st Rack  Greenplum HD DCA 6 Racks  Greenplum HD DCA 12 Racks
Hadoop Data Nodes                       4                    8                          88                        184
Total CPU Cores                         48                   96                         1,056                     2,208
Total Memory                            192 GB               384 GB                     4,224 GB                  8,832 GB
Total # of Storage Drives (2 TB SATA)   48                   96                         1,056                     2,208
Usable Uncompressed (3 copies of data)  28 TB                56 TB                      616 TB                    1,288 TB
Usable Compressed Capacity (4:1)        112 TB               224 TB                     2,464 TB                  5,152 TB

Table: Hardware and sample configuration of the Greenplum HD Module
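The figures in the table scale linearly with the number of Hadoop data nodes, which the following Python sketch makes explicit. The per-node constants are derived by dividing the single-module column by its 4 nodes, and the rack formula (8 nodes in the first rack, 16 in each further rack) is inferred from the three rack columns; neither is stated directly in the source, so treat both as assumptions.

```python
from dataclasses import dataclass

# Per-data-node figures derived from the single-module column (4 nodes).
CORES_PER_NODE = 12
MEMORY_GB_PER_NODE = 48
DRIVES_PER_NODE = 12          # 2 TB SATA drives
USABLE_TB_PER_NODE = 7        # uncompressed, after 3 copies of data
COMPRESSION_RATIO = 4         # the 4:1 ratio used in the table


@dataclass(frozen=True)
class HdSizing:
    data_nodes: int
    cpu_cores: int
    memory_gb: int
    storage_drives: int
    usable_tb: int
    usable_compressed_tb: int


def size_for_nodes(data_nodes: int) -> HdSizing:
    """Scale the per-node figures to a given number of data nodes."""
    usable = data_nodes * USABLE_TB_PER_NODE
    return HdSizing(
        data_nodes=data_nodes,
        cpu_cores=data_nodes * CORES_PER_NODE,
        memory_gb=data_nodes * MEMORY_GB_PER_NODE,
        storage_drives=data_nodes * DRIVES_PER_NODE,
        usable_tb=usable,
        usable_compressed_tb=usable * COMPRESSION_RATIO,
    )


def nodes_for_racks(racks: int) -> int:
    """Inferred from the table: 8 nodes in rack 1, 16 in each further rack."""
    if not 1 <= racks <= 12:
        raise ValueError("the table covers 1 to 12 racks")
    return 8 + 16 * (racks - 1)
```

For example, `size_for_nodes(nodes_for_racks(6))` reproduces the 6-rack column: 88 nodes, 1,056 cores, 4,224 GB of memory, and 616 TB usable uncompressed.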

Greenplum Data Integration Accelerator (DIA) Module

The Greenplum Data Integration Accelerator (DIA) Module is designed to host partner analytics applications and to integrate them quickly with the Greenplum Data Computing Appliance. For example, it is used to solve the challenges of data loading in a parallel and scalable model, to shorten batch loads, or to implement micro-batch loading.

Greenplum DIA Module
Number of Servers 4
Total CPU Cores 48
Total Memory 192 GB
Storage Type 2 TB SATA
Total Number of Storage Drives 48
Number of Storage Drives 48
Usable Capacity (uncompressed) 70 TB

Table: Hardware configuration of the Greenplum DIA Module

White Paper: EMC Greenplum Data Integration Accelerator


Physical Architecture and Configuration


Master Servers

There are two Master Servers, one primary and one standby. The master database runs on the primary Master Server (replicated to the standby Master Server) and is the entry point into the Greenplum Database for end-users and applications. Through the master database, clients connect and submit SQL, MapReduce, and other query statements.

The primary Master Server is also where the global system catalog resides, that is, the set of system tables that contain metadata about the Greenplum Database itself. The primary Master Server does not contain any user data; user data resides only on the Segment Servers, and the primary Master Server does not execute queries against user data itself. The primary Master Server performs the following operations:

  • Authenticates client connections
  • Processes the incoming SQL, MapReduce, and other query commands
  • Distributes the work load between the Segment Instances
  • Presents aggregated final results to the client program

The primary Master Server mirrors logs to the standby so that the standby is available to take over in the event of a failure. The standby Master Server is kept up to date by a process that synchronizes the write-ahead log (WAL) from the primary to the standby.

Segment Servers

Segment Servers run segment database instances (Segment Instances). The majority of query processing occurs either within or between Segment Instances.

Each Segment Server runs several individual Segment Instances. Every Segment Instance has a segment of data from each user-defined table and index. In this way, queries are serviced in parallel by every Segment Instance on every Segment Server.

Users do not interact directly with the Segment Servers in a DCA. A user connects to the database and issues queries through the primary Master Server. The primary Master Server then issues distributed query tasks, and processes are created on each of the Segment Instances to handle the work of that query.
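The division of labor between the master and the segments can be illustrated with a small scatter-gather sketch in Python. It is a toy model under stated assumptions, not Greenplum's executor: rows are placed by hashing a key (one possible distribution policy), threads stand in for Segment Instances, and all names and the segment count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Dict[str, int]
NUM_SEGMENTS = 4  # illustrative only; not a DCA figure


def distribute(rows: List[Row], key: str, num_segments: int) -> List[List[Row]]:
    """Give each segment a distinct portion of the table by hashing a key."""
    segments: List[List[Row]] = [[] for _ in range(num_segments)]
    for row in rows:
        segments[hash(row[key]) % num_segments].append(row)
    return segments


def segment_task(segment_rows: List[Row]) -> Tuple[int, int]:
    """Work done on one segment: partial SUM and COUNT over local rows only."""
    return sum(r["amount"] for r in segment_rows), len(segment_rows)


def master_query(segments: List[List[Row]]) -> Tuple[int, int]:
    """Master role: dispatch the task to every segment, then combine results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_task, segments))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


rows = [{"id": i, "amount": i * 10} for i in range(1, 101)]
segments = distribute(rows, "id", NUM_SEGMENTS)
total, count = master_query(segments)
```

The master in this sketch never touches a row: it only hands out tasks and aggregates the partial results, which mirrors the statement that user data resides only on the Segment Servers.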

There are two types of Segment Instances: primary segment instances (primary segments) and mirror segment instances (mirror segments). Each primary segment instance has a corresponding mirror segment instance. A mirror segment always resides on a different host and subnet from its corresponding primary segment. This ensures that in a failover scenario, where a Segment Server is unreachable or down, the mirror counterpart of the primary instance is still available on another Segment Server.
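The placement rule can be expressed as a short Python sketch. The round-robin scheme below is an assumed layout chosen for illustration, not the DCA's documented one, and it models only the different-host requirement (not the subnet requirement); the host names are made up.

```python
from typing import Dict, List, Tuple


def place_mirrors(hosts: List[str], primaries_per_host: int) -> Dict[int, Tuple[str, str]]:
    """Return {segment_id: (primary_host, mirror_host)} with mirror != primary host."""
    if len(hosts) < 2:
        raise ValueError("mirroring needs at least two hosts")
    placement: Dict[int, Tuple[str, str]] = {}
    segment_id = 0
    for host_index, host in enumerate(hosts):
        for k in range(primaries_per_host):
            # The offset is always between 1 and len(hosts) - 1, so the
            # mirror can never land on the primary's own host.
            offset = 1 + (k % (len(hosts) - 1))
            mirror_host = hosts[(host_index + offset) % len(hosts)]
            placement[segment_id] = (host, mirror_host)
            segment_id += 1
    return placement


def survives_host_failure(placement: Dict[int, Tuple[str, str]], failed: str) -> bool:
    """True if every segment still has a primary or mirror on a live host."""
    return all(p != failed or m != failed for p, m in placement.values())


hosts = ["sdw1", "sdw2", "sdw3", "sdw4"]  # hypothetical host names
placement = place_mirrors(hosts, primaries_per_host=6)
```

Spreading the mirrors of one host over several other hosts, as this layout does, also means that a single host failure does not double the load on any one surviving host.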

Interconnect Bus

The Interconnect Bus provides a high-speed network for communication between the Master Servers and the Segment Servers, and among the Segment Servers themselves. The “interconnect” refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication relies.

The Interconnect Bus represents a private network and must not be connected to public or customer networks. The Interconnect Bus accommodates high-speed connectivity to ETL servers or environments and direct access to a Data Domain storage unit for efficient data protection.

The Interconnect Bus consists of two 32-port Converged Enhanced Ethernet (CEE), Fibre Channel over Ethernet (FCoE) switches. Each switch includes:

  • 24 x 10 GbE ports
  • 8 x Fibre Channel (FC) ports

To maximize throughput, interconnect activity is load-balanced over two interconnect networks. To ensure redundancy, a primary segment and its corresponding mirror segment use different interconnect networks. With this configuration, the DCA can continue operations in the event of a single Interconnect Bus failure.

General Configurations

High Availability, Backup/Restore, and Disaster Recovery



High Availability

The DCA includes features such as hardware and data redundancy, segment mirroring, and resource failover.

Master Server Redundancy

The DCA includes a standby Master Server to serve as a backup in case the primary Master Server becomes inoperative. The standby Master Server is a warm standby. If the primary Master Server fails, the standby is available to take over as the primary. The standby Master Server is kept up to date by a transaction log replication process, which runs on the standby and keeps the data between the primary and standby masters synchronized. If the primary Master Server fails, the log replication process is shut down, and the standby can be activated in its place. Upon activation of the standby, the replicated logs are used to reconstruct the state of the primary Master Server at the time of the last successfully committed transaction.

The activated standby Master Server effectively becomes the Greenplum Database primary Master Server, accepting client connections on the master port.

The primary Master Server does not contain any user data; it contains only the system catalog tables that need to be synchronized between the primary and standby copies. These tables are not updated frequently, but when they are, changes are automatically copied over to the standby Master Server so that it is always kept current with the primary.
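A simplified model of reconstructing the master state from replicated logs is sketched below in Python. The record layout and operation names are hypothetical and far simpler than a real write-ahead log; the point is only that replay applies changes from committed transactions and ignores work that was still in flight when the primary failed.

```python
from typing import Dict, List, Tuple

# Hypothetical log record: (transaction id, operation, catalog key, value)
LogRecord = Tuple[int, str, str, str]


def replay(log: List[LogRecord]) -> Dict[str, str]:
    """Rebuild catalog state, applying only committed transactions in log order."""
    committed = {txid for txid, op, _, _ in log if op == "commit"}
    state: Dict[str, str] = {}
    for txid, op, key, value in log:
        if txid not in committed:
            continue  # transaction never committed before the failure
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


replicated_log: List[LogRecord] = [
    (1, "set", "table_a", "v1"),
    (1, "commit", "", ""),
    (2, "set", "table_b", "v1"),   # transaction 2 has no commit record
    (2, "set", "table_a", "v2"),
]
standby_state = replay(replicated_log)
```

Here the standby ends up with only the change from transaction 1, which corresponds to the state at the time of the last successfully committed transaction.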

Segment Mirroring

Each primary segment instance in the DCA has a corresponding mirror segment instance. If a primary segment fails, its corresponding mirror segment will take over its role; the database stays online throughout. The mirror segment always resides on a different host and subnet than its corresponding primary segment.

During database operations, only the primary segment is active in processing requests. Changes to a primary segment are copied over to its mirror segment on another Segment Server using a file block replication process. Until a failure occurs on the primary segment, there is no live segment instance running on the mirror host for that particular segment of the database—only the replication process.

The system automatically fails over to the mirror segment whenever a primary segment becomes unavailable. A DCA can remain operational if a Segment Instance or Segment Server goes down, as long as all portions of data are available on the remaining active segments. Whenever the master cannot connect to a primary segment, it marks that primary segment as down in the Greenplum Database system catalog and brings up the corresponding mirror segment in its place. A failed segment can be recovered while the system is up and running. The recovery process only copies over the changes that were missed while the segment was out of operation.

In the event of a segment failure, the file replication process is stopped and the mirror segment is automatically brought up as the active segment instance. All database operations then continue using the mirror. While the mirror is active, it is also logging all transactional changes made to the database. When the failed segment is ready to be brought back online, administrators initiate a recovery process to bring it back into operation.
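The failover and recovery sequence can be modeled with a small Python class. This is a sketch under assumptions, not the file block replication mechanism itself: changes are modeled as key/value writes, and the class and method names are hypothetical. The sketch also leaves the surviving copy in the active role after recovery, since the text does not say when roles are switched back.

```python
from typing import Dict, List, Tuple


class SegmentPair:
    """Toy model of one primary segment and its mirror."""

    def __init__(self) -> None:
        self.data: Dict[str, Dict[str, int]] = {"primary": {}, "mirror": {}}
        self.active = "primary"
        self.standby_online = True
        self.missed: List[Tuple[str, int]] = []  # changes made while one copy is down

    def _standby(self) -> str:
        return "mirror" if self.active == "primary" else "primary"

    def write(self, key: str, value: int) -> None:
        self.data[self.active][key] = value
        if self.standby_online:
            self.data[self._standby()][key] = value  # normal replication
        else:
            self.missed.append((key, value))         # logged for later recovery

    def fail_active(self) -> None:
        """The active copy is marked down and the other copy takes over."""
        if not self.standby_online:
            raise RuntimeError("no in-sync copy left to fail over to")
        self.active = self._standby()
        self.standby_online = False

    def recover(self) -> int:
        """Bring the failed copy back by copying only the missed changes."""
        copied = len(self.missed)
        target = self.data[self._standby()]
        for key, value in self.missed:
            target[key] = value
        self.missed.clear()
        self.standby_online = True
        return copied


pair = SegmentPair()
pair.write("a", 1)
pair.fail_active()     # primary fails; mirror becomes active
pair.write("b", 2)
pair.write("a", 3)
copied = pair.recover()
```

Only the two changes made during the outage are copied during recovery, rather than the full contents of the segment.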

Backup/Restore

Figure: Disaster Recovery using Data Domain

Greenplum DCA backs up all Segment Servers in parallel directly to Data Domain using Data Domain Boost. This backup and restore method drastically reduces backup storage and backup network requirements, achieves effective backup rates of up to 14 TB/hour, and supports both full and partial replication with 1-to-1, 1-to-many, and cascading topologies. For more information on backup and restore with Data Domain, please refer to: .


Disaster Recovery

Figure: SAN Mirror – Greenplum DCA to Symmetrix VMAX

Symmetrix VMAX provides SAN mirroring disaster recovery for a Greenplum DCA environment by leveraging separately licensed TimeFinder/SnapShots and SRDF/Synchronous Replication for DR.

This method doubles available primary storage and splits storage between the local and SAN instances: Primary Segment Storage stays on local disk, while Mirror Segment Storage and Standby Master Storage reside on Symmetrix VMAX. VMAX is not used for reads/queries unless there is a segment failure. For more information on disaster recovery using EMC Symmetrix VMAX, please refer to: White Paper: EMC Greenplum Data Computing Appliance Using SAN Mirror and EMC Symmetrix VMAX for Disaster Recovery.


Greenplum DCA Software

Greenplum DCA Software includes a remote management feature with dial-home capabilities for hardware events as well as Greenplum Database software events.

The DCA dials home for the following hardware-related events: single-disk failure; virtual disk or RAID degradation; issues with the RAID controller battery, PSU, power usage, memory, NIC, CPU, fan, or 10 GbE switch (fan and PSU); and an SNMP daemon that is not working.

The following software issues trigger dial-home events: Greenplum Database unavailable, segment database down, master failover notification, core file dump detected, kernel crash, or disk space over limit. The DCA also dials home weekly to test its connection with EMC.

The DCA Software also sends these weekly dial-home reports: disk usage, license report, table size/index/overall capacity, number and types of functional modules, health state of the database, and percentage of hard drive full.
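For readers who want to reason about the event lists programmatically, the hardware and software dial-home triggers named above can be captured in a simple lookup table. The Python sketch below is illustrative only: the event identifiers are hypothetical labels for the listed events, not names used by the DCA Software.

```python
from enum import Enum
from typing import Dict, List


class Source(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


# Hypothetical identifiers for the dial-home events listed in the text.
DIAL_HOME_EVENTS: Dict[str, Source] = {
    "single_disk_failure": Source.HARDWARE,
    "virtual_disk_or_raid_degradation": Source.HARDWARE,
    "raid_controller_battery": Source.HARDWARE,
    "psu": Source.HARDWARE,
    "power_usage": Source.HARDWARE,
    "memory": Source.HARDWARE,
    "nic": Source.HARDWARE,
    "cpu": Source.HARDWARE,
    "fan": Source.HARDWARE,
    "switch_fan_or_psu": Source.HARDWARE,
    "snmp_daemon_not_working": Source.HARDWARE,
    "database_unavailable": Source.SOFTWARE,
    "segment_database_down": Source.SOFTWARE,
    "master_failover": Source.SOFTWARE,
    "core_file_dump_detected": Source.SOFTWARE,
    "kernel_crash": Source.SOFTWARE,
    "disk_space_over_limit": Source.SOFTWARE,
}


def should_dial_home(event: str) -> bool:
    """True if the event is one of the listed dial-home triggers."""
    return event in DIAL_HOME_EVENTS


def events_for(source: Source) -> List[str]:
    """List the trigger names for one source, sorted alphabetically."""
    return sorted(name for name, src in DIAL_HOME_EVENTS.items() if src is source)
```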

Command Center

Greenplum Command Center is a single console with a set of interactive dashboards, enabling administrators to collect performance metrics and manage system health for Greenplum products. Monitored data is also stored for historical reporting.

Greenplum Command Center has data collection agents that run on the Master Server and each Segment Server. The agents collect performance data on query execution and system utilization and send it to the Greenplum Command Center at regular intervals.

Greenplum Command Center Console provides a graphical web-based user-interface application where administrators can:

  • Perform query monitoring and optimization
  • Perform workload management
  • View historical query and system metrics
  • View alerts and dial-home information
  • View performance of the Greenplum DCA