Greenplum Technologies

Technology Overview

Greenplum Unified Analytics Platform (UAP)

Greenplum Unified Analytics Platform (UAP) is a unified platform enabling agile Big Data Analytics by empowering data science teams to analyze structured and unstructured data in a unified platform. Greenplum UAP delivers a simple entry point for organizations seeking business agility through high-performance data analytics, enabling them to become data savvy, to develop a data-centric culture and to evolve into data-driven companies.

The Greenplum UAP solution integrates Greenplum Database, an analytics optimized relational database, Pivotal HD, the world's most powerful Hadoop distribution, Greenplum Chorus, an analytics collaboration platform, Greenplum DCA, a flexible appliance for hosting the Greenplum UAP and Administration Tools for managing all components. The UAP solution is augmented by the talent and services of the Greenplum Data Scientist team. The diagram below highlights the Greenplum products in UAP and explains how all parts work together to enable agile Big Data Analytics.

Hover over the UAP diagram with your mouse to learn about the technologies behind Greenplum products.

Greenplum Database

Greenplum Database™

Greenplum Database (GP-DB) is an extensible relational database platform that uses a shared-nothing, massively parallel processing (MPP) architecture built atop commodity hardware to vastly accelerate analytical processing.

Most of today’s general-purpose relational databases (e.g. Oracle, Microsoft SQL Server) originated as Online Transaction Processing (OLTP) systems. Their 'shared-disk' or 'shared-everything' architectures are optimized for high transaction rates at the expense of analytical query performance and concurrency.

Greenplum Database’s shared-nothing MPP architecture provides every segment with an independent high-bandwidth connection to dedicated storage. The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictates. The degree of parallelism and overall scalability that this allows far exceeds general-purpose database systems.

By transparently distributing data and work across multiple 'segment' servers, GP-DB executes mathematically intensive analytical queries “close to the data” with performance that scales linearly with the number of segment servers.

Analytical capabilities are extensible, through MADlib, an open-source library of statistical and mathematical algorithms for statistical functions correlation, segmentation and machine learning, or through custom code extensions provided by users. Workload management facilities balance concurrent queries, while automatic failover provides high availability by taking advantage redundant hardware, providing an enterprise-capable analytics platform.

As data sets increasingly of include unstructured data – text, logs, images or sound, GP-DB provides fast, parallel integration with Hadoop to enable “co-processing”– analysis of both structured and unstructured data within a unified analytics platform or UAP.

Greenplum Database is available in software or pre-installed on a Greenplum DCA.

gNet™ Software Interconnect

In shared-nothing MPP database systems, data often needs to be moved whenever there is a join or an aggregation process for which the data requires repartitioning across the segments. As a result, the interconnect serves as one of the most critical components within Greenplum Database. Greenplum’s gNet interconnect optimizes the flow of data to allow continuous pipelining of processing without blocking on all nodes of the system. The gNet interconnect is tuned and optimized to scale to 10,000s of processors and leverages commodity Gigabit Ethernet and 10GigE switch technology.

At its core, the gNet software interconnect is a supercomputing-based ‘soft-switch’ that is responsible for efficiently pumping streams of data between motion nodes during query-plan execution. It delivers messages, moves data, collects results, and coordinates work among the segments in the system. It is infrastructure underpinning the execution of motion nodes that occur within parallel query plans on the Greenplum system.

Within the execution of each node in the query plan, multiple relational operations are processed by pipelining. For example, while a table scan is taking place, rows selected can be pipelined into a join process. Pipelining is the ability to begin a task before its predecessor task has completed, and this ability is key to increasing basic query parallelism. Greenplum Database utilizes pipelining whenever possible to ensure the highest-possible performance.

Greenplum MapReduce

The Power of Parallel Computing for Large-Scale Data Warehousing and Analytics

MapReduce has been proven as a technique for high-scale data analysis by Internet leaders such as Google and Yahoo. Greenplum gives enterprises the best of both worlds – MapReduce for programmers and SQL for DBAs – and will execute both MapReduce and SQL directly within Greenplum’s parallel dataflow engine, which is at the heart of the Greenplum Database.

Greenplum MapReduce enables programmers to run analytics against petabyte-scale datasets stored in and outside of the Greenplum Database. Greenplum MapReduce brings the benefits of a growing standard programming model to the reliability and familiarity of the relational database. The new capability expands the Greenplum Database to support MapReduce programs.

“Greenplum has been mastering the use of a parallel dataflow engine as the heart of our core product, Greenplum Database. With several years of experience developing and deploying Greenplum Database at companies large and small, we know how to build highly scalable data solutions. Adding MapReduce means our customers will be able to use a leading-edge new technology on a stable, reliable foundation.”
Luke Lonergan, CTO, Greenplum

Key Driver: Need For Petabyte-Scale Data Analytics

Greenplum customers have been involved in an early-access program utilizing Greenplum MapReduce for advanced analytics. For example, LinkedIn is using Greenplum Database for new innovative social networking features such as “People You May Know” and is evaluating Greenplum MapReduce as a way to develop compelling analytics products faster. A primary benefit of the new capability is that customers can combine SQL queries and MapReduce programs into unified tasks that are executed in parallel across hundreds or thousands of cores.

“The integration of MapReduce into Greenplum Database creates new ways to manage our text analysis efforts. What previously would require us to take data out of the database or write complex SQL queries can now be simplified into a few lines of code.”
Roger Magoulas, Research Director, O’Reilly Media

"The most exciting aspect of MapReduce is the excitement it is generating. It's attracting talented programmers -- many of whom don't want to buy or use SQL databases -- and enabling them to wrangle enormous data sets without leaving their familiar programming paradigms. Any movement that brings that much compute power to a larger talent base has the potential to produce game-changing results."
Joe Hellerstein, Professor, UC Berkeley

Key Features of Greenplum MapReduce
  • Combine SQL & MapReduce
  • Process any type of data wherever it lives
  • Enterprise-level integration and support

Parallel Query Optimizer

Greenplum Database's parallel query optimizer is responsible for converting SQL or MapReduce into a physical execution plan. It does this using a cost-based optimization algorithm in which it evaluates a vast number of potential plans and selects the one that it believes will lead to the most efficient query execution.

Unlike a traditional query optimizer, Greenplum's optimizer takes a global view of execution across the cluster, and factors in the cost of moving data between nodes in any candidate plan. The benefit of this 'global' query planning approach is that it can use global knowledge and statistical estimates to build an optimal plan once and ensure all nodes execute it in a fully coordinated fashion. This leads to far more predictable results than the alternative approach of 'SQL pushing' snippets that must be re-planned at each node.

The resulting query plans contain traditional physical operations - scans, joins, sorts, aggregations, etc. - as well as parallel 'motion' operations that describe when and how data should be transferred between nodes during query execution.

Greenplum Database has three kinds of 'motion' operations that may be found in a query plan:

  • Broadcast Motion (N:N) - Every segment sends the target data to all other segments
  • Redistribute Motion (N:N) - Every segment rehashes the target data (by join column) and redistributes each row to the appropriate segment
  • Gather Motion (N:1) - Every segment sends the target data to a single node (usually the master)

Here is an example SQL statement, and the resulting physical execution plan containing 'motion' operations:

Multi-Level Fault Tolerance

Greenplum Database is architected to have no single point of failure. Internally the system utilizes log shipping and segment-level replication to achieve redundancy, and provides automated failover.

The system provides multiple levels of redundancy and integrity checking. At the lowest level, Greenplum Database utilizes RAID-0+1 or RAID-5 storage to detect and mask disk failures. At the system level, Greenplum continuously replicates all segment and master data to other nodes within the system to ensure that the loss of a machine will not impact the overall database availability. Greenplum also utilizes redundant network interfaces on all systems, and specifies redundant switches in all reference configurations.

Figure: Segment Configurations – 4 Servers, 3 Active Segments per Server

Figure: 1 Server Failure – 3 Servers, 4 Active Segments per Server

The result is a system that reliability requirements of some of the most mission critical operations in the world.

Polymorphic Data Storage™

Traditionally relational data has been stored in rows - i.e. as a sequence of tuples in which all the columns of each tuple are stored together on disk. This has a long heritage back to early OLTP systems that introduced the 'slotted page' layout that is still in common use today. However analytical databases tend to have different access patterns than OLTP systems. Instead of seeing many single-row reads and writes, analytical databases must process larger more complex queries that touch much larger volumes of data - i.e. read-mostly with big scanning reads and infrequent batch appends of data.

Vendors have taken a number of different approaches to meeting these needs. Some have optimized their disk layouts to eliminate the OLTP fat and do smarter disk scans. Others have turned their storage sideways (literally) with column-stores - i.e. the decades old idea of 'vertical decomposition' demonstrated to good success by Sybase IQ, and now reimagined by a raft of newer vendors. Each of these approaches has proven to have sweet-spots where they shine, and others where they do a less admirable job.

Rather than advocating for one approach or the other, we've built in the flexibility so that customers can choose the right strategy for the job at hand. We call this Polymorphic Data Storage™. For each table (or partition of a table), the DBA can select the storage, execution and compression settings that suit the way that table will be accessed. With Polymorphic Data Storage™, the database transparently abstracts the details of any table or partition, allowing a wide variety of underlying models:

  • Read/Write Optimized — Traditional 'slotted page' row-oriented table (based on PostgreSQL's native table type), optimized for fine-grained CRUD operations.
  • Row-Oriented / Read-Mostly Optimized -- Optimized for read-mostly scans and bulk append loads. DDL allows optional compression ranging from fast/light to deep/archival.
  • Column-Oriented / Read-Mostly Optimized — Added as a feature in Greenplum's latest 3.3.4 release, providing a true column-store just by specifying 'WITH (orientation=column)' on a table. Data is vertically partitioned, and each column is stored in a series of large densely-packed blocks that can be efficiently compressed from fast/light to deep/archival (and tend to see notably higher compression ratios than row-oriented tables). Performance is excellent for those workloads suited to column-store — Greenplum's implementation only scans those columns required by the query, doesn't have the overhead of per-tuple IDs, and does efficient early materialization using an optimized 'columnar append' operator.

Greenplum's Polymorphic Data Storage™ really shines when combined with Greenplum's multi-level table partitioning. With Polymorphic Data Storage™, customers can tune the storage types and compression settings of different partitions within the same table. i.e. a single partitioned table could (for example) have older data stored as 'column-oriented with deep/archival compression', more recent data as 'column-oriented with fast/light compression', and the most recent data as 'read/write optimized' to support fast updates and deletes.

Parallel Dataflow Engine

At the heart of the Greenplum Database is the Parallel Dataflow Engine. This is where the real work of processing and analyzing data is done. The Parallel Dataflow Engine is an optimized parallel processing infrastructure that is designed to process data as it flows from disk, from external files or applications, or from other segments over the gNet interconnect. The engine is inherently parallel – its spans all segments of a Greenplum Database cluster and can scale effectively to 1000s of commodity processing cores.

The engine was designed based on supercomputing principles, with the idea that large volumes of data have ‘weight’ (i.e. aren’t easily moved around) and so processing should be pushed as close as possible to the data. In the Greenplum architecture this coupling is extremely efficient, with massive I/O bandwidth directly to and from the engine on each segment. The result is that a wide variety of complex processing can be pushed down as close as possible to the data for maximum processing efficiency and incredible expressiveness.

Greenplum’s Parallel Dataflow Engine is highly optimized at executing both SQL and MapReduce, and does so in a massively parallel manner. It has the ability to directly execute all necessary SQL building blocks, including performance-critical operations such as hash-join, multi-stage hash-aggregation, SQL 2003 windowing (which is part of the SQL 2003 OLAP extensions that are implemented by Greenplum), and arbitrary MapReduce programs.

MPP Scatter/Gather Streaming™ Technology

Greenplum's new MPP Scatter/Gather Streaming™ (SG Streaming™) technology eliminates the bottlenecks associated with other approaches to data loading, enabling lightning-fast flow of data into the Greenplum Database for large-scale analytics and data warehousing. Greenplum customers are achieving production-loading speeds of over four terabytes per hour with negligible impact on concurrent database operations.

Scatter/Gather Streaming:

  • Manages the flow of data into all nodes of the database
  • Does not require additional software or systems
  • Takes advantage of the same Parallel Dataflow Engine nodes in Greenplum Database

Greenplum utilizes a 'parallel-everywhere' approach to loading in which data flows from one or more source systems to every node of the database without any sequential choke points. This differs from traditional “bulk loading” technologies, used by most mainstream database and MPP appliance vendors that push data from a single source, often over a single or small number of parallel channels, and result in fundamental bottlenecks and ever-increasing load times. Greenplum's approach also avoids the need for a 'loader' tier of servers, as required by some other MPP database vendors that can add significant complexity and cost while effectively bottlenecking the bandwidth and parallelism of communication into the database.

Greenplum’s SG Streaming™ technology ensures parallelism by 'scattering' data from all source systems across 100s or 1000s of parallel streams that simultaneously flow to all nodes of the Greenplum Database. Performance scales with the number of Greenplum Database nodes, and the technology supports both large batch and continuous near-real-time loading patterns with negligible impact on concurrent database operations. Data can be transformed and processed in-flight, utilizing all nodes of the database in parallel, for extremely high-performance ELT (extract-load-transform) and ETLT (extract-transform-load-transform) loading pipelines. Final 'gathering' and storage of data to disk takes place on all nodes simultaneously, with data automatically partitioned across nodes and optionally compressed. This technology is exposed to the DBA via a flexible and programmable "external table" interface and a traditional command-line loading interface.

Pivotal HD

Pivotal HD Enterprise

Pivotal HD Enterprise is a commercially-supported, enterprise-capable distribution of the Apache Hadoop stack. It includes Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, Hbase, Zookeeper, Yarn and Mahout. Running Pivotal HD Enterprise’s commercial Hadoop distribution on a Greenplum DCA removes the pain associated with building out, debugging, and monitoring Hadoop clusters from scratch, as required by other distributions. Greenplum has also incorporated a pluggable storage layer into Pivotal HD, enabling customers to choose from best-of-breed storage options — including traditional attached storage and high-performance scale-out NAS products such as EMC Isilon OneFS — without requiring changes to existing applications.

Advanced Database Services


Native SQL for Hadoop

Pivotal Advanced Database Services (ADS) integrates the industry’s first native, mature Massively Parallel Processing (MPP) SQL query processor with Apache Hadoop. ADS leverages your existing SQL-capable BI, ETL and analytics tools, along with your workforce’s SQL skills, simplifying Hadoop-based data analytics development, increasing team productivity, and cutting costs.

Benefits Include

  • Unprecedented Query Processing Performance
  • Delivers 100X improvement in query performance
  • Enables true, interactive, and deep SQL processing
  • Offers powerful analytics
  • Cost-based Query Optimization

    Leveraging ten years of innovation in MPP SQL-based analytics, ADS’s cost-based query optimizer delivers unmatched query optimization that decisively outperforms less mature SQL-based or SQL-like Hadoop alternatives.

    Xtension Framework: Hadoop Data from SQL

    ADS integrates directly with data sets already stored in Hadoop HDFS via MapReduce, Hbase or Hive, enabling direct SQL access to Hadoop data, cutting latency and reducing the need for duplication or movement of data.

    Advanced Analytics Functions:

    ADS includes a library of parallelized analytics algorithms to speed analytics development and execution. With ADS, users can apply the computational power of MPP and Hadoop to run compute-intensive statistical, mathematical, and machine learning calculations on HDFS data. Algorithms are built into the query processor as SQL commands, acting directly on data stored in Hadoop HDFS. Compared to traditional approaches, ADS algorithms can often accelerate analytical computation by orders of magnitude.


    Figure: Pivotal HD Software Architecture

    Hadoop Components


    Hadoop Distributed File System (HDFS)

    HDFS is a Java-based file system that provides scalable and reliable data storage. With industry installation in the thousands of nodes, HDFS has proven to be a solid foundation for any Hadoop deployment. The current HDFS version is 0.20.205.

    MapReduce

    MapReduce is a Hadoop framework for easily writing applications that process large amounts of unstructured and structured data in parallel, in a reliable and fault-tolerant manner. The framework is resilient to hardware failures, handling them transparently from user applications. The current MapReduce version is 0.20.205.


    Figure: HDFS and MapReduce in Greenplum HD

    Hive

    Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. This SQL-like interface gives users a row-based storage capability, which, along with compression, results in an improved compression ratio for storing data. The current Hive version is 0.7.1.

    Pig

    Pig is the procedural language for processing large, semi-structured data sets using the Hadoop MapReduce platform. It enables developers to more easily write MapReduce jobs by providing an alternative programming language to Java. The current Pig version is 0.9.1.

    HBase

    HBase is a distributed, versioned, column-oriented storage platform that provides random real-time read/write access to Big Data for user applications. The current HBase version is 0.90.4.

    Zookeeper

    Zookeeper is a highly available system for coordinating distributed processes. Distributed applications use Zookeeper to store and mediate updates to key configuration information. The current Zookeeper version is 3.3.3.

    High-Performance gNet for Hadoop

    Greenplum Database enables high performance parallel import and export of compressed and uncompressed data from Hadoop clusters using gNet for Hadoop, a parallel communications transport with the industry's first direct query interoperability between Greenplum Database nodes and corresponding Hadoop nodes. To further streamline resource consumption during load times, custom-format data (binary, Pig, Hive, etc.) in Hadoop can be converted to GPDB Format via MapReduce, and then imported into Greenplum Database. This is a high-speed direct integration option that provides an efficient and flexible data exchange between Greenplum Database and Hadoop. gNet for Hadoop is available for both Greenplum HD and Greenplum MR.

    Plug-able Storage

    Greenplum HD is a 100% open source certified version of Apache Hadoop based off the latest stable release out of the Apache Software Foundation. Greenplum HD brings a pluggable storage layer to Hadoop allowing enterprises to leverage the best of breed storage options with no changes to existing Hadoop applications. In version 1.1, Greenplum HD adds support for Isilon OneFS storage in addition to the Hadoop Distributed File System (HDFS). Backed by the world's largest Hadoop support organization and tested at scale in Greenplum's 1,000 node Analytics Workbench, Greenplum HD provides a complete platform, including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution. The platform is designed to allow enterprises to focus more on analyzing their business rather than struggling with the technical complexities of configuring and managing a Hadoop cluster.


    Figure: Pivotal HD Software Architecture

    Greenplum Chorus

    The Analytics Cycle

    The effectiveness of Big Data Analytics lies not only in the speeds and feeds of the data warehouse or the analytical tools, but also in the agility of the corporate business process, the ability to iterate the analysis quickly on a full set of data, and the ability to collaborate with the entire data science team.

    Legacy analytics process requires the analyst to go through the steps of weeding through the data warehouse to find the data, obtain access to the data from all different data owners, scheduling time with data owners to learn about the data structure, request sandbox access from corporate IT, move data to sandbox, analyze the data, and finally operationalize the model. The process is extremely time-intensive and often yields stale and incomplete data set, which in turn jeopardize the accuracy of the model. Additionally, the legacy process is also very siloed to the analyst’s environment, with little room for collaborative efforts.

    Figure: Legacy analytics process

    Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from both within the organization and external sources. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data.

    Figure: Analytics cycle with Greenplum Chorus

    Using this interface, data scientists can save time and increase their productivity as they navigate the processes involved in big data analytics.

    1. Finding data. Because Greenplum Chorus automatically indexes data, data scientists and analysts can now more easily explore data stores across the enterprise. They can quickly identify datasets regardless of location, browse schemas and Hadoop files, and preview the data for instant understanding.

      Greenplum Chorus’ comprehensive search function allows analysts to search all data within Chorus—including comments, SQL, Hadoop, and Greenplum Database. They can probe the data warehouse and come up with all of the insight that other analysts have generated off of it, producing a list of imports and datasets complete with the comments and questions associated with each.

    2. Getting access to data. In the past, data scientists who wanted access to data needed to deploy and operate a new Massively Parallel Processing (MPP) data warehouse—a time-consuming process. But agile big data practices need to move at a much faster pace.

      Using Greenplum Chorus, IT staff and database administrators can establish pools of commodity servers and storage ahead of demand, allowing them to create new database instances and sandboxes in minutes with just a few clicks. There is no longer a need for analysts and data scientists to file a ticket with IT and wait days or even weeks for a resolution.

    3. Learning about data. As they work to understand datasets, most data scientists must export data to a local desktop, import it into R or a similar tool, and then create a visualization. Greenplum Chorus accelerates insight and understanding by providing data science teams with a rapid visual representation of information.

      Greenplum Chorus supports histograms, frequency, heat-map, time series, and box plot charts for on-demand visualizations. As a precursor to more sophisticated tools, this functionality enables the simple, ad hoc reporting that often serves as a starting point for innovation.

    4. Moving data to a sandbox. Providing data analysts with their own databases and sandboxes is important. But it’s essential that analysts be able to move their data fluidly into a place where they can work on it and combine it with other data of interest.

      With Greenplum Chorus’ self-service provisioning, users can provision new sandboxes as a standalone Greenplum Database on virtual infrastructure provided by VMware’s vFabric infrastructure. Alternatively, they can instantiate new sandbox schemas on the fly within existing Greenplum Database instances.

      Once a sandbox goes live, Greenplum Chorus lets analysts browse for interesting datasets among all those that they have permission to access. Regardless of where the data “lives,” they can request that all or a slice of it be populated into their Chorus sandbox. This applies not only to data in Greenplum Database instances within Chorus, but also sources such as Oracle, Microsoft SQL Server, and Teradata, as well as file sources and web services.

      Throughout these processes, Greenplum Chorus will schedule and manage the flow of data, track dependencies, and update each data scientist’s copy whenever the source database gets updated with new data. At the conclusion of a project, IT can return computing resources to the pool for use in future projects.

    5. Analyzing data. Once data scientists have imported data into their Greenplum Chorus workspaces, they can analyze it and share the results as a new dataset that other analysts can discover and access. They can also invite colleagues to the project and solicit their input.

      The data science teams of today increasingly include business users as well as data scientists. Greenplum Chorus provides a data wizard that lets business users analyze datasets without the need to remember complex syntax and write SQL queries.

      As the data science team works, anyone associated with the workspace—including executives and managers—can monitor progress, rather than simply waiting for results at the end of the project. Meanwhile, database administrators will be able to see a list of sandboxes that are using any instance of data, giving them important context as to the business value of any dataset.

    6. Publishing results. After hours of hard work, too many analysts publish their valuable insights only to a limited audience via email. These insights will remain invisible to searches, and will be difficult to leverage in future analysis projects.

      Greenplum Chorus provides rich social network features that facilitate participation and collaboration among data analysts, IT, DBAs, executives, and other stakeholders. As a result, the entire data science team can bring their datasets to the table and collaboratively discover, share, and discuss insights that have a meaningful impact on the business. For example, the Create Insight feature makes it easy to publish an insight across Chorus. An analyst can share any insight directly with team members, who can then make comments, brainstorm about overlooked possibilities—and generate new questions of their own.

    Insights Sharing

    Greenplum Chorus breaks down the silos across the enterprise by replacing the backlog of email with a single interface for all your organization’s data together with virtual databases for exploration and innovation, and social collaboration for insight and analysis. Greenplum Chorus provides rich social network features that revolve around datasets, insights and other key Chorus components, allowing data analysts, IT, DBAs, executives and other stakeholders to all participate and collaborate in the same environment. The result is that the data science team can bring ideas to the table and collaboratively discover, share and discuss insights that have a meaningful impact to the business.

    User/Access Management and Data Instance

    Your Data Science group should not have to worry about where data is stored. Regardless whether data is structured or unstructured, it should be easy to preview, move, and analyze.

    In Chorus, a data instance is either a data source or a data destination. Data from Greenplum Database or Greenplum HD’s HDFS (Hadoop Distributed File System) are connected natively to Chorus through the instance setup. Chorus can also access data in existing non-Greenplum systems through the use of Greenplum Database external tables.

    User and Access Management

    Chorus supports direct integration to corporate LDAP or Microsoft Active Directory for user and password management. This eliminates the need for user and password management.

    When it comes to data access control, Chorus automatically takes into account database permissions. Users only see the data that they have access to and only perform the functions that administrators have enabled.

    Data Instances

    Registering Greenplum Database or Hadoop instances in Chorus is straightforward. User can work with data easily whether it is stored in the database or Hadoop. This means that Chorus provides a collaborative platform for the entire analytics workflow. For example, using Chorus, data engineers and analysts can share the code for MapReduce jobs and view the results of those jobs as structured files within Hadoop, or imported the result into the database as structured data for further analysis.

    Figure: Instance details and usage

    Chorus is deeply integrated with the Greenplum Database. It understands the structure of tables, columns, and other database objects. It exposes useful information to analysts who need to understand the nature of the data as they explore the data cloud.

    The Greenplum Database instance registration is done using:

    • Shared-account – Multiple Chorus users can share one account to gain connectivity and data access. Manual role-based access control can also be configured through multiple shared-account instance setups, with appropriate data grants done in the data source. Example: The Sales-Ops Database Instance can be registered with a database account with access to only sales-ops information. At the same time, the Marketing Database Instance can be registered with an account with access to only marketing information, while both instances point to the same database.
    • Non-shared account – If an instance is connected to the data source through a non-shared account, the Chorus user will be prompted for data source account/password upon connection.

    Additionally, Chorus can provision and create a new stand alone Greenplum Database on VMWare virtualized hardware cloud, by allowing the Chorus Administrator to specify the parameters of hardware (CPU, memory, etc.)

    Figure: Registration or creation of data instance

    Data Exploration, Search, and Data Dictionary

    Chorus provides an easy way to navigate and search all data in your environment (including the Greenplum Databases, Greenplum HD, indexes, codes, comments, users, etc.) so that data analysts have access to data in a collaborative environment. Chorus encourages users to explore and visualize the data, to create new datasets, and to add documents and comments. This ensures that the sources of data are enriched as part of the everyday work of your data science teams. Chorus then automatically indexes all of this enriched metadata so that users can easily search for anything within your Enterprise Data Cloud.

    Data Exploration

    With Chorus, analysts can browse and explore datasets across the enterprise. Once imported into the workspace, Chorus will manage the flow of data, and will track the dependency and update their copy when the source is updated with new data. Analysts can operate on the data, and can share the results of their analysis as a new dataset that can be discovered and accessed by other analysts.

    Data Search

    Greenplum Chorus’ global search function allows analysts to search all data within Chorus—including comments, SQL, Hadoop, and Greenplum Database. They can probe the data warehouse and come up with all of the insight that other analysts have generated off of it, producing a list of imports and datasets complete with the comments and questions associated with each.

    Figure: Results form a global search

    Data Dictionary

    Gone are the days of hunting through email or shared drives for that comment or piece of code. Instead, everything is tracked and available through the Greenplum Chorus interface. Chorus delivers federated search across data assets anywhere in the enterprise. Chorus indexes all metadata, comments, SQL and data assets to create a living data dictionary available in the form of a search prompt.

    Workspaces

    A workspace is the place where Chorus users collaborate on an analytics project. It is request-able to Chorus administrator, and can either be public (all users have read access) or private (only members have access).

    Within each workspace, there can have workspace summery tab, sandbox tab, data sources tab, and work files tab.

    Summary

    This is a brief description of the workspace, including workspace-specific insights. It will also include recent Activity stream pertaining to this workspace.

    Sandbox

    In each given workspace, there should only be one sandbox. This is where Chorus users will import or work with the data. For more details, refer to Sandbox section.

    Data Sources

    This is where Chorus users can search, browse, visualize, and create data sets, which is then imported and used in the sandbox.

    Figure: Sample data from the data source

    Work Files

    All content used in the workspace can be shared and version controlled through the Work Files tab. This includes documents, presentations, or special file types (ex: SQL). SQL work file allows user to edit SQL against the data in the sandbox. This allows users to do analytics directly in DB or use R in DB. Additionally, this allows user to know what the data set looks like and create a data set using the same SQL.

    Sandbox

    Each workspace contains one sandbox, which is equivalent of a schema in Greenplum Database. A sandbox can contain data from more than one source (Greenplum Database, HDFS, and external sources) by importing all data and perform joints to provide structure to the data.

    Self-Service Provisioning of Sandbox

    With Chorus, the data science team can create new sandboxes with a few simple clicks. There is no longer a need to file a ticket with IT and wait hours or weeks for the solution. With self-service provisioning, users can provision new sandboxes as a stand alone Greenplum Database on virtual infrastructure provided by VMware’s vFabric. Alternatively, the users can instantiate new sandbox schemas on the fly within existing Greenplum Database instances.

    Data Import from External Data Sources

    Chorus dramatically simplifies the process of importing data into the analyst sandbox. Users can upload a delimited file (ex: CSV, tab separated, etc.) directly from their desktops, and Chorus will automatically suggest the correct table structure for the data in the file. Users can then further refine the structure and then share the resulting dataset with other Chorus users. Additionally, Chorus also integrates to existing system through a REST API. The data import can be scheduled at an interval for automatic import or manually refreshed.

    On-Demand Visualization

    Chorus provides the visualization options for data preview to speed insight and understanding of data. Chorus supports histograms, frequency, heat-map, time series and box plot charts for on-demand visualizations. By providing the data science team a rapid visual representation of information, there is no longer a need to export the data to a local desktop, import into R or another analytics tool, and then create a visualization.

    Greenplum DCA


    Greenplum Data Computing Appliance

    The EMC® Greenplum® Data Computing Appliance (DCA) is a unified Big Data analytics appliance — a modular and flexible solution for analyzing structured and unstructured data. Their modular design permits addition of Data Integration Accelerator modules to support installation of software applications from Greenplum partner companies that increase capability while cut costs and improving overall performance.

    DCAs offer the power of a massively parallel processing (MPP) architecture without the constraints and complexity of proprietary hardware continuing Greenplum’s history of delivering the fastest data loading rates and the best price/performance ratios in the industry.

    DCAs also offer flexibility and scalability not usually found among analytics appliances. By adding additional modules, enterprises can grow storage or computational capacity as needs grow, without replacing systems. By adding new module types, users can expand database and Hadoop DCAs to support both capabilities, plus additional partner capabilities including analytics accelerators, data visualization tools, data mining tools, BI platforms, reporting tools, ETL tools and data staging capabilities within the DCA.

    All DCA modules run within a common appliance framework, interconnected via a high-speed, low-latency interconnect and managed by a common administration framework.

    Greenplum DCA Modules

    Module Types

    Greenplum DCA modules are available to serve three roles, and each includes servers, memory, storage, management software and platform software.

    Greenplum Database Modules:

    Database modules are optimized for execution analytics and BI on big data using SQL and embedded analytical algorithms. Two configurations are available:

    The Greenplum Database Standard Module

    The Greenplum Database Standard Module is a purpose-built, highly scalable data-analytics appliance module that architecturally integrates database, computing, storage, and network into an enterprise-class, easy-to-implement system. This module is the industry leader in price and performance.

    Greenplum Database Compute Module

    The Greenplum Database Compute Module provides a cost-effective way to support compute-intensive analytics on modest data sets. Where needs include support for computationally-intensive workloads – such as data mining, exploration etc., or high user counts, the high performance and relative modest storage capacity of the Database Compute module may be a good fit.

    Greenplum Database Modules Hardware Configurations

    Module Type

    Database Standard Module

    Database Compute Module

    Size / Rack Configuration

    8U –  4 modules per rack

    Number of Servers

    4 per module

    Total Number of Cores

    64 per module

    Total Memory

    256GB per module

    Number of Storage Drives

    96 per module

    Storage Type

    900 GB SAS 10K

    300 GB SAS 10K

    Usable Capacity - User Data

    110 TB per module

    36 TB per module

    Usable Capacity – Compressed

    27.5 TB per module

    9TB per module

    Scan Rate

    40 GB/sec. per full rack

    Data Load Rate

    16 TB/hr. per full rack

    Both database module types connect to the high-speed interconnect in the DCA as well as Command Center and the administration framework of the DCA.

    Greenplum HD Modules

    The Greenplum HD Modules provide a platform for high-performance data co-processing on structured and unstructured data. Preinstalled with Greenplum HD, commercially supported distribution of Hadoop, HD modules can either be integrated into an all-Hadoop appliance or be integrated with Greenplum Database modules, sharing data with Greenplum Database to enable co-processing of structured and unstructured data within a single, seamless solution.

    HD Modules are available in two forms, one with direct-attach storage (disks in the module) and without native storage for integration with external HDFS storage. This allows Hadoop users choices between best-of-breed storage solutions. As with other DCA modules, HD modules are fully administered and monitored from DCA administration tools.

    Greenplum HD Server Module

    HD Server Modules combine computational resources for MapReduce with SATA-based high-capacity directly-attached storage for Hadoop Distributed File System (HDFS).

    Greenplum HD Compute Module

    HD Compute Modules provide computational resources for MapReduce without direct-attached storage. Leveraging Greenplum HD’s pluggable storage capability, HD Compute Modules are 100% Hadoop compatible, but map HDFS storage to an EMC Isilon OneFS scale-out NAS integrated with the DCA. Offering far greater compression and additional redundancy over Apache HDFS, HD Compute Modules when used with Isilon provide an enterprise-class alternative for users with stringent data protection requirements. HD Compute Modules also enable users to scale compute resources independently from storage resources.

    Greenplum HD Modules Hardware Configurations

    Module Type

    HD Server Module

    HD Compute Module

    Size / Rack Configuration

    8U –  4 Modules /Rack

    2U – Up to 11 Modules/Per Rack 1

    Number of Servers

    4 per module

    2 per module

    Total Number of Cores

    64 per module

    32 per module

    Total Memory

    256GB per module

    128GB per module

    Number of Data Storage Drives

    48 per module

    External 2

    Storage Type

    3TB SATA 7200

    --

    HDFS Capacity - User Data (N=3 redundancy, no compression)

    144 TB/Rack

    --

    1 Rack Configuration Varies by Rack and Other Modules In the Rack
    2 Data Storage Requires Integration with EMC Isilon Scale-out NAS

    Both HD module types connect to the high-speed interconnect in the DCA, leverage the interconnect for parallel integration with database modules, and are manageable thorugh the DCA’s Command Center utilities and the administration infrastructure.

    Greenplum Data Integration Accelerator (DIA) Modules

    The Greenplum Data Integration Accelerator (DIA) Modules provide open platforms in the DCA for integrating partner analytics applications and data management applications into the Greenplum DCA. Consisting of complete Linux-based servers, DIAs reduce the cost for deploying additional tools into the DCA. They can be used to increase performance for large volume or low-latency ETL, file staging for parallel loading or unloading or ETL, or to host compute-intensive data analytics, machine learning or data visualization products.

    Unlike separate servers for third-party products, DIAs help control data center sprawl, reducing racks, power and cooling, simplifying scaling of third-party elements of the application, and extending the single-point of administration of DCA Command Center to the added servers.

    Available Data Integration Accelerator (DIA) Modules are described below:

    Module Type

    DIA

    DIA Storage Server

    Size / Rack Configuration

    2U – Up to 11 modules/per rack 1

    8U –  4 modules /rack

    Number of Servers

    2 per module

    4 per module

    Total Number of Cores

    32 per module

    64 per module

    Total Memory

    128GB per module

    256GB per module

    Number of Data Storage Drives

    12 per module

    48 per module

    Storage Type

    3TB SATA 7200

    300 GB SAS 10K

    Data Storage Capacity

    2.7TB per module

    Application Development2

    Operating System

    Red Hat RHEL 6

    1Rack Configuration Varies by Rack and Other Modules In the Rack
    2 Data Storage Capacity Depends on File System Used and Settings


    Physical Architecture and Configuration


    All DCAs begin with a single rack. Whether initially a database or Hadoop-based appliance, both can be expanded with modules to add the other. Generally, additions are in ¼ rack, or 8U (rack unit) increments, but two module types are smaller and more dense, with somewhat different racking. Your Greenplum team will help you determine the racking arrangement for your particular needs.

    Common to all DCAs are Master Servers, Interconnect and Administration Switches.

    Master Servers

    All DCAs have 2 Master Servers, one primary and one standby, and the two perform somewhat different roles in the Greenplum Database, mixed function (Database + Hadoop) DCAs and GP HD DCAs.

    Interconnect Bus
    The Interconnect Bus consists of software and hardware that work together to provide inter-process communication between master servers and segment servers as well as between segment servers themselves, over a 10GB/second Ethernet network.

    The Interconnect Bus runs on a private network and is not usually connected to public or customer networks except when needed for high-speed parallel loading or unloading. The Interconnect Bus extends to all modules in the DCA, providing high-speed connectivity to Hadoop, ETL servers or analytics products running on DCAs as well as to externally integrated environments including Data Domain and Isilon.

    The Interconnect Bus consists of two 52-port 10GB Ethernet Switches and necessary cabling to connect all modules, via bonded Ethernet for redundancy and higher bandwidth. To maximize redundancy, a primary segment and its corresponding mirror segment use different interconnect networks. With this configuration, the DCA can continue operations in the event of a single Interconnect Bus failure.

    Administrative Switching

    Present in all racks of all DCAs, administrative switches provide a means for separating administrative traffic from data interconnects. This assures that administrative traffic never delays data interconnect traffic, and vice versa. Redundant 1GB Ethernet switches are used, connecting all modules and their servers, to master servers, and all master servers to all racks in the DCA.

    Physical Architecture: Greenplum Database DCAs

    In Greenplum Database DCAs, the master database runs on the primary Master Server (replicated to the standby Master Server) and is the entry point into the Greenplum Database for end-users and applications. Through the master database, clients connect and submit SQL, MapReduce, and other query statements.

    The primary Master Server is also where the global system catalog resides, that is, the set of system tables that contain metadata about the entire Greenplum Database itself. The primary Master Server does not contain any user data, data resides only on the Database Modules. The primary Master Server performs the following operations:

    • Authenticates client connections
    • Processes the incoming SQL, MapReduce, and other query commands
    • Distributes the work load between the Segment Instances
    • Presents aggregated final results to the client program

    The primary Master Server mirrors logs to the Standby Master Server so that it is available to take over in the event of a failure. The standby Master Server is kept up to date by a process that synchronizes the write-ahead-log (WAL) from the primary to the standby.

    Database Modules and their Segment Servers

    Each Database Module provides 4 segment servers with CPUs, memory and directly-attached RAID storage. Within database modules are segment servers, that process database data.

    Master Nodes use cost-based optimization and workload management to determine how may queries to run at any given time to maximize throughput. The majority of query processing occurs within segment servers as they process database data to satisfy the queries. Multiple queries can be processed at any give time, depending on workload and query complexity.

    To process data, all data is divided up into segment instances, and each segment server is responsible for processing a set of segment instances for all queries on that affect that segment instance. All segment instances contain roughly equal-sized data “slices” that include data from each table and index in the database. In this way, queries are executed in parallel by every segment server acting on one or more segment instances.

    Existence of segment servers and segment instances is transparent to users, and they need not interact directly with them. All query requests are expressed to the master in SQL, with no knowledge that execution will be conducted in parallel. Behind the scenes, the master nodes decompose query logic, recompiling it into a set of actions for each segment server to execute on one or more segment instance handled by that segment server. For some queries, communications is required for movement of data between segment servers, and this is done transparently during the query using the high-speed interconnect (described below) between all nodes, including masters and segment servers. Once all segment servers complete their work, results are returned to the master node, marshaled into a single response and returned to the requesting client.

    Segment Instance Redundancy

    There are two types of segment instances: primary segment instances (primary segments) and mirror segment instances (mirror segments). Each primary segment instance has a corresponding mirror segment instance. A mirror segment always resides on a different host and subnet than its corresponding primary segment. This is to ensure that in case of a segment server failover, where a segment server is unreachable or down, the mirror counterpart of the primary instance and its data is still available on another segment server, and the query task is then executed there.

    Physical Architecture: Greenplum Hadoop DCAs

    Greenplum Hadoop DCAs are similar in physical architecture to Greenplum Database, with the exception that initial configurations are larger.

    Greenplum offers storage options with Hadoop that affect the layout of systems.

    Greenplum Hadoop Server-based DCAs support traditional Hadoop, with directly attached storage in the modules. These systems begin with two ¼ rack modules occupying the lower half of the first rack.

    Dedicated to administration, the first module hosts Hadoop-wide functions including JobTracker, TaskTracker, Zookeeper, HBase core functions and others. The second module added is the first data module, and runs MapReduce and HDFS tasks on the data stored within the module.

    Unique to Greenplum is an optional Hadoop DCA configuration called HD Compute. Recognizing that some customers have stringent data protection requirements, Greenplum HD Compute offers the option to “export” the HDFS functions to enterprise-hardened Isilon OneFS Network Attached Storage servers in adjacent racks.

    Use of Isilon NAS for HDFS storage also frees users to scale compute capacity and storage capacity independently– where file storage can be added as needs dictate, without paying for unneeded computational capacity, and vice versa, HD Compute nodes can be added without affecting Isilon NAS storage for HDFS.

    HD Compute DCAs begin with 4 Greenplum HD Compute Modules installed in the lower half of the first rack, with the possibility to add up to 10 modules in the first rack, creating a 20-node Hadoop cluster. As with HD Server modules, the first module is dedicated to core Hadoop functions including JobTracker, TaskTracker, Zookeeper, HDFS administration etc. The remaining three modules are MapReduce computational nodes.

    DCA Expansion Configurations

    Once initial configurations have been selected, any combination of database, module and data integration modules can be selected, subject to racking considerations. Users are free to intermix modules so long as all Database Modules are of the same type (Standard or Compute) and all HD Modules are of the same type (HD Server or HD Compute). DIA Modules can be intermixed as needed within a single DCA.

    High Availability, Backup/Restore, and Disaster Recovery


    High Availability

    The DCA includes high-availability features such as hardware and data redundancy for both Database and Hadoop and resource failover for disks, switches, servers and master nodes used in both.

    Master Server Redundancy

    The DCA includes a standby Master Server to serve as a warm standby backup in case the primary Master Server becomes inoperative. Should the primary Master Server fail, the standby is available to take over as the primary.

    For DCAs with Greenplum Database modules, the Master Servers act as primary and secondary database masters and are kept up to date by a transaction log replication process that keeps the primary and standby masters synchronized. Should the primary Master Server fail, the log replication process is ceases and the standby it automatically activated. Upon activation of the standby, the replicated logs are used to reconstruct the state of the primary Master Server at the time of the last successfully committed transaction.

    Similarly, for DCAs that host Hadoop nodes, Hadoop administration nodes are also redundant, providing failover of key modules in the Hadoop cluster such as the Jobtracker, Zookeeper and file system NameNode.

    Managing Redundancy in Greenplum Database

    In Greenplum Database and the Administration Node in Hadoop no user data is stored, only metadata and administrative information needed by the DCA. Changes to this information are mirrored between primary and secondary administration nodes.

    User Data Redundancy in Greenplum Database

    Information stored in Greenplum Database has multiple layers of protection. Data is stored in RAID-protected disk drive groups, with hot spare drives ready, should it be necessary to replace a primary disk. Data is stored in RAID-protected disk drive groups, with spare drive waiting should it be required to replace a primary disk.

    The drive arrays are connected to database segment processors via redundant connections. All data segments are mirrored on two segment processors.

    The system automatically fails over to the mirror segment whenever a primary segment becomes unavailable. A DCA can remain operational if a segment instance or segment server goes down, as long as all portions of data are available on the remaining active segments. Whenever the master cannot connect to a primary segment, it marks that primary segment as down in the Greenplum Database system catalog and brings up the corresponding mirror segment in its place. A failed segment can be recovered while the system is up and running. The recovery process only copies over the changes that were missed while the segment was out of operation.

    In the event of a segment failure, the file replication process is stopped and the mirror segment is automatically brought up as the active segment instance. All database operations then continue using the mirror. While the mirror is active, it is also logging all transactional changes made to the data. When the failed segment is ready to be brought back online, administrators initiate a recovery process to bring it back into operation.

    Backup/Restore of Greenplum Database



    Figure: Disaster Recovery using Data Domain

    Greenplum DCA backs up all user data in parallel directly to Data Domain, deduplicating the data using Data Domain Boost or “ddboost”. With this backup and restore method, ddboost dramatically reduces backup storage and backup network requirements, enabling the DCA to achieve effective backup rates up to 14TB/hour, and supports both full and partial replication, with 1-to-1, 1-to-many, and cascading topologies. For more information on backup and restore with Data Domain, please refer to: White Paper: Backup and Recovery of the EMC Greenplum Data Computing Appliance Using EMC Data Domain — An Architectural Overview

    Disaster Recovery

    Figure: SAN Mirror – Greenplum DCA to Symmetrix VMAX

    Symmetrix VMAX provides SAN mirroring disaster recovery for a Greenplum DCA environment by leveraging separately licensed TimeFinder/SnapShots and SRDF/Synchronous Replication for DR.

    This method relocates mirror copies of data segment instances to reside on an EMC Symmetrix VMAX, rougly doubling available primary storage, VMAX is not used for reads/queries unless there is a segment failure. For more information on disaster recovery using EMC Symmetrix VMAX, please refer to the whitepaper, EMC Greenplum Data Computing Appliance Using SAN Mirror and EMC Symmetrix VMAX for Disaster Recovery.

    Greenplum DCA Software

    Greenplum DCA Software configures the physical components installed in the Greenplum DCA. It facilitates updates, monitoring, history tracking and notification functions. System Administrators interact with Greenplum DCA Software primarily thorugh Command Center, web-based, comprehensive administration console for the DCA described below.

    During operation, if abnormal indications are detected, DCA Software can also provide proactive alerting. Alerts can be delivered via EMC Dial-Home functionality via SNMP Alerts, and directly to users via Greenplum Command Center.

    EMC Automated Dial-Home Functionality

    The DCA dials out to EMC for the following hardware-related events: single-disk failure, virtual disk or RAID degradation, issues with the RAID controller battery, PSU, power usage, memory, NIC, CPU, fan, 10 GB switch (fan and PSU), or SNMP daemon not working.

    The following software issues also trigger dial-home events: Greenplum Database unavailable, segment database down, master failover notification, core file dump detected, kernel crash, or disk space over limit. The DCA also dials home weekly to test its connection with EMC.

    The DCA Software also delivers weekly dial-home reports that report disk usage, license report, database table size/index/overall capacity, number of functional modules and types, health state of the database, percent of hard drive full for all modules.

    Hardware monitoring in the DCA software reports condition and status information for approximately 150 different parameters for database modules, Hadoop modules and DIA modules. These highly-detailed MIBs report items ranging from hardware identification to physical operating parameters such as power supply status, voltages and current, cooling device(s) status to warn of impending problems, I/O rates, CPU usage percentages, etc.

    Measured system parameters are available through Greenplum Command Center, through Simple Network Management Protocol (SNMP) reporting, and through EMC’s Call Home functionality, helping DCAs to be easily integrated into most data center management frameworks.

    Admin Tools

    Admin Tools

    Greenplum offers two solutions for administering the Greenplum UAP with Greenplum Command Center and MoreVRP.

    Command Center

    Greenplum Command Center provides a single administrative interface whereby administrators can collect performance metrics, and view various performance analyses through interactive dashboards.

    Greenplum Command Center has data collection agents that run Greenplum Database, Greenplum HD (Hadoop) and DIA modules that collect performance data, capacity data and physical hardware status and send the data to the Greenplum Command Center at regular intervals. For GP-DB, Command Center also collects and reports on the details of query execution and system utilization.

    Greenplum Command Center provides a graphical web-based user-interface through which administrators can:

    • View historical system metrics
    • View alerts and Dial-home information
    • View performance data for the Greenplum DCA
    • Enable SNMP reporting and alerting for DCA and its modules
    • Monitor database query performance, optimization and historical trends
    • Perform workload management
    • Monitor the physical health of the DCA

    Command Center enables administrators to monitor historical trends and manage system health for the entire Greenplum product line.

    MoreVRP

    MoreVRP, provides additional database-specific monitoring and optimization capabilities including:

    • Database Virtual Database Resource Partitioning
    • Detailed Real-Time Execution Monitoring
    • Query Execution Analytics with Drill-Down capabilities

    With MoreVRP, administrators can:

    • Optimize database performance
    • Manage Quality of Service
    • Track adherence to service level agreements (SLAs)
    • Reduce risks associated With new deployments

    Back to Top