INTRODUCING THE GREENPLUM MODULAR DATA COMPUTING APPLIANCE
The industry's first complete big data analytics appliance
Unified for Big Data Analytics
Advanced all-in-one appliance to deliver a fast loading, highly scalable, data co-processing platform for Big Data analytics.
Purpose-built modular appliance
Modular solution including Greenplum Database for structured data, Greenplum HD for unstructured data, and DIA Modules for partner applications.
Fast, Flexible Appliances for Big Data Analytics
Big Data analytics is a fast-changing field where future needs are difficult to predict. To be successful, users need flexible, cost-effective solutions that can grow and adapt in response to future requirements.
Results-Focused Outcome
Hand-building analytics platforms in-house can be risky as unanticipated testing, tuning, debugging and administration challenges delay delivery of results.
To overcome these risks and address urgency and cost-containment needs, pre-packaged analytics appliances have evolved to reduce deployment time and administrative overhead, but often at the cost of flexibility and future adaptability.
A Data Computing Appliance that Grows with Your Needs
The Greenplum Data Computing Appliance (DCA) is a multifunction analytics platform that supports structured and unstructured data analytics plus ETL, BI, machine learning and data visualization in a single adaptable, high-performance appliance. DCAs deliver the rapid deployment and administrative simplicity of an appliance while remaining flexible: users can tailor capabilities and capacity both initially and over the lifetime of the appliance.
Flexibility Through Modular Design
Unlike more rigid appliances, DCA users specify the capacity and capabilities of their appliance by selecting from a set of available DCA modules. Available modules support Greenplum Database for analysis of structured data, Greenplum HD for Apache Hadoop-based analysis of unstructured data, as well as a variety of data staging, ETL, BI and analytical tools from Greenplum partners.
Rapid Results
Greenplum DCAs are designed, optimized, tested, debugged, and delivered ready to run, shortening the time needed to deliver Big Data analytics to business stakeholders.
Modular Scalability Over Time
DCA users can specify modest initial configurations knowing that they can easily add capacity or new capabilities later. Once a DCA is installed, additional modules can be added to scale the computational capability and storage capacity of the appliance. Modules of new types can likewise be added to expand what the appliance can do, for example to support Hadoop processing or to provide additional ETL capability or capacity. Modular scalability is straightforward and does not require extended downtime or component replacement.
Keeping Big Data Available
DCAs mitigate risks posed by system outages. Redundant design throughout the appliance eliminates single points of failure, providing a dependable repository for mission critical information.
Big Data stored in DCAs can be further protected from data center disasters by integrating EMC storage products that enhance availability through replication to geographically diverse standby systems.
Regardless of number and type of modules used, a single administration tool manages all modules, and all modules are integrated using a high-bandwidth network interconnect.
Protecting Big Data from Prying Eyes
Big Data assets increasingly contain private data, requiring protection from prying eyes and malicious hands.
The Greenplum DCA is hardened against known vulnerabilities, and new threats are mitigated through ongoing software updates and patches.
The Greenplum Data Computing Appliance (DCA) Modules:
For SQL and Analytical Processing:
GREENPLUM DATABASE MODULES
Greenplum Database Modules offer industry-leading price-performance for SQL and analytical processing and are available in two densities. Database Standard Modules store 110+ TB of user data per ¼ rack module for data intensive applications. Database Compute Modules store up to 36 TB of user data per ¼ rack module, offering a lower price point for compute-intensive or user-intensive applications.
Each module contains multiple servers whose CPUs, memory, disk I/O, and interconnects are optimally balanced for Greenplum Database. From 1 to 48 modules can be integrated, yielding 36 TB to 5 PB of total user data capacity, and 64 to more than 3,000 CPU cores of total compute capacity.
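The capacity range above follows from simple per-module arithmetic. The sketch below is illustrative only: the per-module storage figures come from the text, while the figure of 64 cores per module is an assumption inferred from the quoted single-module minimum, and the function name `dca_capacity` is hypothetical.

```python
# Figures quoted in the text above.
STANDARD_TB_PER_MODULE = 110  # Database Standard Module ("110+ TB"), used as a floor
COMPUTE_TB_PER_MODULE = 36    # Database Compute Module ("up to 36 TB"), used as a ceiling
MAX_MODULES = 48

# Assumption: 64 cores per module, inferred from the 64-core figure
# quoted for the smallest (single-module) configuration.
CORES_PER_MODULE = 64


def dca_capacity(standard_modules: int, compute_modules: int) -> dict:
    """Estimate user-data capacity (TB) and core count for a mix of modules."""
    if standard_modules < 0 or compute_modules < 0:
        raise ValueError("module counts must be non-negative")
    total = standard_modules + compute_modules
    if not 1 <= total <= MAX_MODULES:
        raise ValueError(f"total modules must be between 1 and {MAX_MODULES}")
    user_data_tb = (
        standard_modules * STANDARD_TB_PER_MODULE
        + compute_modules * COMPUTE_TB_PER_MODULE
    )
    return {
        "modules": total,
        "user_data_tb": user_data_tb,
        "cpu_cores": total * CORES_PER_MODULE,
    }
```

Under these assumptions, a single Compute Module gives 36 TB and 64 cores, and 48 Standard Modules give 5,280 TB (about 5 PB) and 3,072 cores, which is consistent with the quoted range.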
Database Modules deliver industry-leading price performance while redundant servers, automatic failover and automatic sparing of storage devices assure availability and minimize downtime.
For Unstructured Data Processing in Hadoop:
GREENPLUM HD MODULES
Greenplum HD Modules bring Hadoop processing into the DCA independently, or integrated with Greenplum Database.
Available in two forms, HD Modules give users a choice of storage technology. Greenplum HD Modules support traditional Hadoop processing, where each module includes computation, interconnect and direct-attached storage.
Greenplum HD Compute Modules also support complete Hadoop-based processing, but include only small amounts of local storage. Instead, they transparently access HDFS data managed by EMC Isilon scale-out Network-Attached Storage (NAS) devices. Eliminating local storage enables HD Compute Modules to offer higher compute density per rack while leveraging the enterprise-class redundancy, availability and storage optimization technologies of EMC Isilon.
HD Compute Modules are integrated with the Greenplum Database and partner applications running on Data Integration Accelerator Modules via a high-speed interconnect throughout the DCA.
Both HD and HD Compute Modules run Greenplum HD, an Apache-compatible Hadoop distribution. Both can scale linearly from terabytes to petabytes of data, and both are easily administered as part of the Greenplum DCA.
For Partner ETL, staging, BI, analytics and data visualization:
GREENPLUM DATA INTEGRATION ACCELERATOR (DIA) MODULES
Greenplum Data Integration Accelerator (DIA) Modules host a variety of partner software products within the DCA, ranging from simple file systems to complex ETL, BI, analytics and data visualization solutions.
DIA configurations can range from small (2U) modules with modest resources, to high-capacity storage modules for data staging or ETL and large-memory modules for compute-intensive applications like data visualization and machine learning.
Integrating partner applications with the DCA using DIA Modules maximizes performance and manageability while reducing system cost and data center footprint.
Product Highlights
Extreme Performance with Elastic Scalability
Greenplum® Database and Greenplum HD are at the heart of the Greenplum Data Computing Appliance (DCA), offering both structured and unstructured data analytics. Both are built on a shared-nothing, massively parallel processing (MPP) architecture optimized for BI and analytical processing, and both achieve near-linear scalability: performance and capacity grow roughly in proportion to the number of nodes.
Move Computation To The Data
Another core principle of Greenplum software is to collocate processing and data. Doing so enables DCAs to process database queries and MapReduce tasks in a fully parallel manner, using multiple, node-by-node storage connections simultaneously to acquire data for analysis. Collocation of computation and storage yields the greatest possible performance for analytical processing, while alleviating computation and network bandwidth burdens on other data center infrastructures.
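The idea of moving computation to the data can be illustrated with a small shared-nothing aggregation. This is a conceptual sketch, not Greenplum's implementation: threads stand in for segment nodes, in-memory lists stand in for node-local storage, and the function names (`distribute`, `local_sum`, `parallel_sum`) are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, num_segments):
    """Assign each (key, value) row to exactly one segment by hashing its key."""
    segments = [[] for _ in range(num_segments)]
    for key, value in rows:
        segments[hash(key) % num_segments].append((key, value))
    return segments


def local_sum(segment_rows):
    """Runs 'on the segment': aggregates only the rows stored there."""
    totals = defaultdict(int)
    for key, value in segment_rows:
        totals[key] += value
    return dict(totals)


def parallel_sum(rows, num_segments=4):
    """Sum values per key on every segment in parallel, then merge the results."""
    segments = distribute(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(local_sum, segments))
    merged = {}
    for partial in partials:
        # Hash distribution places every row for a given key on one segment,
        # so the partial results never overlap and can simply be combined.
        merged.update(partial)
    return merged
```

Only the small per-segment totals travel to the coordinator; the raw rows never leave the segment that stores them, which is the property the paragraph above describes.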
Modular Design and Integration Facilitates Data Co-Processing
Growth of unstructured data leads many users to seek a combined solution for processing structured and unstructured data. In this scenario, unstructured data is processed in Hadoop, and resulting extracted structured data is merged with database data for further analysis. The Greenplum DCA provides fast parallel connections between Database and Hadoop Modules, enabling co-processing of varied data types. Greenplum Database transparently and efficiently queries data in both the database and Hadoop without duplication.
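The co-processing flow described above (extract structure in Hadoop, then merge with database data) can be sketched in miniature. Everything here is an illustrative assumption: the log format, the field names, and the two functions are hypothetical stand-ins for a Hadoop job and a database join, not part of any Greenplum API.

```python
import re

# Hypothetical log format, chosen only for illustration.
LOG_PATTERN = re.compile(r"user=(?P<user_id>\d+) action=(?P<action>\w+)")


def extract_events(lines):
    """Stand-in for the Hadoop step: pull structured fields out of raw text."""
    events = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:  # lines that do not fit the pattern are skipped
            events.append(
                {"user_id": int(match["user_id"]), "action": match["action"]}
            )
    return events


def join_with_customers(events, customers):
    """Stand-in for the database step: join extracted events to table rows."""
    by_id = {customer["user_id"]: customer for customer in customers}
    return [
        {**event, "region": by_id[event["user_id"]]["region"]}
        for event in events
        if event["user_id"] in by_id
    ]
```

For example, passing the lines `"user=7 action=login"` and `"user=9 action=buy"` together with a customer table that contains only user 7 produces a single joined row for user 7.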
Enterprise High Availability
The Greenplum DCA meets the reliability requirements of the most mission-critical enterprises by delivering multi-level, self-healing fault tolerance, which includes automated failover, fully online self-healing resynchronization, and multiple levels of redundancy and integrity checking. Data availability employs hardware RAID protection with hot spare drives on standby for minimum recovery time. This ensures no data loss and minimal performance reduction during automated RAID disk rebuilds.
Reliable Backup and Increased Availability
Enhanced availability and disaster recovery are available for Greenplum Database Modules through EMC Symmetrix VMAX. VMAX snapshots and Symmetrix Remote Data Facility (SRDF) replicate data from the Greenplum Database to distant data centers for archival and maintenance of another DCA in warm standby.
Multiple HA configurations are available for Hadoop users. Users of Greenplum HD Modules can configure Hadoop to maintain a chosen number of data replicas, distributed throughout the Hadoop cluster. Users of Greenplum’s HD Compute Modules (which rely on EMC Isilon for HDFS storage) can leverage Isilon’s redundant storage and distant replication to maintain archives as well as a warm standby replica of the primary Hadoop cluster and its data.
Proactive EMC One Support Structure
Customer Support Services provide the resources and services to quickly and proactively resolve solution-related issues and questions, thereby assuring business continuity and a highly available data environment. EMC’s global maintenance and support is available 24x7 via live chat, online service request management, live telephone support, and onsite support through the industry’s leading global field service organization.
In addition, Greenplum DCAs support EMC Secure Remote Support (dial-home). Through this feature, the appliance provides around-the-clock remote and pre-emptive troubleshooting by automatically alerting the EMC Support Center of critical hardware and software errors for Greenplum Database, HD and DIA Modules. The EMC Support Center can then remotely diagnose issues and dispatch replacement hardware and support technicians as required.
DCA Cluster Configuration
Greenplum DCAs can begin as small as ¼ rack and be expanded to as many as 12 racks, combining one or multiple module types. All modules within the DCA share a redundant 10 Gigabit Ethernet interconnect for maximum co-processing performance and availability. Administration is conducted over a separate 1 Gigabit Ethernet interconnect to support single-point administration of the entire DCA.
DCAs with Greenplum Database Modules begin with a pair of host processor nodes (a primary and a standby) and a minimum of one ¼ rack Database Module in the first DCA rack. From this initial configuration, the appliance can grow to a total of 48 Database Modules, or 12 racks. Each added module contributes segment processor nodes to the MPP database cluster.
Greenplum HD Modules work somewhat differently and do not require a separate host processor pair. Instead, the first Hadoop Module is dedicated to the JobTracker and NameNode functions. A minimum Hadoop configuration therefore begins with two modules and, like Greenplum Database, scales to a maximum of 12 racks.
Whether your ideal DCA configuration is pure database, pure Hadoop, or, more commonly, a mixed cluster of module types, the required network switching is included in each rack as needed to maximize performance while maintaining redundancy, whether across a single rack or 12 racks.
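The sizing rules in this section can be captured in a short planning check. This is a sketch under stated assumptions, not a vendor sizing tool: it assumes every Database and HD Module occupies ¼ rack (the text states this only for Database Modules), it ignores DIA Modules and the rack space used by host processor nodes and switches, and the function name `plan_dca` is hypothetical.

```python
import math

MODULES_PER_RACK = 4  # assumption: each counted module occupies a quarter rack
MAX_RACKS = 12
MAX_MODULES = MODULES_PER_RACK * MAX_RACKS  # 48
MIN_HD_MODULES = 2  # the first HD Module hosts JobTracker and NameNode


def plan_dca(database_modules: int = 0, hd_modules: int = 0) -> dict:
    """Validate a module mix against the limits above and estimate rack count."""
    if database_modules < 0 or hd_modules < 0:
        raise ValueError("module counts must be non-negative")
    total = database_modules + hd_modules
    if total == 0:
        raise ValueError("a DCA needs at least one module")
    if 0 < hd_modules < MIN_HD_MODULES:
        raise ValueError("a Hadoop configuration starts at two HD Modules")
    if total > MAX_MODULES:
        raise ValueError(f"at most {MAX_MODULES} modules fit in {MAX_RACKS} racks")
    return {
        "racks": math.ceil(total / MODULES_PER_RACK),
        # Database configurations need a primary and standby host pair.
        "needs_host_pair": database_modules > 0,
        # Inferred: HD Modules other than the first are available for data processing.
        "hd_worker_modules": max(hd_modules - 1, 0),
    }
```

Under these assumptions, a mixed configuration of four Database Modules and two HD Modules needs two racks and a host processor pair, and leaves one HD Module for data processing.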
Greenplum Data Computing Appliance
Product overview of the Greenplum Data Computing Appliance: UAP Edition. The EMC® Greenplum® Data Computing Appliance (DCA) is an integrated analytics platform that accelerates analysis of Big Data assets within a single integrated appliance. Integrating Greenplum Database for analytics-optimized SQL on structured data, Greenplum HD for Hadoop-based processing of unstructured data, and Greenplum partner analytics, BI and ETL applications provides flexibility.
Product overview of Greenplum Database - massively parallel processing (MPP) database built to support the next generation of “Big Data” warehousing and large-scale analytics processing.
Whitepaper
Backup and Recovery of the Greenplum DCA Using EMC Data Domain
Insight into how EMC Data Domain de-duplication storage systems effectively deal with data growth, retention requirements and recovery service levels essential to businesses.
Case Study
Greenplum Data Computing Appliance reduces data latency, puts data to work faster, and drives competitive advantage.
Video
Brainpad: Using Big Data to Increase Marketing ROI
The Cloudstock solution from Brainpad leverages Big Data to help organizations improve their marketing return on investment. Using the Greenplum Data Computing Appliance, Brainpad has increased performance and reduced data upload times from two days to 20 minutes.
Blog
Flexible Appliances: Conundrum, Oxymoron, or Opportunity?
Some argue that the term “Flexible Appliance” is an oxymoron, often with good reason. Among analytics users, the emergence of preconfigured analytics appliances has been a mixed blessing. While they're simpler to deploy than build-it-yourself MPP clusters, many appliances force users to trade off flexibility against simplicity.