3. Pivotal Confidential–Internal Use Only
Pivotal HD Architecture
[Architecture diagram]
Apache components: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, YARN, Zookeeper (resource management & workflow)
Pivotal HD added value (Pivotal HD Enterprise): Command Center (configure, deploy, monitor, manage), Hadoop Virtualization (HVE), Data Loader, and HAWQ – Advanced Database Services (Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics)
4.
Pivotal HD Components
• HDFS – The Hadoop Distributed File System acts as the storage layer for Hadoop
• MapReduce – Parallel processing framework used for data computation in Hadoop
• Hive – Structured data-warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop
• Pig – High-level procedural language for data pipeline/data flow processing in Hadoop
• HBase – NoSQL, key-value data store on top of HDFS
• Mahout – Library of scalable machine-learning algorithms
• Spring Hadoop – Integrates the Spring framework into Hadoop
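The MapReduce model listed above can be sketched in miniature: a map step emits key/value pairs and a reduce step aggregates them by key. This is a plain-Python illustration of the concept, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop stores data in HDFS", "MapReduce processes data in Hadoop"]
counts = reduce_phase(map_phase(lines))  # e.g. counts["hadoop"] == 2
```

In real Hadoop the map and reduce phases run in parallel across the cluster, with a shuffle step grouping pairs by key between them; the data flow is the same as in this toy version.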
5.
Pivotal HD Value-Added Components
GPHD includes:
• Installation and Configuration Manager (ICM) – cluster installation, upgrade, and expansion tools.
• GP Command Center – visual interface for cluster health, system metrics, and job monitoring.
• Hadoop Virtualization Extension (HVE) – enhances Hadoop to support virtual node awareness and enables greater cluster elasticity.
• GP Data Loader – parallel loading infrastructure that supports “line speed” data loading into HDFS.
• Isilon Integration – extensively tested at scale, with guidelines for compute-heavy, storage-heavy, and balanced configurations.
Pivotal HD adds the following to GPHD:
• Advanced Database Services (HAWQ) – high-performance, “true SQL” query interface running within the Hadoop cluster.
• Extensions Framework (GPXF) – support for HAWQ interfaces on external data providers (HBase, Avro, etc.).
• Advanced Analytics Functions (MADlib) – parallelized machine-learning and data-mining functions at scale.
6.
Pivotal Core Components & Versions

Component       GPHD 1.2 Core Distribution   Pivotal HD Enterprise
Hadoop          1.0.3                        2.0.2
HBase           0.92.1                       0.94.2
Hive            0.8.1                        0.9.1
Mahout          0.6                          0.8.0
Pig             0.9.2                        0.10.0
Zookeeper       3.3.5                        3.4.5
Flume           1.2.0                        1.3.1
Sqoop           1.4.1                        1.4.2
Spring Hadoop   –                            1.0.0
7.
Data Loader
[Diagram] Data Loader sits between data sources and HDFS, moving data via streams (push/pull), connectors, and Flume.
• Supported sources: files (HDFS, NFS, local), HTTP, FTP
• Core functions: data source registration, data destination registration, copy strategy optimization, data copy job management, data processing
• Interfaces: web GUI, CLI, and REST APIs
8.
Command Center
Simple and complete cluster management: configure, deploy, monitor, manage, analyze
• Install and configure Hadoop components and services
• Centralized interface for Pivotal HD cluster monitoring, diagnostics, and management
• Live and historical Hadoop system metrics analysis
9.
Command Center – Monitor, Manage, and Analyze
• Host-, application-, and job-level monitoring of performance across the entire Pivotal HD cluster
• Visualize and analyze live and historical Hadoop cluster information through the Command Center dashboard
• Quick diagnosis of functional or performance issues
10.
Hadoop Virtualization Extensions (HVE)
• HVE enables Hadoop to support more effective virtual deployments
• This creates the opportunity to provision and scale the compute and storage processes independently, resulting in:
• Much better resource utilization
• Improved resource allocation and consumption
• Support for multi-tenancy
11.
HAWQ Delivers
SQL compliant
World-class query optimizer
Interactive query
Horizontal scalability
Robust data management
Common Hadoop formats
Deep analytics
12.
Xtension Framework
• An advanced version of GPDB external tables
• Enables combining HAWQ data and Hadoop data in a single query
• Supports connectors for HDFS, HBase, and Hive
• Provides an extensible framework API to enable custom connector development for other data sources
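A custom connector for the framework essentially has to answer two questions: what splittable pieces does the source have, and how are rows read from each piece. The sketch below is a hypothetical illustration of that shape; the names are not the actual GPXF/Xtension Framework API:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Connector(ABC):
    """Hypothetical sketch of an Xtension-style connector interface."""

    @abstractmethod
    def fragments(self) -> list:
        """Return the splittable units (files, regions, ...) of the source."""

    @abstractmethod
    def read(self, fragment) -> Iterator[tuple]:
        """Yield rows from one fragment for the query engine to scan."""

class InMemoryConnector(Connector):
    """Toy connector over a dict mapping fragment names to row lists."""
    def __init__(self, tables):
        self.tables = tables
    def fragments(self):
        return list(self.tables)
    def read(self, fragment):
        yield from self.tables[fragment]

conn = InMemoryConnector({"part1": [(1, "a")], "part2": [(2, "b")]})
rows = [row for frag in conn.fragments() for row in conn.read(frag)]
```

Splitting the source into fragments is what lets HAWQ scan an external source in parallel, one fragment per worker.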
Editor's notes
Start with basic HD and then comment about the addition of a true SQL interface
Copy strategies:
Uniform – uniformly distribute copy tasks between workers to maximize throughput
Data Locality Driven – for HDFS/local-disk sources, the job planner assigns copy tasks to the local/closest worker node: when deployed on the source, reads go to the local worker; when deployed on the destination HDFS, writes go to local nodes. The job planner gets locality information from the NameNode. Patches to the HDFS schedulers add:
HDFS rack awareness to reduce inter-rack traffic
Local disk awareness to assign read/write MapReduce tasks to workers local to the data
Dynamic – used for loading large numbers of small files; assigns small tasks to workers and re-assigns them at runtime
Connection Limited – limits the number of read connections to the source FTP/HTTP server
Intelligent – chooses the correct copy strategy based on source type and data
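The "Intelligent" selection described above could be sketched as a simple decision function. The rules and thresholds here are assumptions drawn only from these notes, not Data Loader's actual logic:

```python
def choose_strategy(source_type, file_count, avg_file_size_mb):
    """Pick a copy strategy from the source type and data shape.
    Thresholds are illustrative assumptions, not Data Loader's real ones."""
    if source_type in ("ftp", "http"):
        # Remote servers tolerate only a limited number of read connections.
        return "connection-limited"
    if source_type in ("hdfs", "local"):
        if file_count > 10_000 and avg_file_size_mb < 1:
            # Many small files: assign small tasks, re-assign at runtime.
            return "dynamic"
        # Use NameNode locality info to keep reads/writes node-local.
        return "data-locality-driven"
    # Otherwise just spread copy tasks evenly across workers.
    return "uniform"
```

For example, a 50,000-file directory of sub-megabyte files on HDFS would route to the dynamic strategy, while an FTP source always routes to the connection-limited one.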
ICM
What is it – GPHD Manager for Greenplum HD
GPHD Manager is part of the Command Center package.
GPHD Manager supports installation and default configuration of Hadoop, MapReduce, Hive, HBase, Zookeeper, Pig, and Mahout
GPHD Manager provides a command-line interface, built on a RESTful web-services API, to install, configure, and start/stop the various Hadoop services
It stores all metadata from the Hadoop cluster nodes and services in a PostgreSQL database to keep track of cluster configuration and runtime stats
How it works
GPHD Manager is installed on an admin node that is typically separate from the Hadoop cluster nodes
Functionality of the GPHD Manager admin node is exposed as REST-based web-service APIs
Leverages Puppet to manage the installation of Hadoop services (master/slave mode)
Benefits
Provides a centralized, role-based configuration and deployment tool
Includes validation – machine validation, reachability validation, dependency validation
A single GPHD Manager admin node can manage multiple Hadoop clusters (integrated into GP Command Center in the next release)
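A CLI built on a RESTful API like the one described above typically just translates commands into HTTP requests against the admin node. The endpoint paths and payload shape below are hypothetical placeholders, not the documented ICM API:

```python
import json
import urllib.request

# Hypothetical base URL and routes; the real ICM REST API's paths and
# payloads are not documented here, so these are placeholders only.
BASE = "http://admin-node:8080/api/v1"

def build_request(action, cluster, payload=None):
    """Build an HTTP request against a (hypothetical) cluster-management API."""
    url = f"{BASE}/clusters/{cluster}/{action}"
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        url,
        data=data,
        method="POST" if data else "GET",
        headers={"Content-Type": "application/json"},
    )

# Start HDFS and MapReduce on cluster "prod" (request built, not sent).
req = build_request("services/start", "prod", {"services": ["hdfs", "mapreduce"]})
```

Because every operation is a plain HTTP call, the same API can back both the CLI and other tooling, which is what lets the admin node manage multiple clusters from one place.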
Command Center
What is it – application for monitoring & management of the GPHD environment
Web-based interface that provides standard infrastructure system metrics and Hadoop-specific metrics
Designed to make a Hadoop administrator’s job easier
How it works
Visualizes live and historical data in the GPCC dashboard to display the state of the Hadoop cluster (stored in a backend GPDB)
GPCC provides:
Host-level monitoring (all information specific to a particular host)
Application-level monitoring (HDFS information across the whole cluster)
Job-level monitoring and analysis (information on particular MapReduce jobs)
Benefits
Verify that the GPHD cluster is running efficiently and without problems
Quickly diagnose functional or performance issues with the Hadoop cluster
GPSM
What is it – GPHD System Management & Monitoring
Web-services component of GPHD 1.2 that allows applications to easily monitor and manage one or more Hadoop clusters
GP-SM is designed to work with Greenplum Command Center as the UI
Leverages GPDB to store/analyze both GPHD application and system metrics
How it works
GP-SM provides a Thrift plugin to retrieve data from GPHD
Live and historical data are stored in a GPDB instance with a pre-defined schema (gpperfmon)
The Thrift plugin exposes its APIs via web services
GPCC uses this web service to visualize live and historical data of the GPHD environment
Benefits
Serves as the backend system for Greenplum Command Center
Enables users to analyze both live and historical Hadoop system information
Topology Extensions:
Enable Hadoop to recognize an additional virtualization layer for read/write/balancing
Enable compute/data node separation without losing locality
Elasticity Extensions:
Ability to adjust resource allocation (CPU, memory, map/reduce slots) for compute nodes
Enables multiple compute VMs to share common HDFS data VMs
HVE – allows HDFS to be virtualization-aware
Serengeti – deployment tool for virtualized Hadoop
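The topology extension effectively inserts a node-group layer between rack and host, so replica placement can avoid putting two replicas on VMs that share a physical machine. A simplified, illustrative sketch of that placement rule (not HVE's actual implementation):

```python
def place_replicas(nodes, count):
    """Pick `count` nodes whose topology paths (/rack/nodegroup/host)
    differ in the node-group component, so that no two replicas land
    on VMs backed by the same physical machine."""
    chosen, used_groups = [], set()
    for path in nodes:
        _, rack, nodegroup, host = path.split("/")
        if nodegroup not in used_groups:
            chosen.append(path)
            used_groups.add(nodegroup)
        if len(chosen) == count:
            break
    return chosen

nodes = [
    "/r1/ng1/vm1", "/r1/ng1/vm2",  # two VMs on the same physical host
    "/r1/ng2/vm3", "/r2/ng3/vm4",
]
replicas = place_replicas(nodes, 3)  # skips vm2: same node group as vm1
```

Without the node-group layer, stock HDFS would treat vm1 and vm2 as independent hosts and could place two replicas on the same physical machine, losing the fault-tolerance the replication was meant to provide.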
This is the first true SQL engine for Hadoop
Supported formats and features by connector:
HDFS – delimited text, sequence file, GPDB writable format, protocol buffers, Avro
HBase – predicate pushdown
Hive – RCFile, text file, sequence file
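The connector/format matrix in these notes can be captured as a small lookup table; this is just a reference sketch mirroring the list above, not an API:

```python
# Connector capabilities as listed in the notes above.
XTENSION_CONNECTORS = {
    "hdfs": {"delimited text", "sequence file", "gpdb writable",
             "protocol buffers", "avro"},
    "hbase": {"predicate pushdown"},
    "hive": {"rcfile", "text file", "sequence file"},
}

def supports(connector, fmt):
    """Check whether a connector handles a given format or feature."""
    return fmt in XTENSION_CONNECTORS.get(connector, set())
```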