Greenplum hadoop
- 2. Integrated Analysis of Structured and Unstructured Data, with Application Cases
Greenplum
Enable Big Data Analytics
Jimmy Chiu (邱垂吉)
Technical Consultant, EMC Greenplum Taiwan
© Copyright 2012 EMC Corporation. All rights reserved. 2
- 3. Volume, Variety, Velocity, Value + Complexity
Big Data dimensions: Volume, Velocity, Variety, Complexity
Outcomes: new insights on customers, products, and operations; contextual and location-aware delivery to any device
Data types: Documents, Transactional Data, Smart Grid, Images, Audio, Text, Video
• Volume: data volumes approaching multiple petabytes
• Velocity: data being generated and ingested for analysis in real time
• Variety: tabular, documents, e-mail, metering, network, video, image, audio
• Complexity: different standards, domain rules, and storage formats per data type
Source: Gartner, March 2011
- 4. Sample Big Data Scenarios
• Loan processing in banking
• Auto insurance in P&C insurance
• Smart grid analytics in utilities/energy
• Proactive emergency response in healthcare
• Video analytics in retail
• Real-time statistical process control in manufacturing
- 5. Big Data Analytics For Competitive Advantage
Today's Business Model: Suppliers, Manufacturing, Inventory, Physical Assets, Distribution, Services, Mass Marketing, Customers
Big Data Analytics Business Model: Suppliers, Manufacturing, Inventory, Physical Assets, Distribution, Services, Personal Marketing, Additional Profits, Customers
Questions Big Data analytics helps answer:
• Who are my most valuable customers?
• What are my most important products?
• What are my most successful campaigns?
- 6. Big Data meets Fast Data
Social and Personal – Every Minute:
• Google gets more than 2 million search queries
• About 47,000 people download an app
• Some 100,000 tweets hit Twitter
• Almost 300,000 people log on to Facebook
Business and Transactional:
• CERN (European Organization for Nuclear Research) generates 40TB/sec of scientific data
• Wal-Mart processes 1 million transactions per hour
• The world's top trading systems currently execute trades in under 50 microseconds
• The New York Stock Exchange generates 1TB of new trading data daily
- 7. Working together, they enable entirely New Business Models
Big Data allows you to find opportunities you didn't know you had.
Fast Data allows you to respond to opportunities before they are gone.
In the Financial Services Industry, large quantities of historical data need to be processed against a growing number of fast-moving data feeds.
Batch processing is no longer a suitable solution!
- 8. Effective Customer Segmentation is all about blending Structured and Unstructured Data
– Transaction data (structured data) tells you what the customer did.
– Unstructured data can tell you why they did it, why some others did not, what else they need or want, and what problems they may have.
- 9. Big Data Architecture Requirements
"Solving Big Data challenge involves more than just managing volumes of data." (Gartner)
• Multiple data types: structured, semi-structured, unstructured
• Integrated data stores: real-time, traditional, data warehouse
• Modern development tools: Java, lightweight messages, mobile-enabled
• Cloud-enabled: elastic scale, self-healing
Beware point solutions – integration is critical!
- 12. Architecture of Greenplum
• Flexible framework for processing large datasets: process large datasets with support for both SQL and MapReduce
• Master servers optimize queries for the most efficient query execution
• Interconnect for continuous pipelining of data processing
• Segment servers process queries close to the data in parallel
• MPP Scatter/Gather streaming for fast loading of data
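The division of work between master and segment servers described on this slide can be pictured with a small sketch. It is only an illustration in plain Python, not Greenplum code: the segment count, the toy hash, and the function names `distribute`, `segment_sum`, and `parallel_sum_by_key` are all assumptions made for the example. Each "segment" aggregates only the rows it holds, and the "master" combines the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # assumption: a small illustrative cluster size


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Assign each (key, amount) row to a segment by hashing its key."""
    segments = [[] for _ in range(num_segments)]
    for key, amount in rows:
        # Sum of character codes is a stable toy hash, not Greenplum's own.
        segments[sum(map(ord, key)) % num_segments].append((key, amount))
    return segments


def segment_sum(local_rows):
    """Segment step: aggregate only the rows stored on this segment."""
    partial = defaultdict(int)
    for key, amount in local_rows:
        partial[key] += amount
    return dict(partial)


def parallel_sum_by_key(rows, num_segments=NUM_SEGMENTS):
    """Run the segment step in parallel, then combine partials on the master."""
    segments = distribute(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(segment_sum, segments))
    result = {}
    for partial in partials:
        for key, total in partial.items():
            result[key] = result.get(key, 0) + total
    return result


sales = [("tw", 10), ("us", 5), ("tw", 7), ("jp", 3), ("us", 1)]
totals = parallel_sum_by_key(sales)
print(totals)
```

Because every row with the same key lands on the same segment in this sketch, the final combine step on the master only merges a handful of small dictionaries rather than scanning the raw rows.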
- 13. Greenplum MPP Shared-Nothing Architecture
• Share everything, e.g. a Unix server: a single server runs the database (DB) on its own disk
• Share disk, e.g. Oracle RAC: DB nodes on the intranet share disks over SAN/FC
• Share nothing (MPP), e.g. Greenplum: a master on the intranet coordinates DB nodes, each with its own disk
- 14. Benefits of the Greenplum Database Architecture
• Simplicity
– Parallelism is automatic – no manual partitioning required
– No complex tuning required – just load and query
– High availability (HA)
– Best-of-breed x86 and Ethernet networking technologies
• Scalability
– Linear scalability
– Each node adds storage, query performance, loading performance
• Flexibility
– Full parallelism for SQL92, SQL99, SQL2003 OLAP, MapReduce
– Any schema (star, snowflake, 3NF, hybrid, etc.)
– Rich extensibility and language support (Perl, Python, R, C, etc.)
– Structured, semi-structured, and unstructured data
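To make "parallelism is automatic" and "each node adds storage" more concrete, here is a minimal sketch of hash distribution, assuming rows are placed on segments by hashing a distribution key. The CRC32 hash and the helper names `segment_for` and `rows_per_segment` are illustrative assumptions, not Greenplum's actual hashing or API. The loop shows how the busiest segment's share of rows shrinks as segments are added.

```python
import zlib
from collections import Counter


def segment_for(key, num_segments):
    """Map a distribution key to a segment index with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def rows_per_segment(keys, num_segments):
    """Count how many rows each segment would store."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]


customer_ids = range(100_000)
for n in (4, 8, 16):
    load = rows_per_segment(customer_ids, n)
    print(n, "segments: busiest segment holds", max(load), "rows")
```

No manual partitioning decision appears anywhere in the sketch: the only input is the key, which is the sense in which the slide calls the parallelism automatic.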
- 15. Greenplum and Hadoop
Analytics across data types:
• Structured: ERP/CRM
• Semi-Structured: Machine Data, Logs
• Unstructured: Images/Sound
Workloads: ad-hoc analysis on dynamic data; batch reporting on static data
- 16. Big Data Analytics: The Power of Data Co-Processing
Greenplum Chorus: Analytic Productivity & Tool Integration
Greenplum Commander: End-to-end Platform Management & Control
Data Access And Query: SQL, MapReduce, SAS, MADLib, Mahout, R, and others
Greenplum Database: SQL Engine For Structured Data
• In-database Advanced Analytics
• Extreme performance on commodity hardware
Greenplum Hadoop: MapReduce Engine For Unstructured Data
• Enterprise-ready Apache Hadoop
• Faster, more dependable, and easier to use
Parallel data exchange between Greenplum Database and Greenplum Hadoop
Network: Parallel Loading Of All Data Types
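For readers less familiar with the MapReduce engine named on this slide, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, applied to word counting over log lines. It is plain Python for illustration only and does not use the Hadoop API; the function names and the sample log lines are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases over an iterable of text lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts)
```

In a real cluster the map and reduce calls run on many nodes and the shuffle moves data over the network; the sketch keeps only the data flow.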
- 17. Greenplum Hadoop
• Greenplum HD
– Enterprise-ready Apache Hadoop
– Proven at scale in the 1,000-node Analytics Workbench
– Single product with 2 storage options (Isilon & HDFS)
• Enterprise Edition becomes Greenplum MR:
– Advanced features
– 100% API compatible
– Software-only product
- 18. AWB Update
Analytics Workbench Operational!
• 1025 nodes operational
• 1011 nodes with GPHD installed
• 8 projects in total have been onboarded, ranging from university collaboration to partner technology evaluation
Proposals accepted by customer engagement team: info@analyticsworkbench.com
• Engagement team will learn project objectives
• JEDI council approves/disapproves project based on technical feasibility and alignment with company goals
• Projects informed of decisions and timelines
Cluster access via http://portal.analyticsworkbench.com/
- 19. Apache Hadoop Pain Points
Monitoring Solution
• Poor Job and Application Monitoring
• Non-existent Performance Monitoring
Operability and Manageability
• Complex System Configuration and Manageability
• No Data Format Interoperability & Storage Abstractions
Performance
• Poor Dimensional Lookup Performance
• Very poor Random Access and Serving Performance
- 20. Greenplum MR: Enterprise Edition Stack
100% Apache interface, with enhanced monitoring
• Hive
• Pig
• HBase
• Zookeeper
• MapReduce Framework (MapRed)
• Distributed File System
- 21. Greenplum MR: Enterprise Edition
Enterprise-Ready Hadoop Platform for Unstructured Data
Faster
• 2 – 5x faster than Apache Hadoop
Reliable
• High Availability
• Mirroring
Easier to Use
• NFS mountable
• Graphical System Management
- 22. Greenplum MR Simple Management
• Health Monitoring
• Cluster Administration
• Application Provisioning
- 24. Greenplum MR Delivers True Return on Investment
• NFS direct access to simply load and access data directly in a Hadoop cluster
• Enables standard tools and utilities to work directly on data contained in Hadoop
• Heatmap user interface provides full cluster visibility and control
• Eliminates all single points of failure
• High Availability for Job Tracker, NameNode & NFS
• Snapshots allow point-in-time data protection and recovery
• Mirroring for business continuity includes wide area replication support
• Speeds jobs by 2X – 5X
• Provides faster performance with ½ the hardware
• Substantial capital and operating expense savings
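The point about NFS direct access is that ordinary file I/O should work on cluster data once the cluster is mounted like any other file system. A minimal sketch of what that looks like from a script follows. The mount point is an assumption: a temporary directory stands in for it here so the example runs anywhere, and the file name `app.log` and helper `count_matching_lines` are hypothetical.

```python
import tempfile
from pathlib import Path


def count_matching_lines(path, needle):
    """Count lines containing `needle` using plain file I/O."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if needle in line)


# Stand-in for the NFS mount point. On a real cluster this would be the
# directory where an administrator mounted the export (path is an assumption).
mount_point = Path(tempfile.mkdtemp())
log_file = mount_point / "app.log"
log_file.write_text("ok\nERROR disk\nok\nERROR net\n", encoding="utf-8")

errors = count_matching_lines(log_file, "ERROR")
print(errors)
```

Nothing in the script is Hadoop-specific, which is the sense in which the slide says standard tools and utilities can work directly on the data.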
- 25. EMC Greenplum
Fastest data loading, advanced analytics
DATA IN
Scatter/Gather Streaming technology for the world's fastest data loading
• Eliminate data load bottlenecks
• Clean and integrate new data
• Several loading options, ranging from bulk load updates to micro-batching for near real-time processing
IN-DATABASE ANALYTICS
Optimized for fast query execution and linear scalability
• Move processing closer to data
• Shared-nothing, massively parallel processing (MPP) scale-out architecture
• Computing is automatically optimized and distributed across resources
• Provides the best concurrent multi-workload performance
DECISIONS OUT
Unified data access for greater insight and value from data
• Enable parallel analysis across the enterprise
• Open platform with broad language support
• Certified enterprise connectivity and integration with most business intelligence; extract, transform, and load (ETL); and management products
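Micro-batching, mentioned among the loading options, sits between one large bulk load and row-by-row inserts: incoming records are grouped into small batches and each batch is loaded as a unit. The sketch below shows the grouping step only. The batch size and the `load_batch` function, which appends to an in-memory list instead of a database, are hypothetical stand-ins for illustration.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of at most `batch_size` records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_batch(batch, table):
    """Hypothetical loader: appends to an in-memory list, not a database."""
    table.extend(batch)
    return len(batch)


table = []
sizes = [load_batch(b, table) for b in micro_batches(range(10), batch_size=4)]
print(sizes)
```

A smaller batch size lowers the delay before new data becomes queryable, at the cost of more load operations; the right value depends on the workload.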
- 26. EMC Big Data Analytics Reference Architecture
Data Input
• Data sources: Documents, Mobile, Machine Data, Multimedia, Web/Social, LOB data, ERP, CRM, POS
Integration
• Data Quality, MDM, ETL
Data Stores and Access
• Hadoop Ecosystem*: HDFS, Map-Reduce
• NoSQL Stores: Key Values, Documents, Other NoSQL
• SQL Stores: Data Marts, Enterprise Data Warehouse, Federated Data Warehouse
• Parallel data exchange between the stores
Data Analysis
• Statistics, Genetic Algorithms, Data Mining, OLAP, Operations Research, Neural Nets, Data Visualization
Presentation & Delivery
• Alerts, Dashboards, Reports, Spreadsheets, Mobile, BI as a Service
• Business units: BU 1, BU 2, BU 3
Scope labels: structured data sources; traditional data integration; traditional data warehousing; big data analytics ramifications
*Hadoop Ecosystem includes: Hive, Pig, Mahout, HBase, ZooKeeper, Oozie, Sqoop, Avro
- 27. Architecture for Business Value
Business Value
Chorus for Collaboration Analytics
Greenplum MR (GPMR) side
• Analytics: self-developed apps (Java API); analytics tools (Mahout)
• HBase
• MapRFS (GPMR); MapRFS: C++; MR: C++
• Load performance: 2~5X; high availability; stable
• Input files: .csv, .txt
Greenplum Database (GPDB) side
• Analytics: self-developed apps (JDBC, ODBC); analytics tools (SAS, R, MADlib and more)
• SAS & MADlib: in GPDB, in memory
• ETL from source databases
- 28. Big Data And EMC
1. Petabyte Scale Data Storage
2. Unified Analytics Platform
3. Data Science
4. New Analytic Applications
- 29. SAS / Greenplum Product Overview
SAS High Performance Computing
SAS Access for Integration
• Provides integration capability to a number of databases
• Allows for increased performance of Base SAS Procs
• Products: SAS Access for Greenplum
SAS In-Database Processing
• Requires SAS Enterprise Miner in order to be of value
• Will lead to significant improvement in performance
• Products: SAS Access for Greenplum, SAS Grid Manager, SAS Enterprise Miner, SAS Scoring Accelerator for Greenplum
SAS In-Memory Analytics
• New functionality from SAS that requires a dedicated database appliance
• Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
• Products: SAS Access for Greenplum, SAS Grid Manager, SAS High Performance Analytics
- 30. SAS and Greenplum UAP Integrated Architecture
Users: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User (the data science team), plus Data Platform Admin
Layers, top to bottom:
• SAS Business Intelligence
• Greenplum Chorus - Analytic Productivity Layer
• SAS Analytics
• Data Access & Query Layer (SAS ACCESS, SQL, MapReduce)
• Greenplum Database and Greenplum Hadoop
• Private/Hybrid Cloud Infrastructure or Appliance
• SAS Information Management
- 31. In A Single Unified Analytics Platform
Self-Service
Iterative, Agile
Transparent, Real-time Collaboration
Structured & Unstructured Data
Analyze Petabytes Of Current Data
Virtual, Scale Out Architecture