Greenplum Analytics Workbench

APURVA DESAI




© Copyright 2012 EMC Corporation. All rights reserved.
Overview




What is Hadoop?
 What is Hadoop?
        –    Distributed computing paradigm
        –    File system – HDFS
        –    Processing framework – MapReduce
        –    Languages – Pig, Hive
        –    Key-value store – HBase
 Why is it important?
        – BIG Data is everywhere
        – BIG Data is mostly unstructured
        – Need affordable, scalable NoSQL processing
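
As a rough illustration of the MapReduce processing model named above, here is a minimal single-process sketch in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and exist only for this example, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for the shuffle stage."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values per key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three stages: map, shuffle, reduce."""
    return reduce_phase(shuffle(map_phase(lines)))
```

For example, `word_count(["big data", "Big cluster"])` counts `big` twice and the other two words once each.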


Analytics Workbench - Motivation
 Open source
        – Hadoop industry is nascent
        – BIG Data development needs scale


 Greenplum
        – Innovation & Experimentation platform
        – Contribute to the community
        – GPDB & GPHD - Mixed mode environment




Greenplum Vision




Buildout Pre-requisites
 Hardware systems integration


 Hadoop experience


 Program Management


 Partner ecosystem

          Greenplum has in-house expertise

Team Introduction
 System Integration
        – Greg, Eric, Don, Dave, Patrick
 Program Management
        – Mike, Joe
 Hadoop
        – Apurva, Judes, Clinton, Chandra, Ashwin




Partners
 Intel
        – 2,000 Westmere CPUs
 Mellanox
        – 1,000+ NICs
        – 72 IB switches
 Micron
        – 6,000 8GB DRAM
 Seagate
        – 12,000 2TB Drives
 Supermicro
        – 1,000 Chassis/MB


Partners
 Switch
        – Hosting Facilities
 VMware
        – Operational Support
        – Rubicon




Peek @ the Cluster




Cluster Statistics
 Largest cluster for Apache Hadoop validation!

 # Of Physical Hosts : > 1,000 (> 10,000 with VMs)
 # Of Racks : 54 (50 just for the DataNodes)
 # Of Processors : > 24,000
 Amount Of RAM : > 48TB
 Amount of Disk Capacity : > 24PB
        – “Equivalent to nearly half of the entire written works of
          mankind from the beginning of recorded history”
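
The disk and RAM figures can be cross-checked against the hardware counts on the Partners slide (12,000 2TB drives and 6,000 8GB DRAM). The short worked check below is only a sketch: it assumes decimal units (1 PB = 1,000 TB, 1 TB = 1,000 GB), and the function names are hypothetical.

```python
# Hardware counts as listed on the Partners slide.
DRIVES = 12_000
DRIVE_TB = 2   # capacity per drive, in TB
DIMMS = 6_000
DIMM_GB = 8    # capacity per DRAM unit, in GB


def raw_disk_pb(drives: int = DRIVES, tb_per_drive: int = DRIVE_TB) -> float:
    """Raw disk capacity in PB, assuming 1 PB = 1,000 TB."""
    return drives * tb_per_drive / 1_000


def total_ram_tb(dimms: int = DIMMS, gb_per_dimm: int = DIMM_GB) -> float:
    """Total RAM in TB, assuming 1 TB = 1,000 GB."""
    return dimms * gb_per_dimm / 1_000


print(raw_disk_pb())   # 24.0
print(total_ram_tb())  # 48.0
```

These are raw figures. With HDFS's default replication factor of 3, the capacity available for unique data would be roughly a third of the raw disk total.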



Namenode




Job Tracker




CPU




Use Cases




Hadoop Review




Hadoop Shuffle




Initial Use Cases
 Apache Hadoop Validation
 Mellanox UDA
 Terasort Benchmark




Apache Hadoop Validation
 Purpose
        – Run Apache Hadoop Validation at Scale
        – Validate cluster configuration


 Various Configurations Validated
        – Standard Out Of The Box Configs
        – Configs Modified For IO Intensive Processing




Apache Hadoop Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation. Y axis: Execution Time (Min), 0 to 1.2. Series: 1000 Nodes.]
Apache Hadoop Findings
 Apache BigTop for integration tests
 Functional validation passed as expected


 Next Steps
        – Identify integration cases
        – Contribute back to BigTop
        – Stabilize Hadoop 0.23




Mellanox UDA - Overview
 RDMA in the Hadoop shuffle stage
 Register map and reduce task buffers
 Hadoop JT (JobTracker) for task completion
 Copy sorted map task output to reduce input
 Perform in-memory merge at the reducer
 Avoid disk spills for large inputs
 Reduce CPU load for sort and merge
 GP + Mellanox collaboration
        – Open sourcing UDA
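
The in-memory merge step listed above can be pictured as a k-way merge of map outputs that are already sorted by key. The sketch below uses Python's standard `heapq.merge` purely as an illustration. It is not UDA's implementation and involves no RDMA; the function name `merge_sorted_map_outputs` and the sample records are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[str, int]  # (key, value) pair emitted by a map task


def merge_sorted_map_outputs(outputs: Iterable[List[Record]]) -> List[Record]:
    """Merge per-map-task outputs, each already sorted by key, into one
    key-ordered list for the reducer without re-sorting all records."""
    return list(heapq.merge(*outputs, key=lambda record: record[0]))


# Two hypothetical map task outputs, each sorted by key.
map_task_a = [("apple", 1), ("cherry", 1)]
map_task_b = [("banana", 1), ("cherry", 1)]

merged = merge_sorted_map_outputs([map_task_a, map_task_b])
print(merged)
```

Because each input is already sorted, the merge only compares the current head of each input, which is why keeping the data in memory avoids both a full re-sort and intermediate disk spills.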




Mellanox UDA Preliminary Results
 Preliminary UDA results provided by Mellanox
 Show improvement with UDA vs Vanilla Hadoop.
 Better CPU utilization
 Reduced execution time


 Next Steps
        – Run on Analytics Workbench, scheduled for June 2012
        – Configuration on the workbench to turn it on/off




TeraSort Benchmark
 Industry standard benchmark
 Good validation of configuration
 3 Steps
        – Teragen – Generate 1TB of data
        – Terasort – Sort generated data
        – Teravalidate – Validate the sort
 Measure time for each step
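
A minimal sketch of how the three steps might be driven and timed from Python. It assumes the `hadoop` command is on the PATH and that the examples jar is named `hadoop-examples-1.0.0.jar`; the jar name, the HDFS directory names, and the helper names are assumptions, not taken from the slides. TeraGen takes a row count rather than a byte size, and each row is 100 bytes, so 1 TB corresponds to 10 billion rows.

```python
import subprocess
import time
from typing import Dict, List

ROW_BYTES = 100                              # TeraGen writes 100-byte rows
EXAMPLES_JAR = "hadoop-examples-1.0.0.jar"   # assumed jar name and location


def rows_for(terabytes: int) -> int:
    """Number of 100-byte rows needed for the given size (1 TB = 10**12 bytes)."""
    return terabytes * 10**12 // ROW_BYTES


def build_commands(terabytes: int,
                   base_dir: str = "/benchmarks/terasort") -> Dict[str, List[str]]:
    """Build the command line for each step; base_dir is a hypothetical HDFS path."""
    gen_dir = f"{base_dir}/input"
    sort_dir = f"{base_dir}/output"
    report_dir = f"{base_dir}/report"
    prefix = ["hadoop", "jar", EXAMPLES_JAR]
    return {
        "teragen": prefix + ["teragen", str(rows_for(terabytes)), gen_dir],
        "terasort": prefix + ["terasort", gen_dir, sort_dir],
        "teravalidate": prefix + ["teravalidate", sort_dir, report_dir],
    }


def run_timed(commands: Dict[str, List[str]]) -> Dict[str, float]:
    """Run each step in order and return its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for step, cmd in commands.items():
        start = time.monotonic()
        subprocess.run(cmd, check=True)  # raises if the step fails
        timings[step] = time.monotonic() - start
    return timings
```

On a machine with access to the cluster, `run_timed(build_commands(1))` would run the 1 TB case and return one timing per step.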




TeraSort Benchmark Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation - TeraSort. Y axis: Execution Time in Sec, 0 to 9. X axis: # of TB Generated and Sorted (1 TB, 10 TB). Series: TeraGen, TeraSort.]
TeraSort Benchmark Findings
 Minimal tuning of configuration
 Results are within expected range.
 Next Steps
        – Tune the cluster for optimal performance
        – Use the benchmark for every new release




Lessons Learnt




Buildout Progress
[Chart: Buildout progress. Y axis: Number of nodes, 0 to 1,200. X axis: Month, Dec '11 to April '12. Series: racked, ready.]
“Real” Hadoop Cluster




Categories
 Racking & Stacking
 Networking
 Non-Hadoop Hosts
 Base OS Setup
 Hadoop Deployment
 Post deployment
 Process




In Closing




Upcoming work
 Workbench Tasks
        –    Load various data sets
        –    Load GPDB, Hive, HBase, ZooKeeper, etc.
        –    Load Chorus, Command Center, UAP stack
        –    VM provisioning
        –    Various audits
 On-boarding candidates
        –    HD Education
        –    Apache Hadoop Build & Validate
        –    Mellanox UDA
        –    Intel HiBench
        –    Big data benchmarking
        –    High-resolution image processing, etc.



A day in the life @ Switch




Q&A




Other Relevant Greenplum Sessions
Session                                                  Presenter          Times
Unified Analytics Platform Introduction                  Brian Wilson       Tues 10:00-11:00   Thurs 1:00-2:00
Greenplum Database Overview                              Michael Crutcher   Mon 8:30-9:30      Wed 10:00-11:00
Greenplum Hadoop Overview                                Susheel Kaushik    Mon 10:00-11:00    Wed 4:15-5:15
Greenplum DCA Overview                                   Hanxi Chen         Mon 4:00-5:00      Thurs 10:00-11:00
Greenplum Analytics Workbench                            Apurva Desai       Wed 8:30-9:30      Thurs 10:00-11:00
Analytics on Hadoop                                      Don Miner          Tues 11:30-12:30   Thurs 8:30-9:30
Optimizing Greenplum Database on VMware                  Kevin O’Leary      Mon 4:00-5:00      Tues 4:15-5:15
  Virtualized Infrastructure
Big Data Driven Businesses in Action:                    Mike Maxey         Wed 4:15-5:15      Thurs 11:30-12:30
  Creating Real Business Value Using
  Greenplum UAP (Panel w/4 Customers)
Analytics for Business Value: Collaboration              Josh Klahr         Mon 10:00-11:00    Wed 2:45-3:45
Disruptive Data Science: How Data                        Annika Jimenez,    Tues 4:15-5:15     Thurs 11:30-12:30
  Science and Big Data are Transforming                  David Dietrich
  Business, IT and People




Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

  • 1. Greenplum Analytics Workbench, Apurva Desai. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. Overview
  • 3. What is Hadoop? Distributed computing paradigm; file system (HDFS); processing framework (MapReduce); languages (Pig, Hive); key-value store (HBase). Why is it important? Big data is everywhere; big data is mostly unstructured; need affordable, scalable NoSQL processing.
  • 4. Analytics Workbench - Motivation. Open source: Hadoop industry is nascent; big data development needs scale. Greenplum: innovation and experimentation platform; contribute to the community; GPDB & GPHD mixed-mode environment.
  • 5. Greenplum Vision
  • 6. Buildout Pre-requisites. Hardware systems integration; Hadoop experience; program management; partner ecosystem. Greenplum has in-house expertise.
  • 7. Team Introduction. System Integration: Greg, Eric, Don, Dave, Patrick. Program Management: Mike, Joe. Hadoop: Apurva, Judes, Clinton, Chandra, Ashwin.
  • 8. Partners. Intel: 2,000 Westmere CPUs. Mellanox: 1,000+ NICs, 72 IB switches. Micron: 6,000 8GB DRAM. Seagate: 12,000 2TB drives. Supermicro: 1,000 chassis/MB.
  • 9. Partners. Switch: hosting facilities. VMware: operational support, Rubicon.
  • 10. Peek @ the Cluster
  • 11. Cluster Statistics. Largest cluster for Apache Hadoop validation! Physical hosts: > 1,000 (> 10,000 with VMs). Racks: 54 (50 just for the DataNodes). Processors: > 24,000. RAM: > 48TB. Disk capacity: > 24PB, "equivalent to nearly half of the entire written works of mankind from the beginning of recorded history".
  • 12. Namenode
  • 13. Job Tracker
  • 14. CPU
  • 15. Use Cases
  • 16. Hadoop Review
  • 17. Hadoop Shuffle
  • 18. Initial Use Cases. Apache Hadoop validation; Mellanox UDA; TeraSort benchmark.
  • 19. Apache Hadoop Validation. Purpose: run Apache Hadoop validation at scale; validate cluster configuration. Various configurations validated: standard out-of-the-box configs; configs modified for IO-intensive processing.
  • 20. Apache Hadoop Preliminary Results. [Chart: Apache Hadoop-1.0.0 validation; y-axis: Execution Time (Min); series: 1000 Nodes]
  • 21. Apache Hadoop Findings. Apache BigTop for integration tests; functional validation passed as expected. Next steps: identify integration cases; contribute back to BigTop; stabilize Hadoop 0.23.
  • 22. Mellanox UDA - Overview. RDMA in the Hadoop shuffle stage; register map and reduce task buffers; Hadoop JT for task completion; copy sorted map task output to reduce input; perform in-memory merge at reduce; avoid disk spills for large inputs; reduce CPU load for sort and merge. GP + Mellanox collaboration: open sourcing UDA.
  • 23. Mellanox UDA Preliminary Results. Preliminary UDA results provided by Mellanox show improvement with UDA vs. vanilla Hadoop: better CPU utilization; reduced execution time. Next steps: run on Analytics Workbench, scheduled for June 2012; configuration on the workbench to turn it on/off.
  • 24. TeraSort Benchmark. Industry-standard benchmark; good validation of configuration. 3 steps: Teragen (generate 1TB of data); Terasort (sort generated data); Teravalidate (validate the sort). Measure time for each step.
  • 25. TeraSort Benchmark Preliminary Results. [Chart: Apache Hadoop-1.0.0 validation - TeraSort; y-axis: Execution Time in Sec; x-axis: # of TB Generated and Sorted (1 TB, 10 TB); series: TeraGen, TeraSort]
  • 26. TeraSort Benchmark Findings. Minimal tuning of configuration; results are within expected range. Next steps: tune the cluster for optimal performance; use the benchmark for every new release.
  • 27. Lessons Learnt
  • 28. Buildout Progress. [Chart: y-axis: Number of nodes; x-axis: Month (Dec '11, Jan '12, Feb '12, Mar '12, April '12); series: racked, ready]
  • 29. "Real" Hadoop Cluster
  • 30. Categories. Racking & Stacking; Hadoop Deployment; Networking; Post Deployment; Non-Hadoop Hosts; Process; Base OS Setup.
  • 31. In Closing
  • 32. Upcoming Work. Workbench tasks: load various data sets; load GPDB, Hive, HBase, ZooKeeper, etc.; load Chorus, Command Center, UAP stack; VM provisioning; various audits. On-boarding candidates: HD Education; Apache Hadoop Build & Validate; Mellanox UDA; Intel HiBench; big data benchmarking; high-resolution image processing, etc.
  • 33. A day in the life @ Switch
  • 34. Q&A
  • 35. Other Relevant Greenplum Sessions (Session | Presenter | Times)
      Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00
      Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30, Wed 10:00-11:00
      Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15
      Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00
      Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00
      Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30
      Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O'Leary | Mon 4:00-5:00, Tues 4:15-5:15
      Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30
      Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45
      Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30
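
Slide 24 describes the TeraSort benchmark as three steps (Teragen, Terasort, Teravalidate), each timed separately. The sketch below shows one way such a run could be scripted; it is not the procedure used on the Analytics Workbench. It assumes the `teragen`, `terasort` and `teravalidate` programs from the Hadoop examples jar, where `teragen` takes a row count and each row is 100 bytes. The jar file name, the HDFS paths and the helper names (`teragen_rows`, `terasort_commands`, `run_timed`) are hypothetical and would need adjusting for a real installation.

```python
import subprocess
import time

ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(terabytes: int) -> int:
    """Number of rows TeraGen must write to produce `terabytes` decimal TB."""
    return terabytes * 10**12 // ROW_BYTES


def terasort_commands(terabytes: int, examples_jar: str, base_dir: str) -> dict:
    """Build the command line for each of the three benchmark steps."""
    # Hypothetical HDFS directory layout, chosen only for this illustration.
    gen_dir = f"{base_dir}/gen-{terabytes}tb"
    sort_dir = f"{base_dir}/sort-{terabytes}tb"
    report_dir = f"{base_dir}/validate-{terabytes}tb"
    prefix = ["hadoop", "jar", examples_jar]
    return {
        "teragen": prefix + ["teragen", str(teragen_rows(terabytes)), gen_dir],
        "terasort": prefix + ["terasort", gen_dir, sort_dir],
        "teravalidate": prefix + ["teravalidate", sort_dir, report_dir],
    }


def run_timed(commands: dict) -> dict:
    """Run the steps in order and return wall-clock seconds per step."""
    timings = {}
    for step, argv in commands.items():
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # requires a working Hadoop client
        timings[step] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Only print the commands here; pass them to run_timed() on a real cluster.
    jar = "hadoop-examples-1.0.0.jar"  # assumed jar name
    for step, argv in terasort_commands(1, jar, "/benchmarks").items():
        print(step, " ".join(argv))
```

Keeping command construction separate from execution lets the row-count arithmetic (1 TB corresponds to 10,000,000,000 rows of 100 bytes) be checked without a cluster, while `run_timed` gives the per-step timing the slide asks for.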