SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
myHadoop - Hadoop-on-Demand
on Traditional HPC Resources
Sriram Krishnan, Ph.D.
sriram@sdsc.edu
Acknowledgements
• Mahidhar Tatineni
• Chaitanya Baru
• Jim Hayes
• Shava Smallen
Outline
• Motivations
• Technical Challenges
• Implementation Details
• Performance Evaluation
Motivations
• An open source tool for running Hadoop jobs on
HPC resources
• Easy to configure and use for the end-user
• Plays nicely with existing batch systems on HPC resources
• Why do we need such a tool?
• End-users: I already have Hadoop code – and I only have
access to regular HPC-style resources
• Computer Scientists: I want to study the implications of
using Hadoop on HPC resources
• And I don’t have root access to these resources
Some Ground Rules
• What this presentation is:
• A “how-to” for running Hadoop jobs on HPC resources
using myHadoop
• A description of the performance implications of using
myHadoop
• What this presentation is not:
• Propaganda for the use of Hadoop on HPC resources
Main Challenges
• Shared-nothing (Hadoop) versus HPC-style
architectures
• In terms of philosophies and implementation
• Control and co-existence of Hadoop and HPC
batch systems
• Typically, both Hadoop and HPC batch systems
(viz., SGE, PBS) need complete control over the
resources for scheduling purposes
Traditional HPC Architecture
[Diagram: compute cluster with minimal local storage, connected to a parallel file system]
Shared-Nothing (MapReduce-Style) Architecture
[Diagram: compute/data cluster with local storage, nodes interconnected via Ethernet]
Hadoop and HPC Batch Systems
• Access to HPC resources is typically via batch systems
– viz. PBS, SGE, Condor, etc
• These systems have complete control over the compute resources
• Users typically can’t log in directly to the compute nodes (via ssh) to
start various daemons
• Hadoop manages its resources using its own set of
daemons
• NameNode & DataNode for Hadoop Distributed File System (HDFS)
• JobTracker & TaskTracker for MapReduce jobs
• Hadoop daemons and batch systems can’t co-exist
seamlessly
• Will interfere with each other’s scheduling algorithms
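For context, this is roughly what starting and stopping those daemons looks like on a cluster you fully control, assuming a stock Apache Hadoop 0.20.x install under $HADOOP_HOME. These are precisely the steps an ordinary user cannot run by hand on scheduler-managed compute nodes, and the steps myHadoop has to drive from inside a batch job.

```bash
# Minimal sketch, assuming stock Apache Hadoop 0.20.x in $HADOOP_HOME and a
# conf/ directory whose masters/slaves files list the cluster nodes.

# Start HDFS daemons: NameNode on the master, DataNodes on the slaves.
$HADOOP_HOME/bin/start-dfs.sh

# Start MapReduce daemons: JobTracker on the master, TaskTrackers on the slaves.
$HADOOP_HOME/bin/start-mapred.sh

# ... run MapReduce jobs ...

# Tear the daemons down again.
$HADOOP_HOME/bin/stop-mapred.sh
$HADOOP_HOME/bin/stop-dfs.sh
```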
myHadoop Requirements
1. Enabling execution of Hadoop jobs on shared HPC
resources via traditional batch systems
a) Working with a variety of batch systems (PBS, SGE, etc)
2. Allowing users to run Hadoop jobs without needing
root-level access
3. Enabling multiple users to simultaneously execute
Hadoop jobs on the shared resource
4. Allowing users to either run a fresh Hadoop instance
each time (a), or store HDFS state for future runs (b)
myHadoop Architecture
[Diagram: jobs enter through the batch processing system (PBS, SGE) [1]; the Hadoop daemons run on the allocated compute nodes [2, 3]; HDFS is placed either on node-local storage (non-persistent mode) [4(a)] or on the parallel file system (persistent mode) [4(b)]. Bracketed numbers refer to the requirements on the previous slide.]
Implementation Details: PBS, SGE
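The implementation details on this slide live in a figure that does not survive as text. As a hedged illustration of what the PBS-specific configuration step has to do, the sketch below turns the scheduler's node list into Hadoop's masters/slaves files and a per-user, per-job configuration directory; variable names and file layout are assumptions for illustration, not the exact myHadoop scripts (the SGE path would read $PE_HOSTFILE instead of $PBS_NODEFILE).

```bash
#!/bin/bash
# Hypothetical sketch of the PBS configuration step; names and layout are
# illustrative only, not the actual myHadoop implementation.

# A private, per-job Hadoop configuration directory lets multiple users
# (and multiple jobs per user) run concurrent Hadoop instances without
# root access or clashes over a shared configuration.
export HADOOP_CONF_DIR=$HOME/myhadoop-conf.$PBS_JOBID
mkdir -p "$HADOOP_CONF_DIR"
cp "$HADOOP_HOME"/conf/* "$HADOOP_CONF_DIR/"

# PBS lists the allocated nodes (one line per core) in $PBS_NODEFILE;
# de-duplicate it to get Hadoop's slaves file, and use the first node as
# the master that will host the NameNode and JobTracker.
sort -u "$PBS_NODEFILE" > "$HADOOP_CONF_DIR/slaves"
head -n 1 "$HADOOP_CONF_DIR/slaves" > "$HADOOP_CONF_DIR/masters"

# Finally, point HDFS either at node-local scratch space (non-persistent
# mode) or at a per-user directory on the parallel file system (persistent
# mode) by rewriting the copied *-site.xml files.
```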
User Workflow
[Diagram: the user's batch job bootstraps a Hadoop instance on the allocated nodes, runs its MapReduce work, and then tears the instance down.]
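Putting bootstrap and teardown together, a user's submission script might look roughly like the sketch below. The pbs-configure.sh and pbs-cleanup.sh names follow the myHadoop distribution's conventions, but treat the exact script names, flags, and paths as assumptions rather than exact syntax.

```bash
#!/bin/bash
#PBS -N myhadoop-example
#PBS -l nodes=4:ppn=8,walltime=01:00:00

# Illustrative sketch only: script names, flags, and paths are assumed.
export HADOOP_HOME=$HOME/hadoop-0.20.2
export HADOOP_CONF_DIR=$HOME/myhadoop-conf.$PBS_JOBID

# BOOTSTRAP: build the per-job configuration from the PBS node list and
# start the Hadoop daemons on the allocated compute nodes.
$MY_HADOOP_HOME/bin/pbs-configure.sh -n 4 -c "$HADOOP_CONF_DIR"
$HADOOP_HOME/bin/start-all.sh

# Run the actual Hadoop work: stage data into HDFS, run the job, stage out.
$HADOOP_HOME/bin/hadoop --config "$HADOOP_CONF_DIR" dfs -copyFromLocal input/ input
$HADOOP_HOME/bin/hadoop --config "$HADOOP_CONF_DIR" jar \
    "$HADOOP_HOME"/hadoop-0.20.2-examples.jar wordcount input output
$HADOOP_HOME/bin/hadoop --config "$HADOOP_CONF_DIR" dfs -copyToLocal output/ results/

# TEARDOWN: stop the daemons and clean up node-local scratch space
# (HDFS data on the parallel file system is kept in "persistent" mode).
$HADOOP_HOME/bin/stop-all.sh
$MY_HADOOP_HOME/bin/pbs-cleanup.sh -n 4
```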
Performance Evaluation
• Goals and non-goals
• Study the performance overhead and implications of myHadoop
• Not to optimize/improve existing Hadoop code
• Software and Hardware
• Triton Compute Cluster (http://tritonresource.sdsc.edu/)
• Triton Data Oasis (Lustre-based parallel file system) for data storage, and for
HDFS in “persistent mode”
• Apache Hadoop version 0.20.2
• Various parameters tuned for performance on Triton
• Applications
• Compute-intensive: HadoopBlast (Indiana University)
• Modest-sized inputs – 128 query sequences (70K each)
• Compared against NR database – 200MB in size
• Data-intensive: Data Selections (OpenTopography Facility at SDSC)
• Input size from 1GB to 100GB
• Sub-selecting around 10% of the entire dataset
HadoopBlast
Data Selections
Related Work
• Recipe for running Hadoop over PBS in the blogosphere
• http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using-pbs.html
• myHadoop is “inspired” by their approach – but is more general-purpose and configurable
• Apache Hadoop On Demand (HOD)
• http://hadoop.apache.org/common/docs/r0.17.0/hod.html
• Only PBS support, needs external HDFS, harder to use, and has
trouble with multiple concurrent Hadoop instances
• CloudBatch – batch queuing system on clouds
• Use of Hadoop to run batch systems like PBS
• Exact opposite of our goals – but similar approach
Center for Large-Scale Data Systems Research (CLDS)
[Organization chart: CLDS, an industry-university consortium on software for large-scale data systems, with an Industry Advisory Board, an Academic Advisory Board, and Visiting Fellows. Activity areas include benchmarking, performance evaluation and systems development projects (cloud storage architecture; cloud storage and performance benchmarking); information metrology (the “How Much Information?” project across public, private, and personal data; data growth and information management); and industry forums and professional education (industry interchange; management and technical forums).]
• Student internships
• Joint collaborations
Summary
• myHadoop – an open source tool for running Hadoop
jobs on HPC resources
• Without need for root-level access
• Co-exists with traditional batch systems
• Allows “persistent” and “non-persistent” modes to save HDFS state
across runs
• Tested on SDSC Triton, TeraGrid and UC Grid resources
• More information
• Software: https://sourceforge.net/projects/myhadoop/
• SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR-2011-2-Hadoop.pdf
Questions?
• Email me at sriram@sdsc.edu
Appendix
Hadoop configuration parameters tuned for Triton (property, value, description):

core-site.xml:
io.file.buffer.size 131072 Size of read/write buffer
fs.inmemory.size.mb 650 Size of in-memory FS for merging outputs
io.sort.mb 650 Memory limit for sorting data

hdfs-site.xml:
dfs.replication 2 Number of times data is replicated
dfs.block.size 134217728 HDFS block size in bytes (128 MB)
dfs.datanode.handler.count 64 Number of handlers to serve block requests

mapred-site.xml:
mapred.reduce.parallel.copies 4 Number of parallel copies run by reducers
mapred.tasktracker.map.tasks.maximum 4 Max map tasks to run simultaneously
mapred.tasktracker.reduce.tasks.maximum 2 Max reduce tasks to run simultaneously
mapred.job.reuse.jvm.num.tasks 1 Reuse the JVM between tasks
mapred.child.java.opts -Xmx1024m Large heap size for child JVMs
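These values normally live in the per-job configuration directory that myHadoop generates. For quick experiments, many of them can also be overridden per job on the command line, assuming the job parses Hadoop's generic options via ToolRunner/GenericOptionsParser (as the bundled examples do); a hedged sketch:

```bash
# Per-job override of tuning parameters via -D, without editing the
# *-site.xml files; assumes the job uses Hadoop's generic option parsing.
$HADOOP_HOME/bin/hadoop --config "$HADOOP_CONF_DIR" jar \
    "$HADOOP_HOME"/hadoop-0.20.2-examples.jar wordcount \
    -D mapred.reduce.parallel.copies=4 \
    -D mapred.child.java.opts=-Xmx1024m \
    input output
```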
Data SelectCounts on Dash


Speaker Notes

  1. Mahidhar, Chaitan: architecture, prototyping. Jim: myHadoop roll. Shava: UC Grid support.
  2. Motivations – why myHadoop. Technical challenges – what the problems are. Implementation details – how. Performance evaluation – findings.
  3. Note: we didn’t make up these requirements. They came out of our existing requirements as end-users and computer scientists. Most of us have access to resources such as the TeraGrid, UC Grid, and SDSC Triton. I have no official affiliation with any of those resources – but I had access to them, and wanted to use them for performance studies.
  4. Co-location of data and compute in shared-nothing architectures – no centralized shared storage in the Hadoop model. High-performance parallel file systems for HPC resources.
  5. 1) Scheduler access; 2, 3) non-root, concurrent users; 4) persistent and non-persistent modes.
  6. Data loads not very dominant – more CPU-intensive. Performance slightly better on local disk – more contention on Oasis, and Lustre is optimized for large files, not lots of smaller ones.
  7. Data loads are the dominating factor for the non-persistent runs. It makes more sense to leave the data on the shared file system – and use that as the HDFS location. Output writes are also time-consuming in this case – so we might as well leave the data in HDFS for future runs.