Apache Hadoop
Grid Patterns and Anti-Patterns


                 Arun C Murthy
                 Yahoo! Grid Team, CCDI
                 acmurthy@apache.org
Hello!
Who am I?
  Yahoo!
  ›    Grid Team (CCDI)

  ›    Lead the Apache Hadoop Map-Reduce Development Team

  Apache
  ›    Developer on Apache Hadoop since April 2006

  ›    Committer

  ›    Member of Apache Hadoop PMC




                                                     2      8/18/10
Apache Hadoop
The Software
  Hadoop Distributed File System
  Hadoop Map-Reduce
  Open source from Apache
  Written in Java
  Runs on
  ›    Linux, Solaris, Mac OS X

  ›    Commodity hardware




Storage
HDFS
  Designed to store large files
  Stores files as large blocks (64 to 128 MB)
  Each block stored on multiple servers
  Data is automatically re-replicated as needed
  Accessed from command line, Java API or C API




Data Processing
Hadoop Map-Reduce
  Map-Reduce is a programming model for efficient distributed computing
  Efficiency from
  ›    Streaming through data, reducing seeks
  ›    Pipelining

  A good fit for a lot of applications
  ›    Log processing
  ›    Web index building




Hadoop in the Enterprise
Usage and Importance
  A large number of corporations use Apache Hadoop at scale for several business-critical
   applications
  ›    Large, shared, multi-tenant deployments to minimize fragmentation across organizations

  Millions of dollars at stake!
  ›    Yahoo
       •    Advertising, Search

       •    40,000 machines and counting

  http://wiki.apache.org/hadoop/PoweredBy




Hadoop in the Enterprise
… however
  Hadoop isn’t a silver bullet (at least as yet!)
  ›    Hadoop still depends on users to utilize it effectively
  ›    Pig/Hive help, but one can still write badly suited queries

  Need to adapt legacy applications to Hadoop, especially the Map-Reduce paradigm
  Efficient usage of Hadoop clusters is critical to getting return on the investment




Hadoop Map-Reduce
Overview
  It works like a Unix pipeline:
  ›    cat input | grep | sort           | uniq -c | cat > output
  ›    Input     | Map  | Shuffle & Sort | Reduce  | Output

  Works on key/value pairs
  ›    map: <k1, v1> -> list(<k2, v2>)
  ›    reduce: <k2, list(v2)> -> list(<k3, v3>)
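
The pipeline analogy can be made concrete with a tiny in-memory word count. This is only a sketch of the programming model in plain Python, not Hadoop's API: the function names (map_words, reduce_sum, run_job) are made up for illustration, and a real job runs the three phases across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator

Pair = tuple[str, int]


def map_words(_offset: int, line: str) -> Iterator[Pair]:
    """map: one input record (offset, line) yields zero or more (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_sum(word: str, counts: Iterable[int]) -> Iterator[Pair]:
    """reduce: one key with all of its values yields (word, total)."""
    yield word, sum(counts)


def run_job(lines: list[str]) -> list[Pair]:
    # Map phase: every input record is processed independently.
    intermediate = [pair for i, line in enumerate(lines) for pair in map_words(i, line)]
    # Shuffle & sort: group all values by key, then order the keys.
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per distinct key, in sorted key order.
    return [out for key in sorted(groups) for out in reduce_sum(key, groups[key])]
```

Calling run_job(["a b a", "B c"]) groups the two spellings of "b" under one key because the map lower-cases each word before emitting it.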




Best Practices
Input to Applications
  Optimized to process large data-sets
  Pattern: Coalesce processing of multiple small input files into a smaller number of maps,
   and use larger HDFS block-sizes for processing very large data-sets.
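
One way to picture the coalescing pattern is a greedy packer that groups small files into splits of roughly one target size, so that each map handles one split. This is a simplified sketch, not Hadoop's input-format logic: the function name and the 128 MB target in the usage note are assumptions, and it ignores data locality entirely.

```python
def coalesce_into_splits(file_sizes: dict[str, int], target_bytes: int) -> list[list[str]]:
    """Greedily pack files (name -> size in bytes) into splits of about target_bytes."""
    splits: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for name, size in sorted(file_sizes.items()):
        # Close the current split once adding this file would exceed the target.
        if current and current_bytes + size > target_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits
```

For example, 100 files of 4 MB each packed against a 128 MB target need 4 maps instead of 100.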




Best Practices
Map-Reduce - Mappers
  Process multiple-files per map for jobs with very large number of small input files
  Process large chunks of data per-map for large-scale data-processing
  ›    PetaSort – 66,000 maps with 12.5G per map

  The shuffle cross-bar (maps * reduces) is a key performance factor
  Pattern: Applications should use fewer maps to process data in parallel, as few as
   possible without having really bad failure recovery cases.
  ›    Unless the application's maps are heavily CPU bound, there is almost no reason to ever require
       more than 60,000-70,000 maps for a single application
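
The effect of data-per-map on the map count and on the shuffle cross-bar can be checked with simple arithmetic. The sketch below assumes one map per fixed-size chunk of input; the 100 TB input and the 1,000 reduces in the usage note are invented example figures, not numbers from the talk.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def num_maps(input_bytes: int, bytes_per_map: int) -> int:
    """Maps needed when each map processes at most bytes_per_map (integer ceiling)."""
    return -(-input_bytes // bytes_per_map)


def shuffle_crossbar(maps: int, reduces: int) -> int:
    """Size of the cross-bar: every reduce fetches a segment from every map."""
    return maps * reduces
```

With 100 TB of input, num_maps(100 * TB, 128 * MB) gives 819,200 maps, while num_maps(100 * TB, 2 * GB) gives 51,200. At 1,000 reduces that is the difference between roughly 819 million and 51 million map-output segments to fetch.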




Best Practices
Map-Reduce – Combiner and Shuffle
  Combiner
  ›    Map-side aggregation to help reduce network traffic for the shuffle
  ›    Cost of using combiners

  Shuffle
  ›    Compression of intermediate output

  Pattern: Use combiners judiciously, ensure they really work! Compress intermediate
   outputs
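
Whether a combiner "really works" can be estimated by comparing the number of records before and after map-side aggregation. The following is a plain-Python sketch of that check for a word-count style job; the function names are illustrative and nothing here is Hadoop's Combiner interface.

```python
from collections import Counter


def map_word_counts(lines):
    """Emit one (word, 1) record per word, as a word-count map would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(records):
    """Map-side aggregation: sum the values of records that share a key."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())


def shuffle_savings(records):
    """Fraction of records the combiner removes before the shuffle."""
    if not records:
        return 0.0
    return 1.0 - len(combine(records)) / len(records)
```

When keys repeat within a map's output the savings are large; when every key is distinct, shuffle_savings returns 0.0 and the combiner only adds cost.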




Best Practices
Map-Reduce – Reducers
  Efficiency depends on shuffle, and the cross-bar
  Configure appropriate number of reduces
  ›    Too few reduces hurt the nodes
  ›    Too many hurt the cross-bar

  Pattern: Applications should ensure that each reduce processes at least 1-2 GB of
   data, and at most 5-10 GB of data, in most scenarios.
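
The per-reduce guideline translates into a range of acceptable reduce counts for a given amount of shuffled data. The sketch below picks 2 GB and 10 GB as the bounds, which are assumptions chosen from within the ranges on the slide, and assumes data is spread evenly across reduces.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def reduce_count_range(shuffle_bytes: int,
                       min_per_reduce: int = 2 * GB,
                       max_per_reduce: int = 10 * GB) -> tuple[int, int]:
    """Return (fewest, most) reduces keeping each reduce within the size bounds."""
    # Fewest reduces: each one takes as much as allowed (integer ceiling).
    fewest = max(1, -(-shuffle_bytes // max_per_reduce))
    # Most reduces: each one still receives at least the minimum.
    most = max(fewest, shuffle_bytes // min_per_reduce)
    return fewest, most
```

For 10 TB of shuffled data this gives between 1,024 and 5,120 reduces; for inputs smaller than the minimum it falls back to a single reduce.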




Best Practices
Map-Reduce – Output
  Number of output artifacts is linear w.r.t. number of configured reduces
  Compress outputs
  Use appropriate file-formats for the output
  ›    E.g. compressed text-files are not a great idea if you aren’t using a splittable codec

  Think of the consumer of your data-set!
  Consider using larger HDFS block-sizes.
  Pattern: Application outputs should be a few large files, with each file spanning multiple
   HDFS blocks and appropriately compressed.
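
Because the number of output files grows with the number of reduces, a quick estimate shows whether each file will span several HDFS blocks. This sketch assumes exactly one output file per reduce and an even spread of data; the function name and the figures in the usage note are made up for illustration.

```python
def output_layout(output_bytes: int, reduces: int, block_bytes: int) -> dict:
    """Estimate file count, average file size and blocks per file for a job's output."""
    avg_file_bytes = output_bytes / reduces
    return {
        "files": reduces,
        "avg_file_bytes": avg_file_bytes,
        "blocks_per_file": avg_file_bytes / block_bytes,
    }
```

With 1 TB of output and 128 MB blocks, 2,048 reduces give files of about 4 blocks each, while 65,536 reduces give files of about 16 MB, far smaller than a single block.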




Best Practices
Map-Reduce – Distributed Cache
  Efficient distribution of read-only files for applications
  Designed for small number of mid-sized files
  Pattern: Applications should ensure that artifacts in the distributed-cache do not
   require more i/o than the actual input to the application tasks




Best Practices
Map-Reduce – Counters
  Global (across all tasks) counters, aggregated by the framework
  Expensive!
  Pattern: Applications should not use more than 10, 15 or 25 custom counters.




Best Practices
Map-Reduce – Total Order Outputs
  Sampling Partitioner
  ›    Do not use a single reducer!
  ›    E.g. Terasort/Petasort benchmarks

  Joining fully sorted data-sets
  ›    The data-sets being joined do not need the same cardinality (e.g. number of buckets)

  Pattern: Use a sampling partitioner with multiple reduces for total-order output, never a single reducer.
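
The idea behind a sampling partitioner can be sketched in a few lines: sample the keys, pick split points at evenly spaced quantiles, and route each key to the reduce whose range contains it. Every reduce then sorts only its own range, and the outputs concatenated in reduce order are globally sorted. The names below are illustrative assumptions, not Hadoop classes.

```python
import bisect
import random


def choose_split_points(sample, num_reduces):
    """Pick num_reduces - 1 boundaries at evenly spaced quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_reduces] for i in range(1, num_reduces)]


def partition(key, split_points):
    """Index of the reduce that owns this key; the index never decreases as keys grow."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_reduces, sample_size, seed=0):
    """Simulate a total-order job: sample, partition, sort per reduce, concatenate."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = choose_split_points(sample, num_reduces)
    reduces = [[] for _ in range(num_reduces)]
    for key in keys:
        reduces[partition(key, split_points)].append(key)
    # Each reduce sorts independently; no single reducer sees all the data.
    return [key for part in reduces for key in sorted(part)]
```

A better sample gives more evenly sized partitions, but the output is fully sorted regardless of how the split points fall.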




Best Practices
HDFS – NameNode and JobTracker Operations
  NameNode: Please don’t hurt me!
  ›    Not yet a silver bullet…
  ›    Do not perform metadata operations for map/reduce tasks at the backend

  Do not contact the JobTracker for cluster statistics etc. from the backend
  Pattern: Applications should not perform any metadata operations on the file-system
   from the backend; these should be confined to the job-client during job-submission.
   Furthermore, applications should be careful not to contact the JobTracker from the
   backend.




Best Practices
Map-Reduce – Logs and Web-UI
  Tasks’ stdout/stderr stored on TaskTrackers
  ›    Limit amount of logs

  JobTracker/NameNode Web-UI
  ›    Do not screen-scrape!




Best Practices
Oozie – Workflows
  Production pipelines are run via Oozie
  Ensure workflows have small number of medium-to-large sized Map-Reduce jobs
  ›    Collapse smaller jobs

  Pattern: A single Map-Reduce job in a workflow should process at least a few tens of
   GB of data.




Anti-Patterns
In a large enough cluster, you see any and all of these…
  Applications not using a higher-level interface such as Pig/Hive
  Processing thousands of small files (sized less than 1 HDFS block, typically 128MB)
   with one map processing a single small file.

  Processing very large data-sets with small HDFS block size i.e. 128MB resulting in tens
   of thousands of maps.
  Applications with a large number (thousands) of maps with a very small runtime (e.g.
   5s).
  Straight-forward aggregations without the use of the Combiner.
  Applications with greater than 60,000-70,000 maps.
  Applications processing large data-sets with very few reduces (e.g. 1).
  ›    Pig scripts processing large data-sets without using the PARALLEL keyword

  ›    Applications using a single reduce for total-order among the output records




Anti-Patterns
  Applications processing data with very large number of reduces, such that each reduce
   processes less than 1-2GB of data.
  Applications writing out multiple, small, output files from each reduce.
  Applications using the DistributedCache to distribute a large number of artifacts and/or
   very large artifacts (hundreds of MBs each).
  Applications using tens or hundreds of counters per task.
  Applications performing metadata operations (e.g. listStatus) on the file-system from
   the map/reduce tasks.
  Applications doing screen scraping of JobTracker web-ui for status of queues/jobs or
   worse, job-history of completed jobs.
  Workflows comprising hundreds or thousands of small jobs processing small
   amounts of data.
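
Several of these anti-patterns can be spotted from a job's summary statistics alone. The sketch below is a hypothetical checker, not an existing Hadoop tool: the function name and parameters are invented, and the cut-offs (70,000 maps, 5-second maps, 1 GB and 10 GB per reduce, 25 counters) are taken from the upper or lower ends of the figures quoted in this talk.

```python
GB = 1024 ** 3


def lint_job(maps, reduces, shuffle_bytes, avg_map_runtime_s, custom_counters):
    """Return a list of anti-pattern warnings for one job's summary statistics."""
    warnings = []
    if maps > 70_000:
        warnings.append("more than 70,000 maps")
    if maps >= 1_000 and avg_map_runtime_s <= 5:
        warnings.append("thousands of very short maps")
    # Average data per reduce, assuming an even spread across reduces.
    per_reduce = shuffle_bytes / max(reduces, 1)
    if reduces > 1 and per_reduce < 1 * GB:
        warnings.append("each reduce processes less than 1 GB")
    if per_reduce > 10 * GB:
        warnings.append("each reduce processes more than 10 GB")
    if custom_counters > 25:
        warnings.append("more than 25 custom counters")
    return warnings
```

A job with 100,000 five-second maps feeding a single reduce would trigger most of these checks at once, while a job with 5,000 maps and 3 GB per reduce passes cleanly.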



Work underway in yahoo-hadoop-0.20.200 to prevent anti-patterns


Blog Post




http://developer.yahoo.net/blogs/hadoop/2010/08/apache_hadoop_best_practices_a.html




Thanks!





Contenu connexe

Tendances

Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for HadoopHadoop User Group
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Hadoop User Group
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick GuideAsim Jalis
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 

Tendances (20)

Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
ImpalaToGo use case
ImpalaToGo use caseImpalaToGo use case
ImpalaToGo use case
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 

En vedette

1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21Hadoop User Group
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop User Group
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21Hadoop User Group
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709Hadoop User Group
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitTom Croucher
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataYahoo Developer Network
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduceHadoop User Group
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 

En vedette (20)

1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Searching At Scale
Searching At ScaleSearching At Scale
Searching At Scale
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
 
Mumak
MumakMumak
Mumak
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining Toolkit
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
HUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - FacebookHUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - Facebook
 
Cloudera Desktop
Cloudera DesktopCloudera Desktop
Cloudera Desktop
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 

Similaire à HUG August 2010: Best practices

Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Yahoo Developer Network
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introductionrajsandhu1989
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionDong Ngoc
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and DeploymentCisco Canada
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-trainingGeohedrick
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 

Similaire à HUG August 2010: Best practices (20)

Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introduction
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-training
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 

Plus de Hadoop User Group

Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010Hadoop User Group
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Hadoop User Group
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation HadoopHadoop User Group
 

Plus de Hadoop User Group (15)

Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Pig at Linkedin
Pig at LinkedinPig at Linkedin
Pig at Linkedin
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 

HUG August 2010: Best practices

  • 1. Apache Hadoop Grid Patterns and Anti-Patterns Arun C Murthy Yahoo! Grid Team, CCDI acmurthy@apache.org
  • 2. Hello! Who am I?   Yahoo! ›  Grid Team (CCDI) ›  Lead the Apache Hadoop Map-Reduce Development Team   Apache ›  Developer on Apache Hadoop since April 2006 ›  Committer ›  Member of Apache Hadoop PMC 2 8/18/10
  • 3. Apache Hadoop The Software   Hadoop Distributed File System   Hadoop Map-Reduce   Open source from Apache   Written in Java   Runs on ›  Linux, Solaris, Mac OS/X ›  Commodity hardware
  • 4. Storage HDFS   Designed to store large files   Stores files as large blocks (64 to 128 MB)   Each block stored on multiple servers   Data is automatically re-replicated on need   Accessed from command line, Java API or C API
  • 5. Data Processing Hadoop Map-Reduce   Map-Reduce is a programming model for efficient distributed computing   Efficiency from ›  Streaming through data, reducing seeks ›  Pipelining   A good fit for a lot of applications ›  Log processing ›  Web index building
  • 6. Hadoop in the Enterprise Usage and Importance   A large number of corporations use Apache Hadoop at scale for several business-critical applications ›  Large, shared, multi-tenant deployments to minimize fragmentation across organizations   Millions of dollars at stake! ›  Yahoo •  Advertising, Search •  40,000 machines and counting   http://wiki.apache.org/hadoop/PoweredBy
  • 7. Hadoop in the Enterprise … however   Hadoop isn’t a silver bullet (at least as yet!) ›  Hadoop still depends on users to utilize it effectively ›  Pig/Hive help, but one can still write badly suited queries   Need to adapt legacy applications to Hadoop, especially the Map-Reduce paradigm   Efficient usage of Hadoop clusters is critical to getting a return on the investment
  • 8. Hadoop Map-Reduce Overview   It works like a Unix pipeline: ›  cat input | grep | sort | uniq -c | cat > output ›  Input | Map | Shuffle & Sort | Reduce | Output   Works on key/value pairs ›  map <k1, v1> -> <k2, v2> ›  reduce <k2, v2> -> <k3, v3>
  • 9. Best Practices Input to Applications   Optimized to process large data-sets   Pattern: Coalesce processing of multiple small input files into a smaller number of maps, and use larger HDFS block-sizes for processing very large data-sets.
  • 10. Best Practices Map-Reduce - Mappers   Process multiple files per map for jobs with a very large number of small input files   Process large chunks of data per map for large-scale data-processing ›  PetaSort – 66,000 maps with 12.5G per map   Pattern: Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.
  • 11. Best Practices Map-Reduce - Mappers   Process multiple files per map for jobs with a very large number of small input files   Process large chunks of data per map for large-scale data-processing ›  PetaSort – 66,000 maps with 12.5G per map   The shuffle cross-bar (maps * reduces) is a key performance factor   Pattern: Applications should use fewer maps to process data in parallel, as few as possible without having really bad failure recovery cases. ›  Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application
  • 12. Best Practices Map-Reduce – Combiner and Shuffle   Combiner ›  Map-side aggregation to help reduce network traffic for the shuffle ›  Cost of using combiners   Shuffle ›  Compression of intermediate output   Pattern: Use combiners judiciously and ensure they really work! Compress intermediate outputs.
  • 13. Best Practices Map-Reduce – Reducers   Efficiency depends on shuffle, and the cross-bar   Configure appropriate number of reduces ›  Too few reduces hurt the nodes ›  Too many hurt the cross-bar   Pattern: Applications should ensure that each reduce processes at least 1-2 GB of data, and at most 5-10 GB of data, in most scenarios.
  • 14. Best Practices Map-Reduce – Output   Number of output artifacts is linear w.r.t. number of configured reduces   Compress outputs   Use appropriate file-formats for the output ›  E.g. compressed text-files are not a great idea if you aren’t using a splittable codec   Think of the consumer of your data-set!   Consider using larger HDFS block-sizes.   Pattern: Application outputs should be a few large files, with each file spanning multiple HDFS blocks and appropriately compressed.
  • 15. Best Practices Map-Reduce – Distributed Cache   Efficient distribution of read-only files for applications   Designed for a small number of mid-sized files   Pattern: Applications should ensure that artifacts in the distributed-cache do not require more i/o than the actual input to the application tasks.
  • 16. Best Practices Map-Reduce – Counters   Global (across all tasks) counters, aggregated by the framework   Expensive!   Pattern: Applications should not use more than 10, 15 or 25 custom counters.
  • 17. Best Practices Map-Reduce – Total Order Outputs   Sampling Partitioner ›  Do not use a single reducer! ›  E.g. Terasort/Petasort benchmarks   Joining fully sorted data-sets ›  Do not need same cardinality e.g. number of buckets for the data-sets being joined   Pattern: Use combiners judiciously, ensure they really work!
  • 18. Best Practices HDFS – NameNode and JobTracker Operations   NameNode: Please don’t hurt me! ›  Not yet a silver bullet… ›  Do not perform metadata operations for map/reduce tasks at the backend   Do not contact the JobTracker for cluster statistics etc. from the backend   Pattern: Applications should not perform any metadata operations on the file-system from the backend; these should be confined to the job-client during job-submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.
  • 19. Best Practices Map-Reduce – Logs and Web-UI   Tasks’ stdout/stderr stored on TaskTrackers ›  Limit amount of logs   JobTracker/NameNode Web-UI ›  Do not screen-scrape!
  • 20. Best Practices Oozie – Workflows   Production pipelines are run via Oozie   Ensure workflows have small number of medium-to-large sized Map-Reduce jobs ›  Collapse smaller jobs   Pattern: A single Map-Reduce job in a workflow should process at least a few tens of GB of data.
  • 21. Anti-Patterns In a large enough cluster, you see any and all of these…   Applications not using a higher-level interface such as Pig/Hive   Processing thousands of small files (sized less than 1 HDFS block, typically 128MB) with one map processing a single small file.   Processing very large data-sets with a small HDFS block size, i.e. 128MB, resulting in tens of thousands of maps.   Applications with a large number (thousands) of maps with a very small runtime (e.g. 5s).   Straightforward aggregations without the use of the Combiner.   Applications with greater than 60,000-70,000 maps.   Applications processing large data-sets with very few reduces (e.g. 1). ›  Pig scripts processing large data-sets without using the PARALLEL keyword ›  Applications using a single reduce for total order among the output records
  • 22. Anti-Patterns   Applications processing data with a very large number of reduces, such that each reduce processes less than 1-2GB of data.   Applications writing out multiple, small, output files from each reduce.   Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).   Applications using tens or hundreds of counters per task.   Applications performing metadata operations (e.g. listStatus) on the file-system from the map/reduce tasks.   Applications doing screen scraping of the JobTracker web-ui for status of queues/jobs or, worse, job-history of completed jobs.   Workflows comprising hundreds or thousands of small jobs processing small amounts of data.   Work is underway in yahoo-hadoop-0.20.200 to prevent anti-patterns.
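
The Input | Map | Shuffle & Sort | Reduce | Output pipeline from slide 8, and the map-side aggregation that slide 12 attributes to the combiner, can be illustrated with a toy word count. This is a minimal in-memory sketch in plain Python, not Hadoop code: map_fn, reduce_fn, combine and run_job are hypothetical names chosen for illustration, and each inner list in splits stands in for the input of one map.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_fn(_offset, line):
    # map <k1, v1> -> <k2, v2>: emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    # reduce: fold all values seen for one key into a single output record
    yield word, sum(counts)


def combine(pairs):
    # Map-side aggregation: pre-sum the counts per key within one map's output
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def run_job(splits, use_combiner=True):
    shuffled = []  # records that would cross the network during the shuffle
    for split in splits:
        map_output = [kv for offset, line in enumerate(split)
                      for kv in map_fn(offset, line)]
        shuffled.extend(combine(map_output) if use_combiner else map_output)
    shuffled.sort(key=itemgetter(0))  # Shuffle & Sort: bring equal keys together
    result = {}
    for key, group in groupby(shuffled, key=itemgetter(0)):
        for out_key, out_value in reduce_fn(key, (v for _, v in group)):
            result[out_key] = out_value
    return result, len(shuffled)


# Two hypothetical input splits, one per map
splits = [["a b a", "a c"], ["b b", "a"]]
with_combiner, shuffled_with = run_job(splits, use_combiner=True)
without_combiner, shuffled_without = run_job(splits, use_combiner=False)
print(with_combiner, shuffled_with, shuffled_without)
```

Both runs produce the same counts, but the run with the combiner moves fewer records through the shuffle. The saving depends on how often keys repeat within a single map's output, which is one reason to check that a combiner really helps before relying on it.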
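
The sizing guidance in the patterns above (no more than 60,000-70,000 maps, and between 1-2 GB and 5-10 GB of data per reduce) reduces to simple arithmetic. The sketch below is a rough planning aid under stated assumptions, not part of Hadoop: it assumes one map per HDFS block of input, takes the outer ends of the quoted ranges as hard limits, and the function names and the 10 TB job are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

MAX_MAPS = 70_000            # upper end of the 60,000-70,000 guideline
MIN_PER_REDUCE = 1 * GB      # lower end of "at least 1-2 GB"
MAX_PER_REDUCE = 10 * GB     # upper end of "at most 5-10 GB"


def ceil_div(a, b):
    # Integer ceiling division, avoiding floating-point rounding
    return -(-a // b)


def estimate_maps(input_bytes, block_bytes):
    # Assumption: one map per HDFS block of input
    return ceil_div(input_bytes, block_bytes)


def reduce_range(shuffle_bytes):
    # Fewest and most reduces that keep each reduce within the guideline
    fewest = max(1, ceil_div(shuffle_bytes, MAX_PER_REDUCE))
    most = max(1, shuffle_bytes // MIN_PER_REDUCE)
    return fewest, most


def anti_patterns(num_maps, num_reduces, shuffle_bytes):
    # Return the guideline violations for a proposed job configuration
    found = []
    if num_maps > MAX_MAPS:
        found.append("too many maps")
    per_reduce = shuffle_bytes / num_reduces
    if per_reduce < MIN_PER_REDUCE:
        found.append("reduces too small")
    elif per_reduce > MAX_PER_REDUCE:
        found.append("reduces too large")
    return found


# Hypothetical job: 10 TB of input, all of it shuffled to the reduces
maps_128mb = estimate_maps(10 * TB, 128 * MB)
maps_512mb = estimate_maps(10 * TB, 512 * MB)
fewest, most = reduce_range(10 * TB)
print(maps_128mb, maps_512mb, fewest, most)
print(anti_patterns(maps_128mb, 1, 10 * TB))
print(anti_patterns(maps_512mb, 2048, 10 * TB))
```

In this hypothetical case, moving from 128 MB to 512 MB blocks brings the map count back under the guideline, and 2,048 reduces gives each reduce about 5 GB to process.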