Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. The talk introduces the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. The talk is largely prescriptive in nature; a useful way to look at the presentation is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.
2. Hello!
Who am I?
Yahoo!
› Grid Team (CCDI)
› Lead the Apache Hadoop Map-Reduce Development Team
Apache
› Developer on Apache Hadoop since April 2006
› Committer
› Member of Apache Hadoop PMC
2 8/18/10
3. Apache Hadoop
The Software
Hadoop Distributed File System
Hadoop Map-Reduce
Open source from Apache
Written in Java
Runs on
› Linux, Solaris, Mac OS X
› Commodity hardware
4. Storage
HDFS
Designed to store large files
Stores files as large blocks (64 to 128 MB)
Each block stored on multiple servers
Data is automatically re-replicated on need
Accessed from command line, Java API or C API
5. Data Processing
Hadoop Map-Reduce
Map-Reduce is a programming model for efficient distributed computing
Efficiency from
› Streaming through data, reducing seeks
› Pipelining
A good fit for a lot of applications
› Log processing
› Web index building
6. Hadoop in the Enterprise
Usage and Importance
A large number of corporations use Apache Hadoop at scale for several business-critical applications
› Large, shared, multi-tenant deployments to minimize fragmentation across organizations
Millions of dollars at stake!
› Yahoo
• Advertising, Search
• 40,000 machines and counting
http://wiki.apache.org/hadoop/PoweredBy
7. Hadoop in the Enterprise
… however
Hadoop isn’t a silver bullet (at least as yet!)
› Hadoop still depends on users to utilize it effectively
› Pig/Hive help, but one can still write badly suited queries
Need to adapt legacy applications to Hadoop, especially the Map-Reduce paradigm
Efficient usage of Hadoop clusters is critical to getting return on the investment
8. Hadoop Map-Reduce
Overview
It works like a Unix pipeline:
› cat input | grep | sort | uniq -c | cat > output
› Input | Map | Shuffle & Sort | Reduce | Output
Works on key/value pairs
› map <k1, v1> -> list(<k2, v2>)
› reduce <k2, list(v2)> -> list(<k3, v3>)
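The key/value flow above can be illustrated with a minimal in-memory sketch in plain Python. It is not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are hypothetical names chosen for illustration, and the shuffle is simulated with a dictionary that groups values by key before the reduce step.

```python
from collections import defaultdict

def map_fn(_offset, line):
    # Map step: one input record in, zero or more (word, 1) pairs out.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce step: all values for one key in, one (word, total) pair out.
    return [(word, sum(counts))]

def run_job(records):
    # Simulated shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Simulated sort: reduces see keys in sorted order.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

result = run_job([(0, "a b a"), (6, "b c")])
```

Here `result` holds one `(word, count)` pair per distinct word, ordered by key, which mirrors the `sort | uniq -c` stage of the pipeline analogy.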
9. Best Practices
Input to Applications
Optimized to process large data-sets
Pattern: Coalesce processing of multiple small input files into a smaller number of maps, and use larger HDFS block-sizes for processing very large data-sets.
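As a rough illustration of coalescing, the sketch below greedily packs many small files into a smaller number of map inputs. It is plain Python, not a Hadoop API: `coalesce_files` and the 512 MB target are assumptions made for the example. In Hadoop itself, an input format such as `CombineFileInputFormat` is intended for this kind of grouping.

```python
MB = 1024 * 1024

def coalesce_files(file_sizes, target_split_bytes):
    """Greedily group (name, size) pairs into splits of roughly the target size."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current split if adding this file would exceed the target.
        if current and current_size + size > target_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 1,000 hypothetical files of 10 MB each, packed into ~512 MB splits.
files = [(f"part-{i:05d}", 10 * MB) for i in range(1000)]
splits = coalesce_files(files, 512 * MB)
```

With one map per file this input would need 1,000 maps; with the grouping above it needs `len(splits)` maps, each reading about half a gigabyte.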
10. Best Practices
Map-Reduce - Mappers
Process multiple-files per map for jobs with very large number of small input files
Process large chunks of data per-map for large-scale data-processing
› PetaSort – 66,000 maps with 12.5G per map
Pattern: Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.
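The effect of per-map input size on the number of maps is simple arithmetic, sketched below. The 825 TB total is an assumption derived by multiplying the slide's PetaSort figures (66,000 maps at 12.5 GB each) using decimal units; `estimate_maps` is a hypothetical helper.

```python
MB = 10**6
GB = 10**9
TB = 10**12

def estimate_maps(total_bytes, bytes_per_map):
    # Integer ceiling division: a partial chunk still needs its own map.
    return (total_bytes + bytes_per_map - 1) // bytes_per_map

total = 825 * TB  # assumed: 66,000 maps x 12.5 GB per map

maps_small = estimate_maps(total, 128 * MB)     # one map per 128 MB
maps_large = estimate_maps(total, 12_500 * MB)  # one map per 12.5 GB
```

Under these assumptions, 128 MB per map would need several million maps, while 12.5 GB per map stays within the 60,000-70,000 guideline.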
11. Best Practices
Map-Reduce - Mappers
The shuffle cross-bar (maps * reduces) is a key performance factor
Pattern: Applications should use as few maps as possible to process data in parallel, without creating really bad failure-recovery cases.
› Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application
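To make the cross-bar concrete, the sketch below counts map-to-reduce transfers and the average size of each transferred segment. The job sizes (10 TB of intermediate data, 2,000 reduces) are invented for illustration, and the helper names are hypothetical.

```python
TB = 10**12

def shuffle_crossbar(num_maps, num_reduces):
    # Every reduce fetches one segment from every map.
    return num_maps * num_reduces

def avg_segment_bytes(intermediate_bytes, num_maps, num_reduces):
    # Average size of a single map-to-reduce transfer.
    return intermediate_bytes // shuffle_crossbar(num_maps, num_reduces)

intermediate = 10 * TB  # assumed size of the map outputs

many_maps = shuffle_crossbar(60_000, 2_000)
fewer_maps = shuffle_crossbar(15_000, 2_000)

small_segments = avg_segment_bytes(intermediate, 60_000, 2_000)
larger_segments = avg_segment_bytes(intermediate, 15_000, 2_000)
```

For the same data, a quarter of the maps means a quarter of the transfers, each about four times larger, which is the motivation for keeping the map count down.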
12. Best Practices
Map-Reduce – Combiner and Shuffle
Combiner
› Map-side aggregation to help reduce network traffic for the shuffle
› Cost of using combiners
Shuffle
› Compression of intermediate output
Pattern: Use combiners judiciously, ensure they really work! Compress intermediate outputs.
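One way to check that a combiner "really works" is to compare the number of records going into it with the number coming out. The sketch below does this in plain Python for a word-count style aggregation; the 20% threshold in `combiner_is_effective` is an arbitrary assumption, not a figure from the talk.

```python
from collections import Counter

def map_words(line):
    # Map output: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Map-side aggregation: sum values per key before the shuffle.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def combiner_is_effective(records_in, records_out, min_reduction=0.2):
    # Hypothetical check: the combiner should shrink the record count
    # enough to pay for the extra work it does on the map side.
    if records_in == 0:
        return False
    return 1 - records_out / records_in >= min_reduction

raw = map_words("to be or not to be")
combined = combine(raw)
```

If the keys in a map's output rarely repeat, `combine` returns almost as many records as it received, and the combiner adds cost without reducing shuffle traffic.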
13. Best Practices
Map-Reduce – Reducers
Efficiency depends on shuffle, and the cross-bar
Configure appropriate number of reduces
› Too few reduces hurt the nodes
› Too many hurt the cross-bar
Pattern: Applications should ensure that each reduce processes at least 1-2 GB of data, and at most 5-10 GB of data, in most scenarios.
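The guideline can be turned into a quick sizing calculation. In the sketch below, the 4 GB per-reduce target is an assumption picked from between the stated bounds, and both helper names are hypothetical.

```python
GB = 1024 ** 3

def suggest_reduces(shuffle_bytes, target_per_reduce=4 * GB):
    # Ceiling division, with at least one reduce.
    return max(1, -(-shuffle_bytes // target_per_reduce))

def within_guideline(shuffle_bytes, num_reduces, low=1 * GB, high=10 * GB):
    # True when the average data per reduce falls inside the stated range.
    per_reduce = shuffle_bytes / num_reduces
    return low <= per_reduce <= high

shuffle = 2048 * GB  # hypothetical job shuffling 2 TB
reduces = suggest_reduces(shuffle)
```

This uses averages only; skewed keys can still leave individual reduces well outside the range.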
14. Best Practices
Map-Reduce – Output
Number of output artifacts is linear w.r.t. number of configured reduces
Compress outputs
Use appropriate file-formats for the output
› E.g. compressed text-files are not a great idea if you aren’t using a splittable codec
Think of the consumer of your data-set!
Consider using larger HDFS block-sizes.
Pattern: Application outputs should be a few large files, with each file spanning multiple HDFS blocks and appropriately compressed.
15. Best Practices
Map-Reduce – Distributed Cache
Efficient distribution of read-only files for applications
Designed for small number of mid-sized files
Pattern: Applications should ensure that artifacts in the distributed-cache do not require more i/o than the actual input to the application tasks.
16. Best Practices
Map-Reduce – Counters
Global (across all tasks) counters, aggregated by the framework
Expensive!
Pattern: Applications should not use more than 10, 15 or 25 custom counters.
17. Best Practices
Map-Reduce – Total Order Outputs
Sampling Partitioner
› Do not use a single reducer!
› E.g. Terasort/Petasort benchmarks
Joining fully sorted data-sets
› Do not need same cardinality e.g. number of buckets for the data-sets being joined
Pattern: Use a sampling partitioner, not a single reducer, to produce totally ordered output.
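The idea behind a sampling partitioner can be sketched in plain Python: sample the keys, pick split points from the sorted sample, and route each key to the reduce whose range contains it. This is an illustration only, not Hadoop's implementation; the function names, sample size and number of reduces are assumptions.

```python
import bisect
import random

def choose_split_points(sample_keys, num_reduces):
    # Pick num_reduces - 1 evenly spaced keys from the sorted sample.
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reduces
    return [ordered[int(step * i)] for i in range(1, num_reduces)]

def partition(key, split_points):
    # Keys below the first split point go to reduce 0, and so on.
    return bisect.bisect_right(split_points, key)

rng = random.Random(0)
keys = [rng.randrange(1_000_000) for _ in range(10_000)]
sample = rng.sample(keys, 500)

num_reduces = 4
split_points = choose_split_points(sample, num_reduces)

# Each "reduce" sorts only its own bucket.
buckets = [[] for _ in range(num_reduces)]
for key in keys:
    buckets[partition(key, split_points)].append(key)

# Concatenating the per-reduce outputs in order gives a total order.
total_order = [key for bucket in buckets for key in sorted(bucket)]
```

Because the partition function is monotonic in the key, every key in one bucket is no larger than any key in the next, so the sort is spread across several reduces instead of being funnelled through one.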
18. Best Practices
HDFS – NameNode and JobTracker Operations
NameNode: Please don’t hurt me!
› Not yet a silver bullet…
› Do not perform metadata operations for map/reduce tasks at the backend
Do not contact the JobTracker for cluster statistics etc. from the backend
Pattern: Applications should not perform any metadata operations on the file-system from the backend; these should be confined to the job-client during job-submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.
19. Best Practices
Map-Reduce – Logs and Web-UI
Tasks’ stdout/stderr stored on TaskTrackers
› Limit amount of logs
JobTracker/NameNode Web-UI
› Do not screen-scrape!
20. Best Practices
Oozie – Workflows
Production pipelines are run via Oozie
Ensure workflows have small number of medium-to-large sized Map-Reduce jobs
› Collapse smaller jobs
Pattern: A single Map-Reduce job in a workflow should process at least a few tens of GB of data.
21. Anti-Patterns
In a large enough cluster, you see any and all of these…
Applications not using a higher-level interface such as Pig/Hive
Processing thousands of small files (sized less than 1 HDFS block, typically 128MB) with one map processing a single small file.
Processing very large data-sets with small HDFS block size i.e. 128MB resulting in tens of thousands of maps.
Applications with a large number (thousands) of maps with a very small runtime (e.g. 5s).
Straight-forward aggregations without the use of the Combiner.
Applications with greater than 60,000-70,000 maps.
Applications processing large data-sets with very few reduces (e.g. 1).
› Pig scripts processing large data-sets without using the PARALLEL keyword
› Applications using a single reduce for total-order among the output records
22. Anti-Patterns
Applications processing data with a very large number of reduces, such that each reduce processes less than 1-2GB of data.
Applications writing out multiple, small, output files from each reduce.
Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).
Applications using tens or hundreds of counters per task.
Applications performing metadata operations (e.g. listStatus) on the file-system from the map/reduce tasks.
Applications doing screen scraping of JobTracker web-ui for status of queues/jobs or worse, job-history of completed jobs.
Workflows comprising hundreds or thousands of small jobs processing small amounts of data.
Work underway in yahoo-hadoop-0.20.200 to prevent anti-patterns
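Several of the anti-patterns listed above are numeric and can be checked mechanically. The sketch below is a hypothetical checker in plain Python; it is not the work referred to in yahoo-hadoop-0.20.200. The dictionary field names are invented, and the thresholds are taken from the slides where they give one (70,000 maps, 5 s maps, 1 GB per reduce, 25 counters); the 10 GB cut-off for "large data-sets" is an assumption.

```python
GB = 1024 ** 3

def find_anti_patterns(job):
    """Return a list of warnings for a dict of (hypothetical) job statistics."""
    findings = []
    if job["num_maps"] > 70_000:
        findings.append("too many maps")
    if job["num_maps"] >= 1_000 and job["avg_map_seconds"] <= 5:
        findings.append("many very short maps")
    per_reduce = job["shuffle_bytes"] / max(1, job["num_reduces"])
    if job["num_reduces"] > 1 and per_reduce < 1 * GB:
        findings.append("too many reduces")
    # Assumed cut-off: more than 10 GB through a single reduce is "large".
    if job["num_reduces"] <= 1 and job["shuffle_bytes"] > 10 * GB:
        findings.append("too few reduces")
    if job["custom_counters"] > 25:
        findings.append("too many counters")
    return findings

bad_job = {
    "num_maps": 100_000, "avg_map_seconds": 4,
    "shuffle_bytes": 500 * GB, "num_reduces": 1, "custom_counters": 120,
}
good_job = {
    "num_maps": 5_000, "avg_map_seconds": 120,
    "shuffle_bytes": 500 * GB, "num_reduces": 200, "custom_counters": 8,
}

bad_findings = find_anti_patterns(bad_job)
good_findings = find_anti_patterns(good_job)
```

A check like this only covers what can be read from job statistics; anti-patterns such as metadata operations from tasks or screen-scraping the web UI need other means of detection.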