Video and slides synchronized; mp3 and slide download available at http://bit.ly/152RfbB.
Michael Kopp explains how to run performant code at scale with Hadoop and how to analyze and optimize Hadoop jobs. Filmed at qconnewyork.com.
Michael Kopp has over 12 years of experience as an architect and developer. He is a technology strategist in Compuware APM's center of excellence, where he focuses on the architecture and performance of cloud and big data environments. In this role he drives the dynaTrace Enterprise product strategy and works closely with key customers implementing APM in these environments.
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/optimize-hadoop-jobs
3. Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
6. Effectiveness vs. Efficiency
• Effective: Adequate to accomplish a purpose; producing the intended or expected result [1]
• Efficient: Performing or functioning in the best possible manner with the least waste of time and effort [1]
…and resources
1) http://www.dailyblogtips.com/effective-vs-efficient-difference/
7. An Efficient Hadoop Cluster
• Is Effective – gets the job done (in time)
• Highly Utilized when Active (unused resources are wasted resources)
8. What is an efficient Hadoop Job?
…efficiency is a measurable concept, quantitatively determined by the ratio of output to input…
• same output in less time
• less resource usage with same output and same time
• more output with same resources in the same time
Efficient jobs are effective without adding more hardware!
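The output/input ratio above can be sketched numerically. The job statistics below are hypothetical, and records-per-CPU-second is just one possible way to express the ratio:

```python
# Illustrative only: "efficiency" as the slide's ratio of output to input,
# here measured as records produced per CPU-second consumed.
def efficiency(output_records, cpu_seconds):
    return output_records / cpu_seconds

# Same output, fewer resources -> the second run is the more efficient one.
run_a = efficiency(output_records=10_000_000, cpu_seconds=2_000)  # baseline
run_b = efficiency(output_records=10_000_000, cpu_seconds=1_250)  # after tuning
print(run_a, run_b)  # 5000.0 8000.0
```

The same comparison works for any of the three bullets above: hold two of output, time, and resources fixed and the ratio tells you whether the third improved.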
14. Pushing the Boundaries – High Utilization
• Figure out Spill and Shuffle Bottlenecks
• Remove Idle Times, Wait Times, Sync Times
• Hotspot Analysis Tools can pinpoint those Items quickly
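A back-of-envelope sketch of why spills bottleneck mappers (a simplified model, not Hadoop's actual spill accounting; the function name and numbers are assumptions): a map task spills to disk each time its in-memory sort buffer fills to the spill threshold, so a larger buffer means fewer spill files to merge later:

```python
import math

def estimated_spills(map_output_mb, sort_buffer_mb=100, spill_percent=0.8):
    """Rough spill count: one spill each time the sort buffer
    reaches its spill threshold (simplified model)."""
    per_spill_mb = sort_buffer_mb * spill_percent
    return math.ceil(map_output_mb / per_spill_mb)

# A mapper emitting 1 GB through a 100 MB buffer spills ~13 times;
# a 512 MB buffer cuts that to ~3 spill files to merge.
print(estimated_spills(1024))                      # 13
print(estimated_spills(1024, sort_buffer_mb=512))  # 3
```

Hotspot analysis tools show the merge and shuffle time directly; the model above only explains why the time grows with the number of spills.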
21. Performance Optimization
1. Identify Bounding Resource
2. Optimize and reduce its usage
3. Identify new Bounding Resource
Hot Spot Analysis Tools are again the best way to go
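The three-step loop above can be sketched as follows; the resource names and utilization figures are illustrative assumptions, not measurements from the talk:

```python
# Illustrative only: pick the most-utilized resource, optimize it,
# re-measure -- the bottleneck moves, so the loop repeats.
utilization = {"cpu": 0.95, "disk_io": 0.70, "network": 0.40}

def bounding_resource(util):
    # Step 1: the bounding resource is the one closest to saturation.
    return max(util, key=util.get)

first = bounding_resource(utilization)   # 'cpu' bounds the job first
utilization[first] = 0.50                # Step 2: assume tuning halved CPU use
second = bounding_resource(utilization)  # Step 3: now disk I/O bounds the job
print(first, second)
```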
30. Map Reduce Run Comparison
[Chart comparing two runs: 10% of mapping CPU; reducers running; 3 reducers running]
31. Conclusion
• Understand your bottleneck!
• Understand the bounding resource
• Small fixes can have huge yields…but they require tools
32. What else did we find?
• Short mappers due to small files
– High merge time due to a large number of spills
– Too much shuffle data → add a Combiner, but…
• Tried task reuse
– Nearly no effect?
– 5% less map time, but…?
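Why a Combiner cuts shuffle volume can be shown with a pure-Python word-count sketch (a simulation of the idea, not Hadoop's Combiner API): the combiner pre-aggregates each mapper's output before the records cross the network:

```python
from collections import Counter

# One mapper's raw output: a (word, 1) pair per occurrence.
map_output = ["the", "quick", "the", "fox", "the", "quick"]

# Without a combiner, every (word, 1) pair is shuffled to the reducers.
shuffled_without = len(map_output)

# With a combiner, the mapper ships one (word, partial_count) pair per key.
combined = Counter(map_output)
shuffled_with = len(combined)

print(shuffled_without, shuffled_with)  # 6 records shuffled vs 3
```

The win depends on how many duplicate keys each mapper emits, which is one reason a Combiner alone may not help the short-mapper case above.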
33. Why did the reuse not help?
[Chart: map phase over; 5 more reducers; shuffle]