3. MapReduce Summarization Patterns
• Numerical Summarizations
  • General counting of data set records
  • Groups records by a custom key and calculates numerical values per group
• Known Uses
  • Word count, record count, min/max/count, average/median/standard deviation
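The pattern can be sketched without Hadoop as a map step that emits (key, value) pairs and a reduce step that summarizes each group. This is a minimal, framework-free Python illustration; the record layout (user, comment length) and the function names are assumptions made for the example, not part of the Ctrending code.

```python
from collections import defaultdict
from statistics import mean, median

def map_record(record):
    """Mapper: emit (group key, numeric value) for one input record."""
    user, comment_length = record
    yield user, comment_length

def reduce_group(key, values):
    """Reducer: compute the summary for all values that share a key."""
    return key, {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
        "median": median(values),
    }

def run_job(records):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in sorted(groups.items()))

# Hypothetical input: (user, comment length)
records = [("ana", 120), ("bob", 40), ("ana", 80), ("ana", 100)]
summary = run_job(records)
```

Count, min, max and sum can be pre-aggregated by a combiner; median and standard deviation need all values of a group (or a smarter intermediate representation) at the reducer.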
Nokia Internal Use Only
4. MapReduce Summarization Patterns
• Inverted Index
  • Indexes a large data set by keyword
  • The Mapper emits (keyword, id) pairs and the framework handles most of the work
  • May use IdentityReducer
  • Should benefit from a custom Partitioner for load balancing
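As a rough illustration of why the framework "handles most of the work", the sketch below builds an inverted index in plain Python: the mapper only emits (keyword, id) pairs, and grouping by key already yields the index. The sample documents and function names are hypothetical.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Mapper: emit one (keyword, doc_id) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id

def build_index(documents):
    """Stand-in for shuffle/sort plus an identity-style reducer:
    the grouping itself produces the index, so the reducer only
    has to write out the ids collected for each keyword."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, emitted_id in map_document(doc_id, text):
            index[keyword].append(emitted_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}

# Hypothetical documents keyed by id
documents = {1: "Blue Monday", 2: "Blue Suede Shoes", 3: "Monday Monday"}
index = build_index(documents)
```

Very common keywords produce very long id lists, which is where a partitioner that spreads hot keys across reducers would help.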
5. MapReduce Summarization Patterns
• Counting with Counters
  • Leverages the MapReduce framework's counters.
  • Counters are held in memory locally by each Mapper, then aggregated by the framework.
  • Gives better performance, but a job should not define more than a few tens of counters.
• Known Uses
  • Counting records, counting a small number of groups, summations
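The idea can be simulated in a few lines of Python: each map task increments its own in-memory counters and emits no key/value output, and a stand-in for the framework sums the per-task counters at the end. The counter names and the `lang` field are assumptions for the example.

```python
from collections import Counter

def run_map_task(records):
    """One map task: bump in-memory counters, emit no key/value output."""
    counters = Counter()
    for record in records:
        counters["RECORDS"] += 1
        # One counter per group; only viable for a small, known set of groups.
        counters["LANG_" + record["lang"].upper()] += 1
    return counters

def aggregate(task_counters):
    """Stand-in for the framework summing the counters of all tasks."""
    total = Counter()
    for counters in task_counters:
        total.update(counters)
    return total

# Two hypothetical input splits, one per map task
split_a = [{"lang": "en"}, {"lang": "pt"}, {"lang": "en"}]
split_b = [{"lang": "pt"}, {"lang": "en"}]
totals = aggregate(run_map_task(split) for split in (split_a, split_b))
```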
6. MapReduce Coding Best Practices
• Define Output Values
  • Create custom classes implementing Writable to be used as Mapper output;
  • This gives cleaner Mapper code and avoids String parsing on the Reducer side;
• Avoid Local Object Creation
  • The map and reduce methods are invoked inside very large loops;
  • Creating local objects inside map or reduce allocates a huge number of objects in the Eden space of the JVM heap's Young Generation;
  • Reuse instance-level objects to decrease Young GC activity;
• Use Combiners on Counting Summarizations
  • Combiners reduce bandwidth consumption by aggregating locally on the mapper's node, before the mapper output goes through the shuffle and sort phase and is made available to the reducers
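The effect of a combiner on a counting job can be shown with a small simulation. The Python sketch below is framework-free and uses made-up input splits; it compares how many (word, count) pairs would cross the network with and without local aggregation, and checks that the reducer result is unchanged.

```python
from collections import Counter, defaultdict

def map_split(lines):
    """Mapper: emit (word, 1) for every word in the split."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: sum counts locally, on the mapper's node, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def reduce_all(shuffled_pairs):
    """Reducer: sum the (possibly pre-aggregated) counts per word."""
    totals = defaultdict(int)
    for word, count in shuffled_pairs:
        totals[word] += count
    return dict(totals)

# Two hypothetical input splits, one per map task
splits = [["la la la", "la di da"], ["da da la"]]

without_combiner = [pair for split in splits for pair in map_split(split)]
with_combiner = [pair for split in splits for pair in combine(map_split(split))]
```

Here 9 pairs shrink to 5 before the shuffle. This only works because summation is associative and commutative, so applying it once locally and again in the reducer gives the same totals.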
7. Ctrending MR Performance Evaluation
• Ctrending MR Execution Summary
  • Total MR jobs running: 8
  • Average number of tweets processed: 2.2 million
  • Tweets identified as music related: 10.5%
  • Total execution time: 2 hours and 20 minutes
• Slowest MapReduce jobs:
  • Tweets Counter: 46 minutes
  • Nokia Entity Id Join: 1 hour and 10 minutes
8. Ctrending MR Code Profiling
• Mainly applied to the Nokia Id Join Mapper
• Used the MapReduce framework's Counters to collect execution time metrics
• Also used Counters to total the entity ids found in the Nokia Id Join mapper
• Needed to create static fields in the search strategy implementations to collect the execution time metric
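One way to picture this kind of counter-based profiling is a wrapper that times every search call and accumulates the results in counters. The Python sketch below is only an illustration: `TimedSearch`, the counter names and the cache content are hypothetical, and the real job would use the framework's Counters rather than a `Counter` object.

```python
import time
from collections import Counter

class TimedSearch:
    """Wraps a search function and records calls, hits and elapsed time in counters."""

    def __init__(self, name, search, counters, clock=time.perf_counter):
        self.name = name
        self.search = search
        self.counters = counters
        self.clock = clock

    def find(self, term):
        start = self.clock()
        result = self.search(term)
        elapsed_ms = int((self.clock() - start) * 1000)
        # Counters hold integers, so elapsed time is accumulated in milliseconds.
        self.counters[self.name + "_MILLIS"] += elapsed_ms
        self.counters[self.name + "_CALLS"] += 1
        if result is not None:
            self.counters[self.name + "_FOUND"] += 1
        return result

counters = Counter()
cache = {"adele": "artist-1"}  # hypothetical cache content
cache_search = TimedSearch("CACHE", cache.get, counters)
for term in ["adele", "unknown band"]:
    cache_search.find(term)
```

Comparing the `_MILLIS` and `_FOUND` totals of each strategy after a run shows both where the time goes and how much each strategy contributes.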
10. Ctrending MR Code Tuning
• Tuning Tweets Count MapReduce
  • Applied IntSumReducer as the combiner.
  • Adjusted the HBase Scan to fetch and copy records in batches of thousands, to optimize network usage between nodes.
  • Also set blockCache to false, because this table is always read sequentially in a single pass.
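A back-of-the-envelope calculation shows why the fetch size matters. In the HBase Java client these two settings correspond to `Scan.setCaching(...)` and `Scan.setCacheBlocks(false)`. The Python sketch below only does the arithmetic; it assumes one row per tweet and uses a fetch size of 1 as an illustrative baseline, neither of which is stated in the slides.

```python
def scan_round_trips(total_rows, rows_per_fetch):
    """Number of client/server round trips needed to scan total_rows rows."""
    return -(-total_rows // rows_per_fetch)  # integer ceiling division

rows = 2_200_000  # average tweets per run, from the execution summary

one_by_one = scan_round_trips(rows, 1)       # assumed baseline: one row per trip
in_thousands = scan_round_trips(rows, 1000)  # rows fetched in blocks of 1000
```

Under these assumptions the number of round trips drops from 2.2 million to 2,200, at the cost of more memory per fetch on both client and server.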
11. Ctrending MR Code Tuning
• Tuning Entity Id Search MapReduce
  • Removed unnecessary split/indexOf calls
  • Removed redundant object creation from the map method
12. Ctrending MR Code Tuning
• Tuning Entity Id Search MapReduce
  • Profiling results show that NMS Search is the bottleneck
    • It accounts for more than 90% of all MapReduce execution time
  • They also show that NMS Search does not add enough value
    • It finds only 4% of artist ids not in the cache
    • It finds only 3% of track ids not in the cache
  • This drove the decision to remove NMS Search by simply referencing the CustomCache ISearchStrategy implementation in the Mapper setup method
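The change amounts to swapping one search strategy for another at task setup. A minimal Python sketch of that design follows; the class names, method names and cache content are hypothetical stand-ins for the ISearchStrategy implementations, whose real code the slides do not show.

```python
class CacheSearch:
    """Hypothetical counterpart of the CustomCache strategy: local lookups only."""

    def __init__(self, cache):
        self.cache = cache

    def find_id(self, name):
        return self.cache.get(name.lower())

class RemoteSearch:
    """Hypothetical counterpart of the NMS strategy: cache first, then a remote call."""

    def __init__(self, cache, remote_lookup):
        self.cache = cache
        self.remote_lookup = remote_lookup

    def find_id(self, name):
        return self.cache.get(name.lower()) or self.remote_lookup(name)

class EntityIdMapper:
    def setup(self, strategy):
        # The strategy is chosen once per task, not once per record.
        self.strategy = strategy

    def map(self, tweet_id, entity_name):
        entity_id = self.strategy.find_id(entity_name)
        if entity_id is not None:
            yield tweet_id, entity_id

mapper = EntityIdMapper()
mapper.setup(CacheSearch({"adele": "artist-1"}))  # cache-only, no remote calls
found = list(mapper.map(1, "Adele"))
missing = list(mapper.map(2, "Unknown Band"))
```

Because the mapper depends only on the strategy interface, removing the remote lookup is a one-line change in setup and leaves the map method untouched.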
13. Hbase Configuration Tuning
• The Artists and Tracks cache is an inverted index structure stored in HBase tables.
• These tables see a high level of random access to their records (Get operations) while the Entity Id Search MapReduce searches the cache.
• Performance could be improved if the cache tables' blocks were kept in the RegionServer's memory.
• HBase provides a schema property (IN_MEMORY, set per column family) that raises the priority of a table's blocks in the RegionServer's block cache.
14. Hbase Configuration Tuning
• Additional configuration is required on the HBase RegionServer so that blocks can stay cached most of the time.
  • hbase.regionserver.global.memstore.upperLimit -> maximum % of heap that memstores may use for writes before put operations are flushed to disk files.
  • hbase.regionserver.global.memstore.lowerLimit -> minimum % of heap available for writing in memstores. Forced flushes free memstore space until usage drops to this limit.
  • hfile.block.cache.size -> % of heap used to store blocks in memory
15. Hbase Configuration Tuning
• Most Ctrending HBase put operations are done in batch jobs (Twitter Crawler).
• The music entities cache requires many Get operations while EntityIdSearchMR is executing.
• Simply marking the cache tables as in-memory does not work if there is not enough memory available.
• More memory can be made available for cache table blocks on the RegionServers by decreasing the % of heap reserved for memstores and increasing it for the block cache.
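The trade-off can be made concrete with a small heap budget calculation. The Python sketch below is illustrative only: the 8 GB heap and the fractions are made-up values, not the settings used for Ctrending. The 0.8 ceiling reflects the HBase startup check that the memstore upper limit plus the block cache size must leave at least 20% of the heap free; verify that against the HBase version in use.

```python
def heap_split(heap_mb, memstore_upper, block_cache):
    """Return (memstore_mb, block_cache_mb, remaining_mb) for a RegionServer heap."""
    if memstore_upper + block_cache > 0.8:
        raise ValueError("memstore + block cache fractions must not exceed 0.8")
    memstore_mb = round(heap_mb * memstore_upper)
    cache_mb = round(heap_mb * block_cache)
    return memstore_mb, cache_mb, heap_mb - memstore_mb - cache_mb

# Illustrative values only
write_heavy = heap_split(8192, memstore_upper=0.40, block_cache=0.25)
read_heavy = heap_split(8192, memstore_upper=0.25, block_cache=0.50)
```

With these example numbers the block cache grows from 2048 MB to 4096 MB, which is the kind of headroom the in-memory cache tables need for Get-heavy workloads.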
16. Ctrending MR Tuning Results
• TweetsCountMR
  • Total execution time before tuning: 46 minutes (average)
  • Total execution time after tuning: 20 minutes (average)
• EntityIdSearchMR
  • Total execution time before tuning: 1 hour and 10 minutes (average)
  • Total execution time after tuning: 6 minutes (average)
• CONCLUSION: Never perform HTTP requests inside MapReduce jobs again!!!
17. Refactoring
• Write a batch process that reads the generated rankings and sends requests to NMS for the music entities whose id was not found.
  • This is better implemented as a standalone multi-threaded Java process than as a MapReduce job
  • As the input file is small (the filtered rank), Hadoop's default InputFormat implementations will not split it into many Map tasks.
  • Unless a custom InputFormat is implemented, a MapReduce job for this will probably take a long time to execute, as it will end up with a single Map task requesting NMS for all unknown ids
• Optimize heap usage in the other MRs by avoiding object creation in map methods.
• Improve code quality (and even performance) by defining OutputValues for the Trending MRs
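The proposed standalone batch process could follow the shape below. The slides suggest Java; this sketch uses Python's thread pool only to show the structure, and `fake_nms_lookup` is a placeholder for the real NMS request, which the source does not show.

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_unknown_ids(names, lookup, max_workers=8):
    """Call lookup(name) for every name on a thread pool; keep only the hits."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and results line up.
        results = list(pool.map(lookup, names))
    return {name: found for name, found in zip(names, results) if found is not None}

def fake_nms_lookup(name):
    """Placeholder for the real NMS request (hypothetical data)."""
    known = {"adele": "artist-1", "hello": "track-7"}
    return known.get(name.lower())

resolved = resolve_unknown_ids(["Adele", "Unknown Band", "Hello"], fake_nms_lookup)
```

Threads suit this job because each request spends most of its time waiting on the network, and the small input means there is nothing for Hadoop to parallelize across map tasks.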