Your natural partner to develop innovative solutions

Nokia Institute of Technology

Nokia Internal Use Only
Agenda
• MapReduce Summarization Patterns
• MapReduce Coding Best Practices
• Ctrending MR Performance Evaluation
  • Ctrending MR Execution Summary
  • Code Profiling
  • Profiling Results
  • Code Tuning
  • HBase Configuration Tuning
  • Tuning Results
  • Refactoring Proposal

MapReduce Summarization Patterns
• Numerical Summarizations
  • General counting of data set records
  • Groups records by a custom key, calculating numerical values per group
  • Known Uses
    • Word count, record count, min/max count, avg/median/standard deviation
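
The pattern can be sketched in plain Python, with a dictionary standing in for the framework's shuffle. This is only an illustration, not Hadoop code: the function names and the (user, tweet length) input records are assumptions made for the example.

```python
from collections import defaultdict

def map_record(record):
    """Mapper: emit (group key, numeric value) for one input record."""
    user, length = record
    yield user, length

def reduce_group(key, values):
    """Reducer: compute count, min, max and average for one group."""
    values = list(values)
    return key, {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

def run_job(records):
    # Stand-in for the framework's shuffle: group mapper output by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_group(k, v) for k, v in groups.items())

# Hypothetical input: (user, tweet length) pairs.
records = [("ana", 10), ("bob", 40), ("ana", 30), ("bob", 20), ("ana", 50)]
summary = run_job(records)
```

Each group's statistics are computed independently, which is what lets the framework spread the groups across reducers.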

MapReduce Summarization Patterns
• Inverted Index
  • Indexes a large data set by keyword
  • The Mapper emits keyword/ID pairs and the framework handles most of the work
  • May use IdentityReducer
  • Should benefit from a custom Partitioner for load balancing
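
A minimal Python sketch of the idea, assuming documents arrive as an id-to-text mapping (the sample data and function names are made up for illustration). The grouping step stands in for shuffle and sort, which is why the reducer can be an identity step.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Mapper: emit one (keyword, doc_id) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id

def build_index(documents):
    # Stand-in for shuffle and sort: collect ids under each keyword.
    # Nothing is left for the reducer to compute.
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, emitted_id in map_document(doc_id, text):
            index[keyword].append(emitted_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}

# Hypothetical documents keyed by id.
documents = {1: "Hey Jude", 2: "hey ya", 3: "Jude Law"}
index = build_index(documents)
```

Frequent keywords produce long id lists, which is the skew a Partitioner would help spread across reducers.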

MapReduce Summarization Patterns
• Counting with Counters
  • Leverages the MapReduce framework's counters.
  • Counters are stored in memory locally on each Mapper, then aggregated by the framework.
  • Better performance; however, the number of counters defined should not exceed a few tens.
• Known Uses
  • Count number of records, count a small number of groups, summations
MapReduce Coding Best Practices
• Define Output Values
  • Create custom classes implementing Writable to be used as Mapper output;
  • This gives cleaner Mapper code and avoids String parsing on the Reducer side;
• Avoid Local Object Creation
  • The map and reduce methods are invoked in very large loops;
  • Creating local objects inside map or reduce leads to a huge number of objects being allocated in the Eden space of the JVM heap's Young Generation;
  • Reuse instance-level objects to decrease Young GC activity;
• Use Combiners on Counting Summarizations
  • Combiners reduce bandwidth consumption, as they apply aggregations locally on the mapper's node, before the mapper output is sent to the shuffle and sort phase and made available to reducers
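
The bandwidth effect of a combiner can be shown with a small word-count simulation in Python. It is a sketch under simplifying assumptions (two hand-made input splits, lists standing in for shuffled records), not Hadoop code.

```python
from collections import Counter

def map_split(words):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in words]

def combine(pairs):
    """Combiner: sum counts locally, before anything crosses the network."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_all(pairs):
    """Reducer: sum the counts received for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input splits, one per mapper.
splits = [["a", "b", "a", "a"], ["b", "b", "c"]]
without_combiner = [p for s in splits for p in map_split(s)]
with_combiner = [p for s in splits for p in combine(map_split(s))]
```

Here 7 records would be shuffled without the combiner and 4 with it, while the reducer produces the same totals either way. This works because summation is associative and commutative.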

Ctrending MR Performance Evaluation
• Ctrending MR Execution Summary
  • Total MR jobs running: 8
  • Average number of tweets processed: 2.2 million
  • Tweets identified as music related: 10.5%
  • Total execution time: 2 hours and 20 minutes
  • Slowest MapReduce jobs:
    • Tweets Counter: 46 minutes
    • Nokia Entity Id Join: 1 hour and 10 minutes

Ctrending MR Code Profiling
• Mainly applied to the Nokia Id Join Mapper
• Added MapReduce framework Counters to collect execution time metrics
• Also used Counters to sum the total number of entity IDs found in the Nokia Id Join mapper
• Needed to create static fields in the search strategy implementations to collect the execution time metrics
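
A rough Python sketch of this profiling approach: a shared counter object stands in for the framework's counters, and a small wrapper adds the elapsed time of each call to a named counter. The lookup function, its sleep and the TOTAL_NOT_IN_CACHE name are hypothetical; only the idea of accumulating time and hit counts in counters comes from the slide.

```python
import time
from collections import Counter

counters = Counter()  # stand-in for the framework's job counters

def timed(counter_name, func, *args):
    """Run func and add its elapsed milliseconds to a named counter."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        counters[counter_name] += int((time.perf_counter() - start) * 1000)

def search_cache(name):
    # Hypothetical lookup; a real mapper would query the cache table here.
    time.sleep(0.01)
    return {"hey jude": 42}.get(name)

def map_record(name):
    """Mapper body: time the search and count the misses."""
    entity_id = timed("TOTAL_SEARCH_TIME", search_cache, name)
    if entity_id is None:
        counters["TOTAL_NOT_IN_CACHE"] += 1
    return entity_id

results = [map_record(n) for n in ["hey jude", "unknown", "hey jude"]]
```

Because the counters are summed across all tasks, the totals show where the job as a whole spends its time, not just a single mapper.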

Ctrending MR Profiling Results
Counter                                 Value
TOTAL_ARTISTS_NMS_FOUND                    77
TOTAL_ARTISTS_NOT_IN_CACHE              1,904
TOTAL_CANDIDATES_FORMATTING_TIME       67,873
TOTAL_HBASE_GET_TIME                  262,647
TOTAL_NORMALIZATION_TIME               22,452
TOTAL_SEARCH_ARTIST_TIME              611,066
TOTAL_SEARCH_CALCULATION_TIME           5,605
TOTAL_SEARCH_NMS_TIME               3,740,552
TOTAL_SEARCH_TIME                   4,098,270
TOTAL_SEARCH_TRACK_TIME             3,486,978
TOTAL_TRACKS_NMS_FOUND                    145
TOTAL_TRACKS_NOT_IN_CACHE               4,635

Ctrending MR Code Tuning
• Tuning Tweets Count MapReduce
  • Applied IntSumReducer as combiner.
  • Adjusted the HBase Scan to fetch and copy records in blocks of thousands, in order to optimize network usage between nodes.
  • Also set blockCache to false, as this table is always read sequentially in a single pass.
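
A back-of-the-envelope calculation in Python shows why fetching in blocks matters. It assumes one table row per tweet, uses the 2.2 million average from the execution summary, and takes 1,000 rows per fetch as an example of "blocks of thousands"; the real values used in the job are not given. In the HBase Java client these two settings presumably correspond to Scan.setCaching and Scan.setCacheBlocks.

```python
import math

def scanner_round_trips(total_rows, rows_per_fetch):
    """Approximate number of client/server fetches for a full table scan."""
    return math.ceil(total_rows / rows_per_fetch)

TOTAL_ROWS = 2_200_000  # average tweets per run; assumes one row per tweet

row_at_a_time = scanner_round_trips(TOTAL_ROWS, 1)
blocks_of_1000 = scanner_round_trips(TOTAL_ROWS, 1000)
```

Under these assumptions the scan drops from 2,200,000 fetches to 2,200, at the cost of holding more rows in memory per fetch.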

Ctrending MR Code Tuning
• Tuning Entity Id Search MapReduce
  • Removed unnecessary split/indexOf calls
  • Removed redundant object creation from the map method

Ctrending MR Code Tuning
• Tuning Entity Id Search MapReduce
  • Profiling results show that NMS Search is the bottleneck
    • It accounts for more than 90% of the total MapReduce execution time
  • They also show that NMS Search is not adding enough value
    • It finds only 4% of the artist IDs not in cache
    • It finds only 3% of the track IDs not in cache
  • This drove the decision to remove NMS search by simply referencing the CustomCache ISearchStrategy implementation in the Mapper setup method
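
These percentages can be checked against the profiling counters. The Python below assumes the figures are the ratios shown (NMS search time over total search time, and NMS hits over cache misses), which is an interpretation of the slides rather than something they state.

```python
# Values taken from the profiling results counters.
counters = {
    "TOTAL_ARTISTS_NMS_FOUND": 77,
    "TOTAL_ARTISTS_NOT_IN_CACHE": 1_904,
    "TOTAL_TRACKS_NMS_FOUND": 145,
    "TOTAL_TRACKS_NOT_IN_CACHE": 4_635,
    "TOTAL_SEARCH_NMS_TIME": 3_740_552,
    "TOTAL_SEARCH_TIME": 4_098_270,
}

def percent(part, whole):
    """Share of whole taken by part, as a percentage with one decimal."""
    return round(100 * part / whole, 1)

nms_share_of_search = percent(
    counters["TOTAL_SEARCH_NMS_TIME"], counters["TOTAL_SEARCH_TIME"]
)
artists_found_by_nms = percent(
    counters["TOTAL_ARTISTS_NMS_FOUND"], counters["TOTAL_ARTISTS_NOT_IN_CACHE"]
)
tracks_found_by_nms = percent(
    counters["TOTAL_TRACKS_NMS_FOUND"], counters["TOTAL_TRACKS_NOT_IN_CACHE"]
)
```

The results come out at roughly 91%, 4% and 3%, in line with the figures quoted on the slide.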

Hbase Configuration Tuning
• The Artists and Tracks cache is an inverted index structure stored in HBase tables.
• These tables see a high level of random access to their records (Get operations) while the Entity Id Search MapReduce performs searches on the cache.
• Performance could be improved if the cache table blocks were kept in the RegionServer's memory.
• HBase provides a schema-level property (IN_MEMORY, set per column family) that raises the priority of a table's blocks for being kept in the RegionServer's memory.

Hbase Configuration Tuning
• Additional configuration is required on the HBase RegionServer so that blocks can stay cached most of the time.
  • hbase.regionserver.global.memstore.upperLimit -> defines the maximum % of heap available to memstores before put operations are actually written to disk files.
  • hbase.regionserver.global.memstore.lowerLimit -> defines the minimum % of heap available to memstores. Flush operations free memstore memory until this limit is reached.
  • hfile.block.cache.size -> % of heap used to store blocks in memory.

Hbase Configuration Tuning
• Most Ctrending HBase put operations are done in batch jobs (Twitter Crawler).
• The music entities cache requires many Get operations while EntityIdSearchMR is executing.
• Simply marking the cache tables as in-memory does not work if there is not enough memory available.
• More memory can be made available for cache table blocks on RegionServers by decreasing the % of heap reserved for memstores and increasing it for the block cache.
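
The trade-off can be illustrated with a small heap budget calculation in Python. The 4 GB heap and both sets of fractions are hypothetical, not the values used in the project. The 0.8 ceiling is an assumption based on the startup check that HBase releases using these property names apply to the combined memstore and block cache fractions.

```python
def heap_budget(heap_mb, memstore_upper, block_cache):
    """Split a RegionServer heap between memstores and the block cache."""
    # Assumption: HBase refuses to start when the two fractions together
    # exceed 0.8 of the heap, leaving 20% for everything else.
    if memstore_upper + block_cache > 0.8:
        raise ValueError("memstore + block cache must leave 20% of heap free")
    return {
        "memstore_mb": round(heap_mb * memstore_upper),
        "block_cache_mb": round(heap_mb * block_cache),
    }

HEAP_MB = 4096  # hypothetical RegionServer heap size

# Hypothetical write-oriented split versus a read-oriented one.
write_heavy = heap_budget(HEAP_MB, memstore_upper=0.40, block_cache=0.25)
read_heavy = heap_budget(HEAP_MB, memstore_upper=0.25, block_cache=0.50)
```

With these example numbers the block cache doubles from 1,024 MB to 2,048 MB, while memstores shrink from about 1,638 MB to 1,024 MB. That fits a workload where writes arrive in batches and reads are frequent random Gets.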

Ctrending MR Tuning Results
• TweetsCountMR
  • Total execution time before tuning: 46 minutes (average)
  • Total execution time after tuning: 20 minutes (average)
• EntityIdSearchMR
  • Total execution time before tuning: 1 hour and 10 minutes (average)
  • Total execution time after tuning: 6 minutes (average)
• CONCLUSION: Do not ever perform HTTP requests in MapReduce jobs again!!!
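
Expressed as speedup factors, using only the averages reported here (a quick Python calculation, with 1 hour and 10 minutes written as 70 minutes):

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the job runs after tuning."""
    return round(before_minutes / after_minutes, 1)

tweets_count_speedup = speedup(46, 20)
entity_id_search_speedup = speedup(70, 6)

# Minutes saved per run across the two tuned jobs.
minutes_saved = (46 - 20) + (70 - 6)
```

That is about 2.3x for TweetsCountMR and 11.7x for EntityIdSearchMR, or 90 minutes saved per run across the two jobs.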

Refactoring
• Write a batch process to read the generated rankings and perform requests to NMS for music entities whose ID was not found.
  • Better to implement this as a standalone multi-threaded Java process instead of a MapReduce job
    • As the input file is small (the filtered rank), Hadoop's default InputFormat implementations will not split it into many Map tasks.
    • Unless a custom InputFormat is implemented, a MapReduce job for this will probably take a long time to execute, as it will end up with a single Map task requesting NMS for all unknown IDs
• Optimize heap usage in the other MRs by avoiding object creation in map methods.
• Enhance code quality (and even performance) by defining OutputValues for the Trending MRs
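
The shape of such a standalone process is sketched below in Python rather than Java, purely for illustration. The lookup function is a hypothetical stand-in for the HTTP request to NMS, and the names and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_entity_id(name):
    """Hypothetical stand-in for one HTTP request to NMS."""
    known = {"hey jude": 42, "yesterday": 7}
    return known.get(name)

def resolve_unknown(names, workers=8):
    """Look up every name concurrently and return a name -> id mapping."""
    # Threads suit this job: each request mostly waits on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ids = list(pool.map(lookup_entity_id, names))  # keeps input order
    return dict(zip(names, ids))

# Hypothetical entries from the filtered rank whose IDs were not found.
unknown_names = ["hey jude", "no such track", "yesterday"]
resolved = resolve_unknown(unknown_names)
```

A thread pool gives the parallelism that a single Map task would not, without needing a custom InputFormat.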

References
• HBase: The Definitive Guide, Lars George, O'Reilly
• MapReduce Design Patterns, Donald Miner and Adam Shook
• Hadoop official website: http://hadoop.apache.org/

Breaking the Kubernetes Kill Chain: Host Path Mount
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Develop innovative solutions with Nokia Institute of Technology

  • 1. Your natural partner to develop innovative solutions Nokia Institute of Technology Nokia Internal Use Only
  • 2. Agenda
      • MapReduce Summarization Patterns
      • MapReduce Coding Best Practices
      • Ctrending MR Performance Evaluation
          • Ctrending MR Execution Summary
          • Code Profiling
          • Profiling Results
          • Code Tuning
          • HBase Configuration Tuning
          • Tuning Results
          • Refactoring Proposal
  • 3. MapReduce Summarization Patterns
      • Numerical Summarizations
          • General counting of data set records
          • Groups records by a custom key, calculating numerical values per group
          • Known Uses
              • Word count, record count, min/max count, avg/median/standard deviation
  • 4. MapReduce Summarization Patterns
      • Inverted Index
          • Indexes a large data set by keyword
          • The Mapper emits keyword/id pairs and the framework handles most of the work
          • May use IdentityReducer
          • Should benefit from a Partitioner for load balancing
  • 5. MapReduce Summarization Patterns
      • Counting with Counters
          • Leverages the MapReduce framework's counters.
          • Counters are stored in memory locally on each Mapper, then aggregated by the framework.
          • Performs better, but should be limited to a few tens of counter definitions.
      • Known Uses
          • Counting records, counting a small number of groups, summations
  • 6. MapReduce Coding Best Practices
      • Define Output Values
          • Create custom classes extending Writable to be used as Mapper output.
          • This gives cleaner Mapper code and avoids String parsing on the Reducer side.
      • Avoid Local Object Creation
          • The map and reduce methods are invoked in very large loops.
          • Creating local objects inside map or reduce allocates a huge number of objects in the Eden space of the JVM heap's young generation.
          • Reuse global instances to decrease young-generation GC activity.
      • Use Combiners on Counting Summarizations
          • Combiners reduce bandwidth consumption: they apply aggregations locally on the mapper node, before the mapper output is sent to the shuffle-and-sort phase and made available to reducers.
  • 7. Ctrending MR Performance Evaluation
      • Ctrending MR Execution Summary
          • Total MR jobs running: 8
          • Average number of processed tweets: 2.2 million
          • Tweets identified as music related: 10.5%
          • Total execution time: 2 hours and 20 minutes
          • Slowest MapReduce jobs:
              • Tweets Counter: 46 minutes
              • Nokia Entity Id Join: 1 hour and 10 minutes
  • 8. Ctrending MR Code Profiling
      • Mainly applied to the Nokia Id Join Mapper
      • Used the MapReduce framework's Counters to collect execution time metrics
      • Also used Counters to sum the total number of entity ids found in the Nokia Id Join Mapper
      • Static fields had to be created in the search strategy implementations to collect the execution time metrics
  • 9. Ctrending MR Profiling Results
      • TOTAL_ARTISTS_NMS_FOUND: 77
      • TOTAL_ARTISTS_NOT_IN_CACHE: 1,904
      • TOTAL_CANDIDATES_FORMATTING_TIME: 67,873
      • TOTAL_HBASE_GET_TIME: 262,647
      • TOTAL_NORMALIZATION_TIME: 22,452
      • TOTAL_SEARCH_ARTIST_TIME: 611,066
      • TOTAL_SEARCH_CALCULATION_TIME: 5,605
      • TOTAL_SEARCH_NMS_TIME: 3,740,552
      • TOTAL_SEARCH_TIME: 4,098,270
      • TOTAL_SEARCH_TRACK_TIME: 3,486,978
      • TOTAL_TRACKS_NMS_FOUND: 145
      • TOTAL_TRACKS_NOT_IN_CACHE: 4,635
  • 10. Ctrending MR Code Tuning
      • Tuning Tweets Count MapReduce
          • Applied IntSumReducer as the combiner.
          • Adjusted the HBase Scan to fetch and copy records in batches of thousands, to optimize network usage between nodes.
          • Also set blockCache to false, as this table is always read sequentially in a single pass.
  • 11. Ctrending MR Code Tuning
      • Tuning Entity Id Search MapReduce
          • Removed unnecessary split/indexOf calls
          • Removed redundant object creation from the map method
  • 12. Ctrending MR Code Tuning
      • Tuning Entity Id Search MapReduce
          • Profiling results show that NMS Search is the bottleneck
              • It accounts for more than 90% of all MapReduce execution time
          • They also show that NMS Search does not add enough value
              • It finds only 4% of the artist ids not in cache
              • It finds only 3% of the track ids not in cache
          • This drove the decision to remove the NMS search by simply referencing the CustomCache ISearchStrategy implementation in the Mapper setup method
  • 13. HBase Configuration Tuning
      • The Artists and Tracks cache is an inverted-index structure stored in HBase tables.
      • These tables see a high level of random access to their records (Get operations) while the Entity Id Search MapReduce performs searches on the cache.
      • Performance could be improved if the cache table blocks were kept in the RegionServer's memory.
      • HBase provides a table-level configuration property that raises the priority of a table's blocks for being kept in the RegionServer's memory.
  • 14. HBase Configuration Tuning
      • Additional configuration is required on the HBase RegionServer so that blocks can stay in the block cache most of the time.
          • hbase.regionserver.global.memstore.upperLimit: maximum % of heap available for writing in memstores before put operations are actually written to disk files.
          • hbase.regionserver.global.memstore.lowerLimit: minimum % of heap available for writing in memstores. Flush operations free memstore space until this limit is reached.
          • hfile.block.cache.size: % of heap used to store blocks in memory.
  • 15. HBase Configuration Tuning
      • Most Ctrending HBase put operations are done in batch jobs (Twitter Crawler).
      • The music entities cache requires many Get operations while EntityIdSearchMR is executing.
      • Simply marking the cache tables as in-memory does not work if there is not enough memory available.
      • More memory can be made available to cache table blocks on RegionServers by decreasing the % of heap reserved for memstores and increasing the % reserved for the block cache.
  • 16. Ctrending MR Tuning Results
      • TweetsCountMR
          • Total execution time before tuning: 46 minutes (average)
          • Total execution time after tuning: 20 minutes (average)
      • EntityIdSearchMR
          • Total execution time before tuning: 1 hour and 10 minutes (average)
          • Total execution time after tuning: 6 minutes (average)
      • CONCLUSION: Do not ever perform HTTP requests in MapReduce jobs again!!!
  • 17. Refactoring
      • Write a batch process that reads the generated rankings and sends requests to NMS for the music entities whose ID was not found.
          • This is better implemented as a standalone multi-threaded Java process than as a MapReduce job.
          • As the input file is small (the filtered rank), Hadoop's default InputFormat implementations will not split it into many Map tasks.
          • Unless a custom InputFormat is implemented, a MapReduce job for this will probably take a long time to execute, as it will end up with a single Map task requesting all unknown ids from NMS.
      • Optimize heap usage in the other MRs by avoiding object creation in map methods.
      • Enhance code quality (and even performance) by defining OutputValues for the Trending MRs.
  • 18. References
      • HBase: The Definitive Guide, Lars George, O'Reilly
      • MapReduce Design Patterns, Donald Miner, Adam Shook
      • Hadoop official website: http://hadoop.apache.org/
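
The percentages quoted on slide 12 and the gains reported on slide 16 can be re-derived from the deck's own numbers. The Python sketch below is not part of the original presentation. It assumes that each counter name on slide 9 pairs with the value used here (the slide layout was flattened during extraction) and that all time counters share one unit, which the deck does not state. Note that the NMS share is computed against TOTAL_SEARCH_TIME, since the deck gives no counter for total job time.

```python
# Counter values copied from the profiling results slide (slide 9).
# The name/value pairing is an assumption, as the slide layout was flattened.
counters = {
    "TOTAL_ARTISTS_NMS_FOUND": 77,
    "TOTAL_ARTISTS_NOT_IN_CACHE": 1_904,
    "TOTAL_SEARCH_NMS_TIME": 3_740_552,
    "TOTAL_SEARCH_TIME": 4_098_270,
    "TOTAL_TRACKS_NMS_FOUND": 145,
    "TOTAL_TRACKS_NOT_IN_CACHE": 4_635,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of the measured search time spent in NMS (HTTP) searches.
nms_share = percent(counters["TOTAL_SEARCH_NMS_TIME"],
                    counters["TOTAL_SEARCH_TIME"])

# Fraction of cache misses that the NMS search managed to resolve.
artist_hit_rate = percent(counters["TOTAL_ARTISTS_NMS_FOUND"],
                          counters["TOTAL_ARTISTS_NOT_IN_CACHE"])
track_hit_rate = percent(counters["TOTAL_TRACKS_NMS_FOUND"],
                         counters["TOTAL_TRACKS_NOT_IN_CACHE"])

# Average run times in minutes from the tuning results slide (slide 16).
tweets_speedup = 46 / 20   # TweetsCountMR, before / after
entity_speedup = 70 / 6    # EntityIdSearchMR, before / after

print(f"NMS share of search time: {nms_share:.1f}%")
print(f"Artist ids found by NMS:  {artist_hit_rate:.1f}%")
print(f"Track ids found by NMS:   {track_hit_rate:.1f}%")
print(f"TweetsCountMR speedup:    {tweets_speedup:.1f}x")
print(f"EntityIdSearchMR speedup: {entity_speedup:.1f}x")
```

Under these assumptions the NMS search takes roughly 91% of the search time while resolving only about 4% of the missing artist ids and 3% of the missing track ids, which matches the figures on slide 12.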
