Hadoop MapReduce
How to Survive Out-of-Memory Errors
Members: Yoonseung Choi, Soyeong Park
Faculty Mentor: Prof. Harry Xu
Student Mentor: Khanh Nguyen
The International Summer Undergraduate Research Fellowship
1
Outline
• Introduction
• What is MapReduce?
• How does MapReduce work?
• Limitations of MapReduce
• What are our goals?
• Operation test
• Conclusions
2
“There were 5 exabytes of information created
between the dawn of civilization through 2003,
but that much information is now created
every 2 days, and the pace is increasing...”
- Eric Schmidt, former Google CEO
3
Data scientists want to analyze these large data sets
But single machines have limitations in processing these data sets
Furthermore, data sets are now growing very rapidly
How can we handle that?
Distributed processing
We don’t want to have to understand parallelization, fault tolerance, data distribution, and load balancing!
Therefore, we propose ‘MapReduce’, which takes care of:
• parallelization
• fault tolerance
• data distribution
• load balancing
4
MapReduce is a programming model for processing large data sets
Many real-world tasks are expressible in this model
The model is easy to use, even for programmers without experience with parallel and distributed systems
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”.
[Figure: the MapReduce layer and the HDFS layer of Hadoop]
* https://en.wikipedia.org/wiki/Apache_Hadoop
5
What is MapReduce?
A Mapper takes an input and produces a set of intermediate key/value pairs
A Reducer merges together the intermediate values associated with the same intermediate key
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12
6
How does MapReduce work?
- Wordcount program
- A sentence is split into two map tasks
Input: The cat sees the dog, and the dog sees the cat.
Map Phase
Map task 1: "The cat sees the dog" -> cat, 1 / dog, 1 / sees, 1 / the, 2
Map task 2: "and the dog sees the cat" -> cat, 1 / dog, 1 / sees, 1 / the, 2 / and, 1
Reduce Phase
cat, 2 / dog, 2 / sees, 2 / the, 4 / and, 1
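The word count walk-through on this slide can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop code; the function names `map_task` and `reduce_task` are hypothetical, and each map task is assumed to emit one count per distinct word in its split, as the slide shows.

```python
from collections import Counter
import re


def map_task(text):
    """Map: emit one (word, count) pair per distinct word in this split."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def reduce_task(partial_counts):
    """Reduce: merge the counts that share the same key."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# The sentence from the slide, split into two map tasks.
splits = ["The cat sees the dog", "and the dog sees the cat."]
map_outputs = [map_task(s) for s in splits]
result = reduce_task(map_outputs)

for word, count in sorted(result.items()):
    print(f"{word}, {count}")
```

In a real job the two map tasks would run on different workers and the merge would happen in separate reduce tasks; here everything runs in one process to keep the data flow visible.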
7
Limitations of MapReduce
There are many reasons for poor performance
And even experts sometimes can’t figure them out
8
What are our goals?
• Research Out-of-Memory (OOM) error cases
• Document OOM cases
• Implement and simulate StackOverflow OOM cases
• Develop solutions for such OOM cases
… all done!!
9
Two Categories
1. Inappropriate Configuration
Configuration which causes poor performance
2. Large Intermediate Results
Temporary data structure grows too large
[3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs,” Chinese Academy of Sciences.
10
Operation test environments
1. Standalone & Pseudo-distributed mode
- ‘14 MacBook Pro, 2.8 GHz Intel Core i5
8GB 1600 MHz DDR3, 500GB HDD
- ‘12 MacBook Air, 1.4 GHz Intel Core i5
4GB 1600 MHz DDR3, 256GB HDD
2. Fully-distributed mode
- Raspberry Pi 2 Model B (3 nodes)
Quad-core ARM Cortex-A7 CPU (overclocked to 1 GHz)
1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
11
Split size variation [Single node]
* ‘14 MacBook Pro, 2.8 GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD
Input: StackOverflow’s users profiles (1GB)
[Figure: two charts of running time (sec) against split size (16, 32, 64, 128, 256 MB). Panels: Distributed grep (no Reducer); Standard deviation of users’ age. Series: Standalone; Pseudo-distributed (2 Mapper, 2 Reducer); Pseudo-distributed (4 Mapper, 4 Reducer).]
Data labels, first chart (sec): 173.3, 88.3, 47.3, 26.7, 24.3 / 204, 117.3, 86.3, 64.7, 56.3 / 169.3, 117.3, 78.7, 59, 55
Data labels, second chart (sec): 169.7, 85.7, 43, 23, 23.3 / 172.7, 103.7, 64.7, 48.7, 37.7 / 129.7, 77.7, 55, 39, 32.7
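One factor that may help in reading these measurements is how the split size determines the number of map tasks. The sketch below assumes the 1GB input is 1024 MB in a single file and that the number of map tasks is roughly the input size divided by the split size, rounded up; the helper name `num_map_tasks` is hypothetical.

```python
import math

INPUT_MB = 1024  # assumption: the 1GB input is 1024 MB in one file
SPLIT_SIZES_MB = [16, 32, 64, 128, 256]  # split sizes tested on the slide


def num_map_tasks(input_mb, split_mb):
    """Approximate number of map tasks: one per input split."""
    return math.ceil(input_mb / split_mb)


tasks = {split_mb: num_map_tasks(INPUT_MB, split_mb) for split_mb in SPLIT_SIZES_MB}
for split_mb, n in tasks.items():
    print(f"{split_mb:>3} MB split -> {n} map tasks")
```

Under these assumptions a 16 MB split yields 64 map tasks and a 256 MB split only 4, so per-task start-up overhead is paid far more often at the small end of the range.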
12
Split size variation [Single node]
* ‘12 MacBook Air, 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
[Figure: two charts of running time (sec) against split size (16, 32, 64, 128, 256 MB). Panels: Standard deviation of comment’s text length; Count Min and Max value. Series: Standalone; Pseudo-distributed (2 Mapper, 2 Reducer); Pseudo-distributed (4 Mapper, 4 Reducer).]
Data labels, first chart (sec): 1577.7, 807.7, 425, 411, 312.3 / 1586.3, 831, 634, 454.3, 299 / 1590, 803.7, 540.3, 397.7, 323
Data labels, second chart (sec): 1469, 783, 398, 389.3, 281.3 / 1614, 610.7, 612, 418.7, 294.3 / 1598, 609, 488, 362.7, 254.3
13
Split size variation [Fully-distributed]
* Raspberry Pi 2 Model B (3 nodes), quad-core ARM Cortex-A7 CPU (overclocked to 1 GHz), 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
Input: StackOverflow’s users profiles (1GB)
[Figure: two charts of running time (sec) against split size (MB). Panels: Distributed grep (no Reducer); average users’ age based on countries. Series: 6 Mapper; 12 Mapper.]
Data labels, first chart (sec), split sizes 32, 64, 128, 256 MB: 375, 396, 442, 548 / 313, 296, 350, 557
Data labels, second chart (sec), split sizes 16, 32, 64, 128, 256 MB: 462.7, 428.7, 476.7, 561.7, 604 / 333.3, 303, 345, 33…, 603
14
io.sort.mb variation [Single node]
* ‘12 MacBook Air, 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
Test program: Standard deviation of comment’s text length
[Figure: running time (sec) against io.sort.mb (20, 40, 80, 160, 320 MB). Series: Standalone; Pseudo-distributed (2M2R); Pseudo-distributed (4M2R).]
Data labels (sec): 872, 827, 814, 798, 803.7 / 661, 638.7, 632, 629.7, 629.7 / 633.7, 641, 635.7, 629.3, 629
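A rough way to think about why a larger io.sort.mb can help: the map-side sort buffer is written to disk (spilled) whenever it fills past a threshold, and fewer spills mean less disk I/O and merging. The sketch below is a simplification under stated assumptions: a spill threshold of 80% of the buffer, no space reserved for record metadata, and a hypothetical 256 MB of map output per task. The name `estimated_spills` is made up for this illustration.

```python
MAP_OUTPUT_MB = 256  # hypothetical map output per task, not taken from the slides
SORT_BUFFER_SIZES_MB = [20, 40, 80, 160, 320]  # io.sort.mb values tested on the slide


def estimated_spills(map_output_mb, sort_buffer_mb, spill_percent=80):
    """Rough number of spill files written by one map task.

    Computes ceil(map_output / (sort_buffer * spill_percent / 100))
    using integer arithmetic only.
    """
    usable_x100 = sort_buffer_mb * spill_percent
    return -(-(map_output_mb * 100) // usable_x100)  # ceiling division


spills = {mb: estimated_spills(MAP_OUTPUT_MB, mb) for mb in SORT_BUFFER_SIZES_MB}
for mb, n in spills.items():
    print(f"io.sort.mb={mb:>3} -> about {n} spill(s) per map task")
```

Once the buffer is large enough to hold a task's whole output, the estimate stays at one spill, which would be consistent with running times levelling off at the larger settings.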
15
I am working well with small datasets like 200-500MB.
But for datasets above 1GB, I am getting an error like this:
* http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset
2. Large Intermediate Results
16
Problem Investigation
The Mapper
Split input files (1.3 GB) -> Task 1, Task 2, Task 3, Task 4, Task 5
Each task produces intermediate key/value pairs [K, V]
Intermediate key/value pairs: 4.8 GB in total, almost 1 GB per task
17
Problem Investigation
The Reducer
Intermediate key/value pairs [K, V]: 4.8 GB in total, almost 1 GB per task
“I just have 1GB heap space!”
The Java heap can’t hold the intermediate data structure
18
Configuration was:
1.3GB Input, 256MB Split size, 1024MB Java Heap Space
Error: Java heap space
19
Summary of Solutions
• Modify the configuration parameters

Split size    Java heap 1024MB    Java heap 2048MB
128 MB        Successful          Successful
256 MB        Failed              Successful

• Alter the program’s algorithm
: An alternative solution was suggested on the site
-> It succeeded with the configuration that failed for the original version
( 256MB Split size & 1024MB Java heap size )
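The results above can be checked against a back-of-the-envelope model. This sketch assumes that intermediate data grows in proportion to the input (4.8 GB from 1.3 GB, as reported on the earlier slides), that a task keeps all intermediate data for its split in memory, and that about 80% of the Java heap is usable for that data. The 80% figure and the names `intermediate_mb` and `fits` are assumptions made for this illustration, not measured values.

```python
INPUT_GB = 1.3          # input size from the slides
INTERMEDIATE_GB = 4.8   # intermediate key/value pairs from the slides
EXPANSION = INTERMEDIATE_GB / INPUT_GB  # about 3.7x growth

USABLE_HEAP_FRACTION = 0.8  # assumption: share of the heap available for data


def intermediate_mb(split_mb):
    """Estimated intermediate data produced from one input split."""
    return split_mb * EXPANSION


def fits(split_mb, heap_mb):
    """True if one split's intermediate data fits in the usable heap."""
    return intermediate_mb(split_mb) <= heap_mb * USABLE_HEAP_FRACTION


for split_mb in (128, 256):
    for heap_mb in (1024, 2048):
        outcome = "Successful" if fits(split_mb, heap_mb) else "Failed"
        print(f"split {split_mb} MB, heap {heap_mb} MB: "
              f"~{intermediate_mb(split_mb):.0f} MB intermediate -> {outcome}")
```

With these assumptions a 256 MB split yields roughly 945 MB of intermediate data, which matches the "almost 1 GB" per task noted earlier and does not fit in a 1024 MB heap, while halving the split size or doubling the heap makes it fit.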
20
Conclusions
• How to address poor performance
1. Adjust ‘split size’ & ‘sort space’
- the larger the size, the less time the job takes
2. Adjust the number of Mappers
- Utilize all CPU cores
- A larger number of Mappers is not always better
• If the intermediate data structure is too large,
- Modify the configuration parameters or
- Alter the program’s algorithm
21
References
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. [Online].
Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf
[2] 한기용, Do it! 직접 해보는 하둡 프로그래밍 [Do it! Hands-on Hadoop Programming]. Seoul: EasysPublishing, 2013.
[3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs,” Chinese Academy of Sciences.
[4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly Media, Inc., 2012.
22
Thank You
If you want more technical details,
please visit our GitHub repository.
Our project is Open Source.
https://github.com/I-SURF-Hadoop/MapReduce
23
Appendix
How does MapReduce really work?
24
How does MapReduce work?
[ Map Phase ]
The MapReduce library first splits the input into M pieces. A map worker processes these pieces using a user-defined Map function. Intermediate key/value pairs are produced by this function.
Input: The cat sees the dog, and the dog sees the cat.
Map task input: The cat sees the dog
Map output: the, 1 / cat, 1 / sees, 1 / the, 1 / dog, 1
Combining & Sorting: cat, 1 / dog, 1 / sees, 1 / the, 2
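The map-side steps on this slide can be sketched as follows. This is a plain Python illustration, not Hadoop's API; `map_function` and `combine_and_sort` are hypothetical names. The Map function emits one (word, 1) pair per word, and the combining and sorting step then orders the pairs by key and sums the values of equal keys.

```python
from itertools import groupby
import re


def map_function(text):
    """User-defined Map: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for word in re.findall(r"[a-z]+", text.lower())]


def combine_and_sort(pairs):
    """Sort pairs by key, then sum the values of pairs that share a key."""
    ordered = sorted(pairs)
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


pairs = map_function("The cat sees the dog")
combined = combine_and_sort(pairs)
print(pairs)
print(combined)
```

Combining on the map side shrinks the intermediate data before it is sent to the reducers, which matters when intermediate results are large.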
25
How does MapReduce work?
[ Reduce Phase ]
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding values to the user’s Reduce function.
Input: The cat sees the dog, and the dog sees the cat.
Map outputs: cat, 1 / dog, 1 / sees, 1 / the, 2 and cat, 1 / dog, 1 / sees, 1 / the, 2 / and, 1
Shuffling
Two independent reducers
Reduce outputs: sees, 2 / the, 4 / cat, 2 / dog, 2 / and, 1
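The shuffle and reduce steps described on this slide can be sketched as below. This is a single-process illustration with hypothetical names (`partition`, `shuffle`, `reduce_worker`); the CRC32-based partitioner is an assumption chosen only to be deterministic, so which keys land on which of the two reducers may differ from the slide.

```python
import zlib
from itertools import groupby

NUM_REDUCERS = 2  # the slide uses two independent reducers


def partition(key, num_reducers=NUM_REDUCERS):
    """Assign a key to a reducer (assumed partitioner: CRC32 modulo reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers=NUM_REDUCERS):
    """Route every intermediate pair to the reducer that owns its key."""
    buckets = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets


def reduce_worker(pairs, reduce_fn):
    """Sort by key, then call the Reduce function once per unique key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


# Outputs of the two map tasks from the slide.
map_outputs = [
    [("cat", 1), ("dog", 1), ("sees", 1), ("the", 2)],
    [("cat", 1), ("dog", 1), ("sees", 1), ("the", 2), ("and", 1)],
]
buckets = shuffle(map_outputs)
results = [reduce_worker(b, lambda key, values: sum(values)) for b in buckets]
merged = {key: total for r in results for key, total in r.items()}
print(results)
```

Because every occurrence of a key is routed to the same reducer, the reducers never need to coordinate with each other, which is what lets them run independently.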
26

[150824]symposium v4

  • 1. Hadoop MapReduce How to Survive Out-of-Memory Errors Member: Yoonseung Choi Soyeong Park Faculty Mentor: Prof. Harry Xu Student Mentor: Khanh Nguyen The International Summer Undergraduate Research Fellowship 1
  • 2. Outline • Introduction • What is MapReduce? • How does MapReduce work? • Limitations of MapReduce • What are our goals? • Operation test • Conclusions 2
  • 3. “There was 5 exabytes of information created between the dawn of civilization through 2003, But that much information is now created every 2 days, and the pace is increasing...” - Eric Schmidt, The Former Google CEO 3
  • 4. Data scientists want to analyze these large data sets But single machines have limitations in processing these data sets How can we handle that? Furthermore, data sets are now growing very rapidly We don’t want to understand parallelization, fault tolerance, data distribution, and load balancing! Distributed processing Therefore, we purpose The ‘MapReduce’ parallelization fault tolerance data distribution load balancing 4
  • 5. MapReduce is a programming model for processing large data sets Many real world tasks are expressible in this model The model is easy to use, even for programmers without experience with parallel and distributed systems [1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. * https://en.wikipedia.org/wiki/Apache_Hadoop MapReduce Layer HDFS Layer 5
  • 6. What is MapReduce? Mapper takes an input and produces a set of intermediate key/value pairs Reducer merges together these intermediate values associated with the same intermediate key [1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12 6
  • 7. How does MapReduce work? The cat sees the dog, and the dog sees the cat. The cat sees the dog Andthedogseesthecat cat, 1 dog, 1 sees, 1 the, 2 cat, 1 dog, 1 sees, 1 the, 2 and, 1 cat, 2 dog, 2 sees, 2 the, 4 and, 1 - Wordcount program - A sentence is split into two map tasks Map Phase Reduce Phase 7
  • 8. Limitations of MapReduce There are many reasons for poor performance And even experts sometimes can’t figure them out 8
  • 9. What are our goals? • Research Out-of-Memory Error(OOM) cases • Document OOM cases • Implement and simulate StackOverflow OOM cases • Develop solutions for such OOM cases … all done!! 9
  • 10. Two Categories 1. Inappropriate Configuration Configuration which causes poor performance 2. Large Intermediate Results Temporary data structure grows too large [3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs, Chinese Academy of Sciences. 10
  • 11. Operation test environments 1. Standalone & Pseudo-distributed mode - ‘14 MacBook Pro, 2.8 GHz Intel Core i5 8GB 1600 MHz DDR3, 500GB HDD - ‘12 MacBook Air 1.4, GHz Intel Core i5 4GB 1600 MHz DDR3, 256GB HDD 2. Fully-distributed mode - Raspberry Pi 2 Model B (3 nodes) A quad-core ARM Cortex-A7 CPU (1Ghz Overclock) 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet 11
  • 12. Split size variation [Single node] * ‘14 MacBook Pro 2.8, GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD Input: StackOverflow’s users profiles (1GB) 173.3 88.3 47.3 26.7 24.3 204 117.3 86.3 64.7 56.3 169.3 117.3 78.7 59 55 0 50 100 150 200 16 32 64 128 256 (sec) 169.7 85.7 43 23 23.3 172.7 103.7 64.7 48.7 37.7 129.7 77.7 55 39 32.7 0 50 100 150 200 16 32 64 128 256 [ Distributed grep (no Reducer) ][ Standard deviation of users’ age ] (MB) (sec) (MB) Standalone Pseudo-distributed (2Mapper 2Reducer) Pseudo-distributed (4Mapper 4Reducer) 12
  • 13. [ ] Split size variation [Single node] * ‘12 MacBook Air 1.4, GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD Input: StackOverflow’s Comments (8.5GB) 1577.7 807.7 425 411 312.3 1586.3 831 634 454.3 299 1590 803.7 540.3 397.7 323 250 550 850 1150 1450 16 32 64 128 256 Standard deviation of comment’s text length Count Min and Max value Standalone Pseudo-distributed (2Mapper 2Reducer) Pseudo-distributed (4Mapper 4Reducer) 1469 783 398 389.3 281.3 1614 610.7 612 418.7 294.3 1598 609 488 362.7254.3 250 550 850 1150 1450 1750 16 32 64 128 256 [ ] (MB)(MB) (sec) (sec) 13
  • 14. Split size variation [Fully-distributed] Input: StackOverflow’s users profiles (1GB) 375 396 442 548 313 296 350 557 0 200 400 600 800 32 64 128 256 * Raspberry Pi 2 Model B (3 nodes) A quad-core ARM Cortex-A7 CPU (1Ghz Overclock) 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet 462.7 428.7 476.7 561.7 604 333.3 303 345 33… 603 0 200 400 600 800 16 32 64 128 256 [ Distributed grep (no Reducer) ][ average users’ age based on countries ] 6 Mapper 12 Mapper (MB)(MB) (sec) (sec) 14
  • 15. io.sort.mb variation [Single node] * ‘12 MacBook Air 1.4, GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD Input: StackOverflow’s Comments (8.5GB) Test program: Standard deviation of comment’s text length 872 827 814 798 803.7 661 638.7 632 629.7 629.7 633.7 641 635.7 629.3 629600 700 800 900 20 40 80 160 320 Stand alone Pseudo distributed; 2M2R Pseudo distributed; 4M2R (MB) (sec) 15
  • 16. I am working well with small datasets like 200-500MB. But for datasets above 1GB, I am getting an error like this: * http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset 2. Large Intermediate Results 16
  • 17. Problem Investigation. [Diagram: the split input files feed Tasks 1 to 5 of the Mapper, and each task emits intermediate key/value pairs [K, V]. Size labels in the diagram: 1.3 GB, 4.8 GB, almost 1 GB.]
  • 18. Problem Investigation. [Diagram: the intermediate key/value pairs [K, V] arrive at the Reducer. Size labels in the diagram: 4.8 GB, almost 1 GB.] Reducer: “I just have 1GB heap space!” The Java heap cannot hold the intermediate data structure.
  • 19. Configuration: 1.3GB input, 256MB split size, 1024MB Java heap space. Error: Java heap space
  • 20. Summary of Solutions • Modify the configuration parameters. Results by split size and Java heap size: 128 MB split: successful with both a 1024MB and a 2048MB heap; 256 MB split: failed with a 1024MB heap, successful with a 2048MB heap. • Alter the program’s algorithm: an alternative solution suggested on the site succeeded under the configuration where the original version failed (256MB split size and 1024MB Java heap size).
  • 21. Conclusions • How to fix poor performance: 1. Adjust the ‘split size’ and ‘sort space’ (io.sort.mb): in our tests, larger values generally reduced running time. 2. Adjust the number of mappers: use all CPU cores, but a larger number of mappers is not always better. • If the intermediate data structure is too large: modify the configuration parameters, or alter the program’s algorithm.
  • 22. References [1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. [Online]. Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf [2] 한기용, Do it! 직접 해보는 하둡 프로그래밍 [Do it! Hands-on Hadoop Programming]. Seoul: EasysPublishing, 2013. [3] Lijie Xu, “An Empirical Study on Real-World OOM Cases in MapReduce Jobs”, Chinese Academy of Sciences. [4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly Media, Inc., 2012.
  • 23. Thank You. If you want more technical information, please visit our GitHub repository. Our project is open source. https://github.com/I-SURF-Hadoop/MapReduce
  • 24. Appendix: How does MapReduce really work?
  • 25. How does MapReduce work? [ Map Phase ] The MapReduce library first splits the input into M pieces. A map worker processes these pieces using a user-defined Map function, which produces intermediate key/value pairs. Example input: “The cat sees the dog, and the dog sees the cat.” The split “The cat sees the dog” is mapped to (the, 1), (cat, 1), (sees, 1), (the, 1), (dog, 1); combining and sorting then gives (cat, 1), (dog, 1), (sees, 1), (the, 2).
  • 26. How does MapReduce work? [ Reduce Phase ] When a reduce worker has read all intermediate data, it sorts them by the intermediate keys. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and its values to the user’s Reduce function. Example input: “The cat sees the dog, and the dog sees the cat.” The two map outputs, (cat, 1), (dog, 1), (sees, 1), (the, 2) and (and, 1), (cat, 1), (dog, 1), (sees, 1), (the, 2), are shuffled to two independent reducers, which together produce (sees, 2), (the, 4), (cat, 2), (dog, 2), (and, 1).
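
The map, combine, shuffle, and reduce steps from slides 25 and 26 can be traced in a few lines of Python. This is only a single-process sketch of the data flow, not Hadoop code: the function names (`map_phase`, `combine`, `shuffle`, `reduce_phase`) are made up for illustration, and the two input splits are assumed to be the two halves of the example sentence.

```python
import re
from collections import Counter, defaultdict

def map_phase(split):
    """Emit one (word, 1) pair per word, like a user-defined Map function."""
    return [(word, 1) for word in re.findall(r"[a-z]+", split.lower())]

def combine(pairs):
    """Sum the counts per key inside one map task, then sort by key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

def shuffle(map_outputs):
    """Gather all values that share a key, across every map task."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Apply a summing Reduce function to each key, in sorted key order."""
    return {key: sum(values) for key, values in sorted(grouped.items())}

# Hypothetical splits: the example sentence cut into two pieces.
splits = ["The cat sees the dog", "and the dog sees the cat"]
map_outputs = [combine(map_phase(s)) for s in splits]
word_counts = reduce_phase(shuffle(map_outputs))

print(map_outputs[0])  # [('cat', 1), ('dog', 1), ('sees', 1), ('the', 2)]
print(word_counts)     # {'and': 1, 'cat': 2, 'dog': 2, 'sees': 2, 'the': 4}
```

In a real cluster the grouped keys would be partitioned across several reducers (two in the slide); the sketch keeps them in one dictionary because the final counts are the same either way.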

Editor's notes

  1. Anteater is so cute
  2. From the speech given at the Techonomy conference (’10) in Lake Tahoe  http://readwrite.com/2010/08/04/google_ceo_schmidt_people_arent_ready_for_the_tech
  3. [1, p.12] – map > emit * ADD AN ANIMATION
  4. - Check the number of configuration parameters stated in the paper. From now on, the content is a little more technical, so don’t fall asleep. Many programming models that use MapReduce are implemented in managed languages such as Java. These rely on a garbage collector, which sometimes causes problems.
  5. I want to tell you what we are doing now
  6. We surveyed several papers and found recurring patterns that cause OOM errors. These patterns fall into three categories.
  7. Just show that the running time decreases
  8. Just show that the running time decreases
  9. Why does the graph rise? Because a 256 MB split size yields just 4 map tasks, which means 2 of the 6 mappers do no work. So we need a bigger input.
  10. Explain verbally that the factor value is 1/10 of io.sort.mb. Just show that the running time decreases
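
The reasoning in note 9 can be checked with a little arithmetic. The sketch below assumes one map task per input split and treats the 1 GB users-profile input as exactly 1024 MB; `map_task_count` and `idle_mappers` are illustrative helper names, not Hadoop functions.

```python
import math

def map_task_count(input_mb, split_mb):
    """Number of input splits, assuming one map task per split."""
    return math.ceil(input_mb / split_mb)

def idle_mappers(input_mb, split_mb, mapper_slots):
    """Mapper slots left without work when tasks are fewer than slots."""
    return max(0, mapper_slots - map_task_count(input_mb, split_mb))

INPUT_MB = 1024  # assumption: the 1 GB input is exactly 1024 MB

for split_mb in (16, 32, 64, 128, 256):
    tasks = map_task_count(INPUT_MB, split_mb)
    idle = idle_mappers(INPUT_MB, split_mb, mapper_slots=6)
    print(f"split={split_mb:>3} MB  map tasks={tasks:>2}  idle of 6 mappers={idle}")
```

Under these assumptions only the 256 MB split leaves mappers idle (4 tasks for 6 mappers); every smaller split size produces at least 8 tasks, enough to keep all 6 mappers busy.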