Ing. Vladimír Hanušniak
University of Žilina, March 2014

• Brief review
• Parallel processing
• Hadoop
  ◦ HDFS (Hadoop Distributed File System)
  ◦ MapReduce
• Example

With no signs of slowing, Big Data keeps growing.

• X-ray: 30 MB
• 3D CT scan: 1 GB
• 3D MRI: 150 MB
• Mammograms: 120 MB
• Growth: 20-40% per year
• Preemies' health
  ◦ University of Ontario & IBM
  ◦ 16 different data streams
  ◦ 1260 data points per second
• Early treatment

• Data structure and storage
• Analytical methods and processing power
• Parallelization is needed

• Task decomposition (HPC Uniza)
  ◦ Computationally expensive tasks
  ◦ Move the data to the processing
  ◦ Execution order
  ◦ Shared data storage

• Slow HDD read speed!
• HDD read speed: ~100 MB/s
  ◦ Reading 1000 GB takes 10,000 s (about 166.7 min)
• 100 machines reading in parallel: about 100 s (1.7 min)
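
The arithmetic on this slide can be reproduced with a short Python sketch. The function name `read_time_seconds` and its parameters are illustrative assumptions. The calculation uses decimal units (1 GB = 1000 MB) and ignores seek time and network overhead, as the slide does.

```python
def read_time_seconds(data_gb, rate_mb_per_s=100, machines=1):
    """Time to read data_gb gigabytes when the data is split evenly
    across `machines` disks that each read at rate_mb_per_s."""
    data_mb = data_gb * 1000  # decimal units, as on the slide
    return data_mb / rate_mb_per_s / machines


single = read_time_seconds(1000)                  # one disk
cluster = read_time_seconds(1000, machines=100)   # 100 disks in parallel
print(f"1 machine:    {single:.0f} s ({single / 60:.1f} min)")
print(f"100 machines: {cluster:.0f} s ({cluster / 60:.1f} min)")
```

The speed-up is linear only under the idealized assumption that the data is already spread evenly over the machines.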

• Data decomposition (Hadoop)
  ◦ Data has a regular structure (type, size)
  ◦ Move the processing to the data

• Hadoop: a framework for processing Big Data
• Two main components:
  ◦ HDFS
  ◦ MapReduce
• Thousands of nodes in a cluster

• HDFS is a distributed, fault-tolerant file system designed to run on commodity hardware
• Main characteristics
  ◦ Scalability
  ◦ High availability
  ◦ Large files
  ◦ Common hardware
  ◦ Streaming data access: write once, read many times

• NameNode
  ◦ Master
  ◦ Controls storage
  ◦ Stores metadata about files
    ▪ Name, path, size, block size, block IDs, ...
• DataNode
  ◦ Slave
  ◦ Stores data in blocks

• Files are stored in blocks
  ◦ Large files are split
    ▪ Block size: 64, 128, 256 MB, …
• File system metadata is stored in NameNode memory
  ◦ Limiting factor
    ▪ About 150 bytes per file, directory, or block object
  ◦ 3 GB of memory is needed for 10 million single-block files (20 million objects × 150 bytes)
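
The memory estimate above can be checked with a small Python sketch. The 150-byte figure is the rule of thumb from the slide. The function name and parameters are hypothetical, and the model assumes every file contributes one file object plus one object per block.

```python
BYTES_PER_OBJECT = 150  # rule of thumb from the slide


def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Rough NameNode heap estimate: one object per file,
    one per block, and one per directory."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


needed = namenode_memory_bytes(10_000_000)
print(f"{needed / 1e9:.1f} GB")  # 10 million single-block files
```

Because the cost is per object rather than per byte, many small files put far more pressure on the NameNode than the same volume of data stored in a few large files.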

• Seek time: 10 ms
• Transfer rate: 100 MB/s
• Block size: about 100 MB, so that the seek time is 1% of the transfer time
• The number of map tasks in a job depends on the block size
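
The block-size reasoning above can be written as a one-line formula. This is a sketch under the slide's numbers; the function name and the `seek_fraction` parameter are illustrative assumptions.

```python
def block_size_mb(seek_ms=10, rate_mb_per_s=100, seek_fraction=0.01):
    """Block size for which the seek time is `seek_fraction`
    of the time spent transferring the block."""
    transfer_s = (seek_ms / 1000) / seek_fraction  # required transfer time
    return transfer_s * rate_mb_per_s


print(block_size_mb())  # block size in MB for a 10 ms seek at 100 MB/s
```

A faster disk or a stricter seek budget pushes the block size up, which is one reason larger blocks became common as transfer rates grew.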

• First replica: on the same node as the client
• Second replica: on a different rack (off-rack)
• Third replica: on the same rack as the second, but on a different node
• Further replicas: random nodes (the system tries to avoid placing too many replicas on the same rack)
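
The placement rules above can be sketched in Python. This is a simplified illustration, not Hadoop's actual implementation: the `topology` dictionary, the function name, and the plain random choice for extra replicas are all assumptions. It also assumes the client runs on a cluster node, that there are at least two racks, and that the chosen remote rack has at least two nodes.

```python
import random


def place_replicas(client_node, topology, replication=3, rng=random):
    """topology maps rack name -> list of node names (hypothetical layout)."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    first = client_node  # first replica: the client's own node
    other_racks = [rack for rack in topology if rack != rack_of[first]]
    second_rack = rng.choice(other_racks)  # second replica: off-rack
    second = rng.choice(topology[second_rack])
    # third replica: same rack as the second, but a different node
    third = rng.choice([n for n in topology[second_rack] if n != second])

    chosen = [first, second, third][:replication]
    # further replicas: random nodes that are not used yet
    remaining = [n for n in rack_of if n not in chosen]
    rng.shuffle(remaining)
    chosen += remaining[: max(0, replication - len(chosen))]
    return chosen


topology = {"rack1": ["a", "b"], "rack2": ["c", "d"]}
print(place_replicas("a", topology))
```

Two replicas share a rack to keep write traffic cheap, while the off-rack copy protects against the loss of a whole rack.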

• Programming model for data processing
  ◦ Functional programming, directed acyclic graph
• Hadoop supports: Java, Ruby, Python, C++
• Associative array
  ◦ <key, value> pairs
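
A minimal in-memory sketch of the model may help here. It is not Hadoop code: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and word counting is used only as a familiar illustration of functions that consume and emit <key, value> pairs.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """Map: emit a <word, 1> pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: sum all values that share the same key."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    result = {}
    for key in sorted(groups):  # keys reach the reducer in sorted order
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result


records = [(0, "big data"), (9, "Big cluster")]
print(run_mapreduce(records, map_fn, reduce_fn))
```

The user writes only the two functions; grouping values by key between them is the framework's job.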

• Job: a unit of work
  ◦ Input data
  ◦ Map and Reduce program
  ◦ Configuration information
• A job is divided into tasks
  ◦ Map tasks
  ◦ Reduce tasks

• JobTracker
  ◦ Coordinates all jobs by scheduling tasks to run on TaskTrackers
  ◦ Keeps records of job progress
  ◦ Reschedules a task if it fails
• TaskTrackers
  ◦ Run tasks
  ◦ Send progress reports to the JobTracker

• Hadoop divides the input to a MapReduce job into fixed-size pieces of work called input splits
• One map task is created per split
  ◦ It runs the user-defined map function
• The split size tends to be the size of an HDFS block
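
As a rough illustration of the splitting described above, the sketch below cuts a file into fixed-size splits and counts the resulting map tasks. The function name and the 128 MB default are assumptions, and real input formats apply additional rules that are ignored here.

```python
def input_splits(file_size_mb, split_size_mb=128):
    """Return (offset, length) pairs in MB; one map task runs per split."""
    splits = []
    offset = 0
    while offset < file_size_mb:
        length = min(split_size_mb, file_size_mb - offset)
        splits.append((offset, length))
        offset += length
    return splits


splits = input_splits(300)
print(len(splits), "map tasks:", splits)
```

Only the last split can be shorter than the split size.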

• Data locality optimization
  ◦ Run the map task on a node where the input data resides in HDFS.
  ◦ Map tasks can be data-local (a), rack-local (b), or off-rack (c).
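
The three cases above can be told apart with a small helper. This is only a sketch: the `rack_of` mapping and the node names are hypothetical, and a real scheduler also has to weigh which nodes have free task slots.

```python
def locality(task_node, replica_nodes, rack_of):
    """Classify a map task by how far it is from its input split."""
    if task_node in replica_nodes:
        return "data-local"   # a replica is on the same node
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return "rack-local"   # a replica is on the same rack
    return "off-rack"         # the data must cross racks


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
for node in ("n1", "n2", "n3"):
    print(node, locality(node, {"n1"}, rack_of))
```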

• Map output: <key, value> pairs
• Written to local disk, NOT to HDFS!
  ◦ Map output is processed by reduce tasks to produce the final output
  ◦ No replicas needed
• The <key, value> pairs are sorted
• If the node fails before the reduce task has consumed the map output, the map task is run again

• The TaskTracker reads the region files remotely (RPC)
• It invokes the reduce function (aggregation)
• The output is stored in HDFS
• Reduce tasks do not have the advantage of data locality
  ◦ The input to a reduce task is the output from all mappers
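
The reduce side can be sketched as a merge of the sorted outputs of several mappers, followed by one reduce call per key. The function name is hypothetical and everything runs in one process, whereas Hadoop fetches the map outputs over the network. The sample values are the temperatures from the data set shown in the Example section.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def shuffle_and_reduce(map_outputs, reduce_fn):
    """Merge per-mapper outputs (each sorted by key), then reduce per key."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


mapper_1 = [("1949", 111), ("1950", 0), ("1950", 22)]
mapper_2 = [("1949", 78), ("1950", -11)]
print(shuffle_and_reduce([mapper_1, mapper_2], lambda year, temps: max(temps)))
```

Because every mapper may hold values for every key, a reducer has to read from all of them, which is why it cannot be scheduled next to its data.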

• A combiner minimizes the data transferred between map and reduce tasks
• It runs on the map output
• "Reduce on the map side"
• It works for max: max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
• It does not work for mean: mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
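
One common workaround for the mean, shown here as a sketch and not taken from the slides, is to let the combiner emit partial (sum, count) pairs and compute the division only in the reducer. The function names are hypothetical.

```python
def combine_mean(values):
    """Combiner: shrink one mapper's values to a (sum, count) pair."""
    return sum(values), len(values)


def reduce_mean(partials):
    """Reducer: add up the partial sums and counts, then divide once."""
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


partials = [combine_mean([0, 20, 10]), combine_mean([25, 15])]
print(reduce_mean(partials))                 # mean of all five values
print(max(max([0, 20, 10]), max([25, 15])))  # max can be combined directly
```

Sums and counts can be merged in any grouping, so this form is safe to run zero, one, or many times on the map side.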

• Java (Ruby, Python, C++)
  ◦ Good for programmers
• Pig
  ◦ Scripting language with a focus on dataflows
  ◦ Uses the Pig Latin language
  ◦ Allows merging, filtering, and applying functions
• Hive
  ◦ Uses HiveQL, which is similar to SQL (used by Facebook)
  ◦ Provides a database query interface
• HBase

• Data set (each record is shown as a pair of byte offset and line; parts of each line are elided)
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
• Find the maximum temperature by year
1901 - 317
1902 - 244
1903 - 289
1904 - 256
1905 - 283
...

#!/usr/bin/env bash
# Print the maximum valid temperature for each yearly file in all/
for year in all/*
do
  echo -ne "$(basename "$year" .gz)\t"
  gunzip -c "$year" |
    awk '{ temp = substr($0, 88, 5) + 0;   # temperature field
           q = substr($0, 93, 1);          # quality code
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
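
The same logic can be expressed as a map function and a reduce function. The sketch below mirrors the field positions of the awk script (1-based columns 88 to 92 for the temperature, column 93 for the quality code). The year position (characters 15 to 18, counting from 0) is an assumption read off the sample records, and the function names are hypothetical.

```python
MISSING = 9999  # marker for a missing reading, as in the awk script


def map_max_temperature(line):
    """Map: emit <year, temperature> for every valid reading."""
    year = line[15:19]        # assumed position, taken from the sample records
    temp = int(line[87:92])   # same field as substr($0, 88, 5)
    quality = line[92]        # same field as substr($0, 93, 1)
    if temp != MISSING and quality in "01459":
        yield year, temp


def reduce_max_temperature(year, temps):
    """Reduce: keep the largest temperature seen for the year."""
    return year, max(temps)


# Synthetic record for illustration only, not real weather data.
record = ("0" * 15 + "1950").ljust(87, "0") + "+0022" + "1"
print(list(map_max_temperature(record)))
print(reduce_max_temperature("1950", [0, 22, -11]))
```

Unlike the shell loop, these two functions carry no notion of files or processes, so the framework is free to run them on whichever nodes hold the data.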

• Run parts of the program in parallel
  ◦ Process different years in different processes
• Problems
  ◦ The pieces are not of equal size
  ◦ Combining the partial results takes processing time
  ◦ Limited by the processing capacity of a single machine
  ◦ Long processing time
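
The effect of unequal pieces can be illustrated with a toy scheduling model. The piece sizes are made-up numbers, the function name is hypothetical, and run time is assumed to be proportional to piece size.

```python
def parallel_run_time(piece_sizes, workers):
    """Greedy schedule: give each piece, largest first, to the least
    loaded worker; the run ends when the busiest worker finishes."""
    loads = [0] * workers
    for size in sorted(piece_sizes, reverse=True):
        loads[loads.index(min(loads))] += size
    return max(loads)


print(parallel_run_time([25, 25, 25, 25], workers=4))  # equal pieces
print(parallel_run_time([70, 10, 10, 10], workers=4))  # one large piece
```

Both inputs contain the same total amount of work, yet the second run is bounded by its largest piece. Splitting the input into equal-size chunks avoids this imbalance.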

Source: Infoware 1-2/2014

• Hadoop: The Definitive Guide, 3rd Edition
  ◦ http://it-ebooks.info/book/635/
• Big Data: A Revolution That Will Transform How We Live, Work, and Think
• http://hadoop.apache.org/
• http://architects.dzone.com/articles/how-hadoop-mapreduce-works

Hadoop - How It Works

  • 1. Ing. Vladimír Hanušniak University of Žilina, March 2014
  • 2.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 2
  • 3.  Brief review  Parallel processing  Hadoop ◦ HDFS (Hadoop Distributed File System) ◦ MapReduce  Example 3
  • 4. With no signs of slowing, Big Data keeps growing.
  • 5. X-ray: 30 MB; 3D CT scan: 1 GB; 3D MRI: 150 MB; mammograms: 120 MB; growing 20-40% per year. Preemies' health (University of Ontario & IBM): 16 different data streams, 1260 data points per second. Early treatment.
  • 6. Data structure and storage; analytical methods and processing power; parallelization is needed.
  • 8. Task decomposition (HPC Uniza): computationally expensive task; move the data to the processing; execution order; shared data storage.
  • 9. Slow HDD read speed! HDD read speed is ~100 MB/s, so reading 1000 GB takes 10,000 s (about 167 min). With 100 machines reading in parallel: about 1.7 min.
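The arithmetic on this slide can be reproduced with a short Python sketch. The function name and parameters are illustrative assumptions; the sketch uses decimal units (1 GB = 1000 MB), as the slide does, and assumes the data is split evenly across machines with no coordination overhead.

```python
def read_time_seconds(data_gb: float, rate_mb_per_s: float = 100.0, machines: int = 1) -> float:
    """Time to read data_gb gigabytes when the read is split evenly across machines."""
    data_mb = data_gb * 1000  # decimal units, as on the slide: 1 GB = 1000 MB
    return data_mb / (rate_mb_per_s * machines)


single = read_time_seconds(1000)
parallel = read_time_seconds(1000, machines=100)

print(f"1 machine:    {single:.0f} s = {single / 60:.1f} min")
print(f"100 machines: {parallel:.0f} s = {parallel / 60:.1f} min")
```

Real clusters will not reach this ideal speed-up, but the estimate shows why reading in parallel matters more than any single faster disk.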
  • 10. Data decomposition (Hadoop): data has a regular structure (type, size); move the processing to the data.
  • 12. Hadoop: a framework for processing Big Data. Two main components: HDFS and MapReduce. Thousands of nodes in a cluster.
  • 13. A distributed, fault-tolerant file system designed to run on commodity hardware. Main characteristics: scalability; high availability; large files; common hardware; streaming data access (write once, read many times).
  • 14. NameNode: master; controls storage; stores metadata about files (name, path, size, block size, block IDs, ...). DataNode: slave; stores data in blocks.
  • 16. Files are stored in blocks; large files are split. Block size: 64, 128, 256 MB, ... File and block metadata is stored in NameNode memory, which is the limiting factor: about 150 bytes per file, directory, or block object, so 3 GB of memory holds about 10 million single-block files.
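The 3 GB figure follows from counting two objects per single-block file (one file object plus one block object). A minimal Python sketch of that estimate, assuming the 150-byte rule of thumb from the slide and ignoring directory objects; the function name is hypothetical:

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure from the slide


def namenode_bytes(files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode memory estimate: one object per file plus one per block.

    Directories are ignored here, which is a simplifying assumption.
    """
    objects = files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


estimate = namenode_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB for 10 million single-block files")
```

The same function shows why many small files are costly: the memory needed grows with the number of objects, not with the amount of data stored.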
  • 17. Seek time: 10 ms. Transfer rate: 100 MB/s. Block size: 100 MB, so the seek time is 1% of the transfer time. The number of map tasks depends on the block size.
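A quick check of the 1% figure, as a Python sketch. Only the 10 ms seek time and the 100 MB/s transfer rate come from the slide; the function name and the other block sizes tried are illustrative assumptions.

```python
def seek_overhead(block_mb: float, seek_ms: float = 10.0, rate_mb_per_s: float = 100.0) -> float:
    """Ratio of seek time to transfer time for one block."""
    transfer_ms = block_mb / rate_mb_per_s * 1000
    return seek_ms / transfer_ms


for block_mb in (1, 64, 100, 128):
    ratio = seek_overhead(block_mb)
    print(f"{block_mb:>4} MB block: seek is {ratio:.1%} of transfer time")
```

Small blocks make the disk spend a large share of its time seeking, which is one reason HDFS blocks are much larger than ordinary file-system blocks.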
  • 20. First replica: same node as the client. Second: off-rack. Third: same rack as the second, on a different node. Further replicas: random nodes (the system tries to avoid placing too many replicas on the same rack).
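The placement rules for the first three replicas can be sketched in Python as follows. This is an illustration of the policy described on the slide, not Hadoop's actual implementation: the function name, the dictionary-based topology, and the node names are all hypothetical, and the sketch assumes at least two racks and at least two nodes on the rack chosen for the second replica.

```python
import random


def place_replicas(client, topology, rng=random):
    """Pick (rack, node) locations for three replicas.

    client:   (rack, node) of the writing client
    topology: dict mapping rack name -> list of node names
    """
    client_rack, client_node = client
    first = (client_rack, client_node)  # same node as the client

    # Second replica: a random node on a different rack.
    other_racks = [rack for rack in topology if rack != client_rack]
    second_rack = rng.choice(other_racks)
    second = (second_rack, rng.choice(topology[second_rack]))

    # Third replica: same rack as the second, but a different node.
    candidates = [node for node in topology[second_rack] if node != second[1]]
    third = (second_rack, rng.choice(candidates))

    return [first, second, third]


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas(("rack1", "n1"), topology, rng=random.Random(0))
print(replicas)
```

With this layout the block survives the loss of a whole rack, while two of the three copies share a rack, which keeps cross-rack write traffic down.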
  • 22. Programming model for data processing (functional programming, directed acyclic graph). Hadoop supports Java, Ruby, Python, C++. Associative array: <key, value> pairs.
  • 23. Job: a unit of work consisting of input data, the Map & Reduce program, and configuration information. A job is divided into tasks: map tasks and reduce tasks.
  • 24. Job tracker: coordinates all jobs by scheduling tasks to run on task trackers; keeps job progress records; reschedules tasks that fail. Task trackers: run tasks; send progress reports to the job tracker.
  • 26. Hadoop divides the input to a MapReduce job into fixed-size pieces of work called input splits. It creates one map task per split, which runs the user-defined map function. The split size tends to be the size of an HDFS block.
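Since one map task is created per split, the number of map tasks for a single file can be estimated from the file size and the split size. A minimal Python sketch, assuming the split size equals the HDFS block size and ignoring any tolerance the real framework applies to the last split; the function name and the 1000 MB example file are hypothetical.

```python
import math


def num_map_tasks(file_size_mb: float, split_size_mb: float = 128) -> int:
    """Estimated number of input splits (and so map tasks) for one file."""
    return math.ceil(file_size_mb / split_size_mb)


for split_mb in (64, 128, 256):
    print(f"1000 MB file, {split_mb} MB splits: {num_map_tasks(1000, split_mb)} map tasks")
```

Smaller splits give more parallelism and better load balancing, but each extra task adds scheduling overhead, so the block size is usually a reasonable compromise.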
  • 27. Data locality optimization: run the map task on a node where the input data resides in HDFS. Map tasks can be data-local (a), rack-local (b), or off-rack (c).
  • 28. Map output: <key, value> pairs, written to local disk, NOT to HDFS. The map output is processed by reduce tasks to produce the final output, so no replicas are needed. The <key, value> pairs are sorted. If a node fails before the reduce step, the map task is run again.
  • 29. The TaskTracker reads region files remotely (RPC) and invokes the Reduce function (aggregation). The output is stored in HDFS. Reduce tasks don't have the advantage of data locality: the input to a reduce task is the output from all mappers.
  • 31. Minimize the data transferred between map and reduce tasks: run on the map output ("reduce on the map side"). This works for max: max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25. It does not work for mean: mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15.
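The max and mean examples can be checked directly in Python. The sketch also shows one common workaround for the mean, which is not on the slide and is added here as an assumption: have the map side emit (sum, count) pairs, which can be combined safely, and divide only at the end.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Output of two map tasks, using the numbers from the slide.
partitions = [[0, 20, 10], [25, 15]]
all_values = [v for part in partitions for v in part]

# max can be applied per partition and then again on the partial results.
max_direct = max(all_values)
max_combined = max(max(part) for part in partitions)

# mean cannot: the mean of the partial means is not the overall mean.
mean_direct = mean(all_values)
mean_of_means = mean(mean(part) for part in partitions)

# Workaround (an assumption, not from the slide): combine (sum, count) pairs.
partials = [(sum(part), len(part)) for part in partitions]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean_from_partials = total / count

print(max_direct, max_combined)
print(mean_direct, mean_of_means, mean_from_partials)
```

A function is safe to run on the map side only if applying it to partial results and then to those results gives the same answer as applying it to all values at once.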
  • 32. Java (Ruby, Python, C++): good for programmers. Pig: a scripting language with a focus on dataflows; uses the Pig Latin language; allows merging, filtering, and applying functions. Hive: uses HiveQL, which is similar to SQL (used by Facebook); provides a database query interface. HBase.
  • 35. Data set:
        (0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
        (106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
        (212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
        (318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
        (424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
    Find the maximum temperature by year:
        1901 - 317
        1902 - 244
        1903 - 289
        1904 - 256
        1905 - 283
        ...
  • 36.
        #!/usr/bin/env bash
        for year in all/*
        do
          # Print the year (file name without .gz) followed by a tab.
          echo -ne `basename $year .gz`"\t"
          # Keep the largest valid reading; 9999 marks a missing value.
          gunzip -c $year | \
            awk '{ temp = substr($0, 88, 5) + 0;
                   q = substr($0, 93, 1);
                   if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
                 END { print max }'
        done

        % ./max_temperature.sh
        1901 317
        1902 244
        1903 289
        1904 256
        1905 283
        ...
  • 37. Run parts of the program in parallel: process different years in different processes. Problems: the pieces are not equal in size; combining partial results takes processing time; a single machine limits the processing; long processing time.
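The same maximum-temperature task can be expressed as a map step and a reduce step. The following Python sketch simulates the flow on one machine; it is not Hadoop code. The input uses a simplified, hypothetical "year temperature quality" line format instead of the fixed-width records, with values taken from the visible year and temperature fields of the five sample records on the data-set slide. The filtering mirrors the awk script (skip 9999 and quality codes outside 0, 1, 4, 5, 9).

```python
from collections import defaultdict

VALID_QUALITY = {"0", "1", "4", "5", "9"}
MISSING = 9999


def map_fn(line):
    """Emit (year, temperature) for each valid reading."""
    year, temp, quality = line.split()
    temp = int(temp)
    if temp != MISSING and quality in VALID_QUALITY:
        yield int(year), temp


def reduce_fn(year, temps):
    """Keep the maximum temperature seen for one year."""
    return year, max(temps)


def run_job(lines):
    # Shuffle: group all map output values by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce: one call per key, in sorted key order.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Simplified stand-ins for the five sample records (hypothetical format).
lines = ["1950 0 1", "1950 22 1", "1950 -11 1", "1949 111 1", "1949 78 1"]
result = run_job(lines)
print(result)
```

In a real job the grouping step is done by the framework across machines, so only the map and reduce functions need to be written by the user.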
  • 44. Hadoop: The Definitive Guide, 3rd Edition (http://it-ebooks.info/book/635/); Big Data: A Revolution That Will Transform How We Live, Work, and Think; http://hadoop.apache.org/; http://architects.dzone.com/articles/how-hadoop-mapreduce-works