Hadoop Architecture
Agenda
• The Hadoop daemons and their roles
• What a Hadoop cluster looks like
• Under the hood: how it writes a file
• Under the hood: how it reads a file
• Under the hood: how it replicates a file
• Under the hood: how it runs a job
• How to balance an unbalanced Hadoop cluster
Hadoop – A bit of background
• It’s an open source project
• Based on two technical papers published by Google (the Google File System and MapReduce)
• A well-known platform for distributed applications
• Easy to scale out
• Works well with commodity hardware (not entirely true)
• Very good for background applications
Hadoop Architecture
• Two primary components
  - Distributed File System (HDFS): handles file operations such as read, write, and delete
  - Map Reduce Engine: handles parallel computation
Hadoop Distributed File System
• Runs on top of existing file system
• A file is broken into blocks of a predefined, equal size, which are stored individually
• Designed to handle very large files
• Not good for a huge number of small files
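As a rough illustration of how a file maps onto blocks, the sketch below splits a file size into block sizes. The 64 MB block size and the name `split_into_blocks` are assumptions chosen for illustration, not values taken from these slides.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the list of block sizes (in bytes) for a file of file_size bytes.

    block_size is an assumed, configurable value; every block except possibly
    the last one has exactly this size.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds what is left of the file
        sizes.append(remainder)
    return sizes

if __name__ == "__main__":
    blocks = split_into_blocks(200 * MB)
    print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

Under the same assumption, a thousand 1 KB files still produce a thousand separate one-block entries, which hints at why a huge number of small files is a poor fit.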
Map Reduce Engine
• A Map Reduce Program consists of map and reduce functions
• A Map Reduce job is broken into tasks that run in parallel
• Prefers local processing if possible
Hadoop Server Roles
[Diagram: Hadoop server roles]
• Masters: Name Node (HDFS), Job Tracker (Map Reduce), Secondary Name Node
• Slaves: each machine runs a Data Node and a Task Tracker
• Clients
• Map Reduce: distributed data analytics
• HDFS: distributed data storage
Hadoop Cluster
[Diagram: Racks 1 to N, each with its own switch and several DN + TT (Data Node + Task Tracker) machines. The Name Node sits in Rack 1, the Job Tracker in Rack 2, the Secondary NN in Rack 3, and a Client in Rack 4. The rack switches connect through two upstream switches to the outside world.]
BRAD HEDLUND .com
Typical Workflow
• Load data into the cluster (HDFS writes)
• Analyze the data (Map Reduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
Sample scenario:
How many times did our customers type the word “Fraud” into emails sent to customer service?
File.txt: a huge file containing all emails sent to customer service
Writing files to HDFS
• Client consults Name Node
• Client writes block directly to one Data Node
• Data Nodes replicate the block
• Cycle repeats for next block
[Diagram: the Client tells the Name Node “I want to write Blocks A, B, C of File.txt”. The Name Node answers “OK. Write to Data Nodes 1, 5, 6”. The Client then sends Blk A, Blk B and Blk C of File.txt to the Data Nodes.]
Preparing HDFS writes
• Name Node picks two nodes in the same rack, one node in a different rack
• Data protection
• Locality for M/R
[Diagram: the Client tells the Name Node “I want to write File.txt Block A”. The rack-aware Name Node (Rack 1: Data Node 1; Rack 5: Data Node 5, Data Node 6) answers “OK. Write to Data Nodes 1, 5, 6”. The “Ready?” request travels from the Client to Data Node 1, then to Data Node 5 and Data Node 6, and each node answers “Ready!” back along the chain.]
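The placement rule on this slide (two nodes in the same rack, one node in a different rack) can be sketched as follows. This is only an illustration under stated assumptions: `pick_replica_nodes` and the `rack_map` structure are hypothetical names, and the random choice stands in for whatever selection logic the Name Node really applies.

```python
import random

def pick_replica_nodes(rack_map, rng=None):
    """Pick three Data Nodes: one in one rack, two in a different rack.

    rack_map maps a rack name to the list of Data Nodes in that rack
    (a hypothetical structure used only for this sketch).
    """
    rng = rng or random.Random()
    # Racks that are able to host the pair of replicas
    pair_racks = [rack for rack, nodes in rack_map.items() if len(nodes) >= 2]
    if not pair_racks:
        raise ValueError("no rack has two or more Data Nodes")
    pair_rack = rng.choice(pair_racks)
    other_racks = [rack for rack, nodes in rack_map.items()
                   if rack != pair_rack and nodes]
    if not other_racks:
        raise ValueError("a second, non-empty rack is required")
    single_node = rng.choice(rack_map[rng.choice(other_racks)])
    pair_nodes = rng.sample(rack_map[pair_rack], 2)
    return [single_node] + pair_nodes

if __name__ == "__main__":
    racks = {"Rack 1": ["DN1"], "Rack 5": ["DN5", "DN6"]}
    print(pick_replica_nodes(racks))
```

With the two racks from the diagram, the result always starts with DN1 and is followed by DN5 and DN6 in some order, so losing either rack still leaves at least one copy.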
Pipelined Write
• The first and second Data Nodes in the pipeline pass data along as it is received
• TCP 50010
[Diagram: the Client sends Blk A of File.txt to Data Node 1 (Rack 1), and the block is passed on to Data Node 5 and Data Node 6 (Rack 5). The Name Node is rack aware: Rack 1: Data Node 1; Rack 5: Data Node 5, Data Node 6.]
Pipelined Write
[Diagram: once Blk A has reached Data Node 1 (Rack 1), Data Node 2 and Data Node 3 (Rack 5), the Data Nodes report “Block received” and “Success” is returned. The Name Node records File.txt, Blk A: DN1, DN2, DN3.]
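A minimal in-memory sketch of the pipelined write, assuming the block is streamed in small chunks and that each Data Node keeps a chunk and passes it to the next node. The function name `pipelined_write`, the chunk size, and the returned metadata shape are all hypothetical; no real network or HDFS API is involved.

```python
def pipelined_write(block_id, data, pipeline, chunk_size=4):
    """Stream data through the pipeline and return {block_id: [nodes holding it]}."""
    stored = {node: bytearray() for node in pipeline}
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Each node stores the chunk, then hands it to the next node in the pipeline
        for node in pipeline:
            stored[node].extend(chunk)
    # "Block received": only nodes holding the complete block are recorded
    holders = [node for node in pipeline if bytes(stored[node]) == data]
    return {block_id: holders}

if __name__ == "__main__":
    metadata = pipelined_write("Blk A", b"emails about Fraud", ["DN1", "DN2", "DN3"])
    print(metadata)  # {'Blk A': ['DN1', 'DN2', 'DN3']}
```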
Multi-block Replication Pipeline
[Diagram: the Client writes Blk A, Blk B and Blk C of File.txt, each through its own pipeline of three Data Nodes spread over Racks 1, 4 and 5, so every block ends up stored three times.]
• 1 TB file = 3 TB storage, 3 TB network traffic
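The arithmetic behind the 1 TB example can be written out as a small helper. The name `replication_cost_tb` is hypothetical, and the sketch assumes every replica crosses the network exactly once, as in the pipeline shown on the previous slides.

```python
def replication_cost_tb(file_size_tb, replication_factor=3):
    """Return (storage, network traffic) in TB for writing one replicated file."""
    storage = file_size_tb * replication_factor
    # Each copy travels over the network once: client -> node 1 -> node 2 -> ...
    network = file_size_tb * replication_factor
    return storage, network

if __name__ == "__main__":
    print(replication_cost_tb(1))  # (3, 3)
```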
Client writes Span the HDFS Cluster
[Diagram: the Client writes File.txt into a cluster of Racks 1 to N, each with a switch and six Data Nodes; the blocks land on Data Nodes across the racks.]
Factors:
• Block size
• File size
More blocks = wider spread
Hadoop Rack Awareness – Why?
• Never lose all data if an entire rack fails
• Keep bulky flows in-rack when possible
• Assumption that in-rack is higher bandwidth, lower latency
[Diagram: Name Node metadata for File.txt: Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. The Name Node is rack aware (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7). Racks 1, 5 and 9 each sit behind their own switch, and the copies of blocks A, B and C are spread over them.]
Name Node
• Data Node sends Heartbeats
• Every 10th heartbeat is a Block report
• Name Node builds metadata from Block reports
• TCP – every 3 seconds
• If Name Node is down, HDFS is down
[Diagram: Data Nodes 1, 2, 3, ..., N each tell the Name Node “I’m alive! I have blocks: A, C”, and the Name Node replies “Awesome! Thanks.” Name Node metadata: file system File.txt = A, C; DN1: A, C; DN2: A, C; DN3: A, C.]
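The heartbeat and block report behaviour listed above can be sketched with a toy model. The class `NameNode`, the function `run_data_node`, and the way time is passed in are assumptions for illustration only; the 3 second interval and the "every 10th heartbeat" rule come from the slide.

```python
HEARTBEAT_INTERVAL_S = 3   # from the slide: a heartbeat every 3 seconds
BLOCK_REPORT_EVERY = 10    # from the slide: every 10th heartbeat is a block report

class NameNode:
    """Toy Name Node that builds its metadata purely from block reports."""

    def __init__(self):
        self.last_seen = {}        # node -> time of the last heartbeat
        self.block_locations = {}  # block -> set of nodes that reported it

    def heartbeat(self, node, now, block_report=None):
        self.last_seen[node] = now
        if block_report is not None:
            # Replace whatever this node reported earlier
            for holders in self.block_locations.values():
                holders.discard(node)
            for block in block_report:
                self.block_locations.setdefault(block, set()).add(node)

def run_data_node(name_node, node, blocks, beats):
    """Send `beats` heartbeats; every 10th one carries the full block list."""
    for n in range(1, beats + 1):
        report = blocks if n % BLOCK_REPORT_EVERY == 0 else None
        name_node.heartbeat(node, now=n * HEARTBEAT_INTERVAL_S, block_report=report)

if __name__ == "__main__":
    nn = NameNode()
    run_data_node(nn, "DN1", ["A", "C"], beats=10)
    print(nn.last_seen, nn.block_locations)
```

In this model the Name Node knows nothing about a node's blocks until the first block report arrives, which mirrors the statement that metadata is built from block reports.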
Re-replicating missing replicas
• Missing Heartbeats signify lost Nodes
• Name Node consults metadata, finds affected data
• Name Node consults Rack Awareness script
• Name Node tells a Data Node to re-replicate
[Diagram: a Data Node stops sending Heartbeats. The Name Node (“Uh Oh! Missing replicas”) checks its metadata (DN1: A, C; DN2: A, C; DN3: A, C) and its rack awareness data (Rack1: DN1, DN2; Rack5: DN3; Rack9: DN8), then tells a Data Node: “Copy blocks A, C to Node 8”.]
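The four steps above can be sketched as two small functions. Everything here is hypothetical scaffolding (`find_lost_nodes`, `plan_re_replication`, the `timeout_s` value, and the data structures); the slides do not say how long the Name Node waits before declaring a node lost or how it picks the copy source.

```python
def find_lost_nodes(last_seen, now, timeout_s):
    """Nodes whose last heartbeat is older than timeout_s (an assumed threshold)."""
    return {node for node, seen in last_seen.items() if now - seen > timeout_s}

def plan_re_replication(block_locations, lost_nodes, rack_of, target_replicas=3):
    """Return (block, source, target) copy instructions for under-replicated blocks."""
    plan = []
    for block, holders in sorted(block_locations.items()):
        alive = holders - lost_nodes
        missing = target_replicas - len(alive)
        if missing <= 0 or not alive:
            continue  # fully replicated, or no surviving copy to read from
        used_racks = {rack_of[node] for node in alive}
        candidates = sorted(node for node in rack_of
                            if node not in holders and node not in lost_nodes)
        # Prefer racks that do not hold a copy yet (False sorts before True)
        candidates.sort(key=lambda node: rack_of[node] in used_racks)
        source = sorted(alive)[0]
        for target in candidates[:missing]:
            plan.append((block, source, target))
    return plan

if __name__ == "__main__":
    rack_of = {"DN1": "Rack1", "DN2": "Rack1", "DN3": "Rack5", "DN8": "Rack9"}
    locations = {"A": {"DN1", "DN2", "DN3"}, "C": {"DN1", "DN2", "DN3"}}
    lost = find_lost_nodes({"DN1": 100, "DN2": 100, "DN3": 10}, now=103, timeout_s=30)
    print(plan_re_replication(locations, lost, rack_of))
    # [('A', 'DN1', 'DN8'), ('C', 'DN1', 'DN8')]
```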
Secondary Name Node
• Not a hot standby for the Name Node
• Connects to Name Node every hour*
• Housekeeping, backup of Name Node metadata
• Saved metadata can rebuild a failed Name Node
[Diagram: the Secondary Name Node tells the Name Node “It’s been an hour, give me your metadata”. Name Node metadata: file system File.txt = A, C.]
Client reading files from HDFS
• Client receives Data Node list for each block
• Client picks first Data Node for each block
• Client reads blocks sequentially
[Diagram: the Client asks the Name Node “Tell me the block locations of Results.txt” and receives “Blk A = 1,5,6; Blk B = 8,1,2; Blk C = 5,8,9”. Name Node metadata for Results.txt: Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. The Data Nodes sit in Rack 1 (Data Nodes 1, 2), Rack 5 (Data Nodes 5, 6) and Rack 9 (Data Nodes 8, 9).]
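The read path described above can be sketched with plain dictionaries. The function `read_file` and both data structures are hypothetical stand-ins for the Name Node answer and the Data Nodes; the block locations reuse the answer shown in the diagram.

```python
def read_file(block_locations, data_nodes):
    """Read blocks in order, taking each from the first Data Node listed for it.

    block_locations: ordered list of (block, [nodes]) as returned by the Name Node.
    data_nodes: node -> {block: bytes}, an in-memory stand-in for the Data Nodes.
    """
    content = []
    read_from = []
    for block, nodes in block_locations:
        first_node = nodes[0]  # the client picks the first node in the list
        content.append(data_nodes[first_node][block])
        read_from.append((block, first_node))
    return b"".join(content), read_from

if __name__ == "__main__":
    locations = [("A", ["DN1", "DN5", "DN6"]),
                 ("B", ["DN8", "DN1", "DN2"]),
                 ("C", ["DN5", "DN8", "DN9"])]
    nodes = {"DN1": {"A": b"Fraud "}, "DN8": {"B": b"= "}, "DN5": {"C": b"14"}}
    print(read_file(locations, nodes))
```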
Data Processing: Map
• Map: “Run this computation on your local data”
• Job Tracker delivers Java code to Nodes with local data
[Diagram: the Client asks the Job Tracker “How many times does “Fraud” appear in File.txt?”. Map Tasks run on Data Node 1 (Block A), Data Node 5 (Block B) and Data Node 9 (Block C), each with an instruction such as “Count “Fraud” in Block C”. Results: Fraud = 3 (A), Fraud = 0 (B), Fraud = 11 (C).]
What if data isn’t local?
• Job Tracker tries to select Node in same rack as data
• Name Node rack awareness
[Diagram: Data Node 1, which holds Block A, reports “no Map tasks left”, so the Map Task for Block A runs on Data Node 2 in the same rack (Rack 1). Data Node 2 says “I need block A” and gets it from Data Node 1. The Map Tasks for Blocks B and C run on Data Node 5 (Rack 5) and Data Node 9 (Rack 9), returning Fraud = 0 and Fraud = 11.]
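One way to picture the scheduling preference on these two slides is a three-step fallback: a node that holds the block, then a node in the same rack, then any node with a free slot. The function `choose_task_node` and its inputs are hypothetical; the real Job Tracker logic is not shown in the slides.

```python
def choose_task_node(block_holders, free_slots, rack_of):
    """Pick a node for a Map Task, preferring data locality.

    block_holders: nodes that store the block; free_slots: node -> free task slots;
    rack_of: node -> rack name. All three are assumed inputs for this sketch.
    """
    # 1. A node that already holds the block
    for node in block_holders:
        if free_slots.get(node, 0) > 0:
            return node, "node-local"
    # 2. A node in the same rack as one of the holders
    holder_racks = {rack_of[node] for node in block_holders}
    for node in sorted(free_slots):
        if free_slots[node] > 0 and rack_of.get(node) in holder_racks:
            return node, "rack-local"
    # 3. Any node with a free slot
    for node in sorted(free_slots):
        if free_slots[node] > 0:
            return node, "off-rack"
    return None, "no free slot"

if __name__ == "__main__":
    rack_of = {"DN1": "Rack 1", "DN2": "Rack 1", "DN5": "Rack 5", "DN9": "Rack 9"}
    free_slots = {"DN1": 0, "DN2": 1, "DN5": 1, "DN9": 1}
    print(choose_task_node(["DN1"], free_slots, rack_of))  # ('DN2', 'rack-local')
```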
Data Node reading files from HDFS
• Name Node provides rack-local Nodes first
• Leverage in-rack bandwidth, single hop
[Diagram: a Data Node in Rack 1 asks the Name Node “Tell me the locations of Block A of File.txt”. The rack-aware Name Node (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Node 5) answers “Block A = 1,5,6”, listing the rack-local Data Node 1 first. Name Node metadata for File.txt: Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9.]
Data Processing: Reduce
• Reduce: “Run this computation across Map results”
• Map Tasks deliver output data over the network
• Reduce Task data output written to and read from HDFS
[Diagram: the Map Tasks on Data Node 1 (Block A), Data Node 5 (Block B) and Data Node 9 (Block C) send their outputs (X, Y, Z) to a Reduce Task on Data Node 3 that runs Sum “Fraud”. The result, Fraud = 14, is written to Results.txt in HDFS, where the Client can read it.]
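The “Fraud” example from the Map and Reduce slides can be reproduced in a few lines. This is a single-process sketch, not the Hadoop API: `map_count`, `reduce_sum` and the block contents are made up for illustration, chosen so that the per-block counts match the 3, 0 and 11 shown in the diagrams.

```python
import re
from functools import reduce

def map_count(block_text, word="Fraud"):
    """Map step: count whole-word, case-sensitive occurrences in one block."""
    return len(re.findall(r"\b" + re.escape(word) + r"\b", block_text))

def reduce_sum(partial_counts):
    """Reduce step: combine the per-block counts into one total."""
    return reduce(lambda total, count: total + count, partial_counts, 0)

if __name__ == "__main__":
    # Hypothetical block contents standing in for Blocks A, B and C of File.txt
    blocks = {
        "A": "Fraud alert. Possible Fraud? Fraud!",
        "B": "Thanks for the quick refund.",
        "C": "Fraud " * 11,
    }
    partial = {name: map_count(text) for name, text in blocks.items()}
    print(partial)                       # {'A': 3, 'B': 0, 'C': 11}
    print(reduce_sum(partial.values()))  # 14
```

In a real job each map call would run on the node holding the block, and only the small per-block counts would travel over the network to the reducer.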
Unbalanced Cluster
• Hadoop prefers local processing if possible
• New servers underutilized for Map Reduce, HDFS*
• Might see more network bandwidth, slower job times**
* New Data Node: “I’m bored!”
** New Data Node: “I was assigned a Map Task but don’t have the block. Guess I need to get it.”
[Diagram: Rack 1 and Rack 2 hold the blocks of File.txt; two new racks, each with a new switch and six Data Nodes, have been added to the cluster.]
Cluster Balancing
• Balancer utility (if used) runs in the background
• Does not interfere with Map Reduce or HDFS
• Default speed limit 1 MB/s
brad@cloudera-1:~$ hadoop balancer
[Diagram: the same cluster of Rack 1, Rack 2 and two new racks, with blocks of File.txt being moved onto the Data Nodes in the new racks.]
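To get a feel for what a 1 MB/s speed limit means, the sketch below estimates how long moving a given amount of data would take. It assumes a single transfer stream capped at the limit; how the limit applies across several nodes is not modelled, and `balancing_time_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def balancing_time_days(data_to_move_mb, speed_limit_mb_per_s=1.0):
    """Days needed to move data_to_move_mb at a fixed speed limit."""
    if speed_limit_mb_per_s <= 0:
        raise ValueError("speed limit must be positive")
    return data_to_move_mb / speed_limit_mb_per_s / SECONDS_PER_DAY

if __name__ == "__main__":
    one_tb_in_mb = 1024 * 1024
    print(round(balancing_time_days(one_tb_in_mb), 2))  # 12.14
```

Under this assumption, moving 1 TB takes roughly twelve days, which is why the balancer can run in the background without disturbing other traffic.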
Quiz
• If you write a 1 TB file into HDFS with a replication factor of 2, how much storage does HDFS actually need for this file?
• True/False? Even if the Name Node goes down, I will still be able to read files from HDFS.
Quiz
• True/False? In a Hadoop cluster, we can have a secondary Job Tracker to enhance fault tolerance.
• True/False? If the Job Tracker goes down, you will not be able to write any file into HDFS.
Quiz
• True/False? The Name Node stores the actual data itself.
• True/False? The Name Node can be rebuilt using the Secondary Name Node.
• True/False? If a Data Node goes down, Hadoop takes care of re-replicating the affected data blocks.
Quiz
• In which scenario does one Data Node try to read data from another Data Node?
• What are the benefits of the Name Node’s rack awareness?
• True/False? HDFS is well suited for applications that write a huge number of small files.
Quiz
• True/False? Hadoop takes care of balancing the cluster automatically.
• True/False? The output of Map tasks is written to an HDFS file.
• True/False? The output of Reduce tasks is written to an HDFS file.
Quiz
• True/False? In a production cluster, commodity hardware can be used to set up the Name Node.
• Thank You

Contenu connexe

Tendances

Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemSteve Loughran
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basicHafizur Rahman
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopRan Ziv
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentalsits_skm
 

Tendances (18)

HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
HDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed FilesystemHDFS: Hadoop Distributed Filesystem
HDFS: Hadoop Distributed Filesystem
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
 

Similaire à Hadoop architecture meetup

Understanding Hadoop Clusters and the Network
Understanding Hadoop Clusters and the NetworkUnderstanding Hadoop Clusters and the Network
Understanding Hadoop Clusters and the Networkbradhedlund
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle CoherenceBen Stopford
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Flink Forward
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDeltares
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 

Similaire à Hadoop architecture meetup (20)

Understanding Hadoop Clusters and the Network
Understanding Hadoop Clusters and the NetworkUnderstanding Hadoop Clusters and the Network
Understanding Hadoop Clusters and the Network
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Hadoop fundamentals
Hadoop fundamentalsHadoop fundamentals
Hadoop fundamentals
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Anatomy of file write in hadoop
Anatomy of file write in hadoopAnatomy of file write in hadoop
Anatomy of file write in hadoop
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
 
Os Gottfrid
Os GottfridOs Gottfrid
Os Gottfrid
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De Boer
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 

Dernier

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Dernier (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Hadoop architecture meetup

  • 2. Agenda • Different Hadoop daemons & its roles • How does a Hadoop cluster look like • Under the Hood:- How does it write a file • Under the Hood:- How does it read a file • Under the Hood:- How does it replicate the file • Under the Hood:- How does it run a job • How to balance an un-balanced hadoop cluster
  • 3. Hadoop – A bit of background • It’s an open source project • Based on 2 technical papers published by Google • A well known platform for distributed applications • Easy to scale-out • Works well with commodity hard wares(not entirely true) • Very good for background applications
  • 4. Hadoop Architecture • Two Primary components  Distributed File System (HDFS): It deals with file operations like read, write, delete & etc  Map Reduce Engine: It deals with parallel computation
  • 5. Hadoop Distributed File System • Runs on top of existing file system • A file broken into pre-defined equal sized blocks & stored individually • Designed to handle very large files • Not good for huge number of small files
  • 6. Map Reduce Engine • A Map Reduce Program consists of map and reduce functions • A Map Reduce job is broken into tasks that run in parallel • Prefers local processing if possible
  • 7. Hadoop Server Roles • Diagram: Clients talk to the masters: Job Tracker (Map Reduce, distributed data analytics), Name Node and Secondary Name Node (HDFS, distributed data storage) • Slaves: six nodes, each running a Data Node & Task Tracker
  • 8. Hadoop Cluster • Diagram: Racks 1 to N of DN + TT nodes, each rack behind its own switch; the Name Node sits in Rack 1, the Job Tracker in Rack 2, the Secondary NN in Rack 3 and a Client in Rack 4; the rack switches uplink to two switches that connect to the World BRAD HEDLUND .com
  • 9. Typical Workflow • Load data into the cluster (HDFS writes) • Analyze the data (Map Reduce) • Store results in the cluster (HDFS writes) • Read the results from the cluster (HDFS reads) • Sample scenario: File.txt is a huge file containing all emails sent to customer service. How many times did our customers type the word “Fraud” into emails sent to customer service?
  • 10. Writing files to HDFS • Client consults Name Node • Client writes block directly to one Data Node • Data Nodes replicate the block • Cycle repeats for next block • Diagram: the Client tells the Name Node “I want to write Blocks A,B,C of File.txt”; the Name Node answers “OK. Write to Data Nodes 1,5,6”
  • 11. Preparing HDFS writes • Name Node picks two nodes in the same rack, one node in a different rack • Data protection • Locality for M/R • Diagram: the Client tells the Name Node “I want to write File.txt Block A”; the rack-aware Name Node (Rack 1: Data Node 1; Rack 5: Data Node 5, Data Node 6) answers “OK. Write to Data Nodes 1,5,6”; the checks “Ready Data Nodes 5,6?” and “Ready Data Node 6?” travel down the pipeline and each “Ready!” is passed back
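The placement rule on slide 11 (one replica in one rack, the other two together in a different rack) can be sketched as follows. This is a simplified, deterministic illustration and not the Name Node's actual placement policy: `place_replicas`, the topology dictionary and the node names are all assumptions made for the example.

```python
def place_replicas(topology: dict[str, list[str]], first_rack: str) -> list[str]:
    """Pick three Data Nodes: one in first_rack, two in another rack.

    Hypothetical sketch; a real Name Node also weighs load and free space.
    """
    if not topology.get(first_rack):
        raise ValueError("first rack has no Data Nodes")
    first_node = topology[first_rack][0]
    for rack, nodes in topology.items():
        # The remaining two replicas share one rack that differs from the first.
        if rack != first_rack and len(nodes) >= 2:
            return [first_node, nodes[0], nodes[1]]
    raise ValueError("no other rack has two Data Nodes available")


topology = {"rack1": ["dn1", "dn2"], "rack5": ["dn5", "dn6", "dn7"]}
print(place_replicas(topology, "rack1"))
```

With the slide's layout this yields Data Nodes 1, 5 and 6: losing either rack still leaves at least one copy, and two of the three copies are one in-rack hop apart.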
  • 12. Pipelined Write • Data Nodes 1 & 2 pass data along as it is received • TCP 50010 • Diagram: Block A of File.txt is copied along the pipeline Data Node 1 (Rack 1), Data Node 5, Data Node 6 (Rack 5)
  • 13. Pipelined Write • Diagram: Data Node 1 (Rack 1), Data Node 2 and Data Node 3 (Rack 5) each hold Block A; “Block received” and “Success” are reported back, and the Name Node records File.txt Blk A: DN1, DN2, DN3
  • 14. Multi-block Replication Pipeline • 1TB File = 3TB storage, 3TB network traffic • Diagram: the Client writes Blocks A, B and C of File.txt; each block is pipelined to three Data Nodes spread over Racks 1, 4 and 5
  • 15. Client writes Span the HDFS Cluster • Factors: block size, file size • More blocks = wider spread • Diagram: File.txt written by the Client is spread across the Data Nodes of Racks 1 to N
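The "1TB File = 3TB storage" figure on slide 14 is simply the file size multiplied by the replication factor. A minimal sketch of that arithmetic, assuming a hypothetical helper `replicated_footprint_tb` and a default replication factor of 3 as used on the slide:

```python
def replicated_footprint_tb(file_size_tb: float, replication: int = 3) -> float:
    """Raw cluster storage consumed by a file once every block is replicated.

    Hypothetical helper; the slide applies the same multiplier to the
    network traffic generated by the write.
    """
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return file_size_tb * replication


print(replicated_footprint_tb(1))  # the slide's 1 TB example with 3 replicas
```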
  • 16. Hadoop Rack Awareness – Why? • Never lose all data if an entire rack fails • Keep bulky flows in-rack when possible • Assumption that in-rack is higher bandwidth, lower latency • Name Node metadata: File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9 • Rack awareness data: Rack 1: Data Node 1, Data Node 2, Data Node 3; Rack 5: Data Node 5, Data Node 6, Data Node 7 • Diagram: Racks 1, 5 and 9, each behind its own switch, hold the replicas of Blocks A, B and C
  • 17. Name Node • Data Node sends Heartbeats • Every 10th heartbeat is a Block report • Name Node builds metadata from Block reports • TCP, every 3 seconds • If Name Node is down, HDFS is down • Diagram: a Data Node reports “I’m alive! I have blocks: A, C”; the Name Node replies “Awesome! Thanks.” and keeps the metadata DN1: A,C; DN2: A,C; DN3: A,C next to the file system entry File.txt = A,C
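The timing on slide 17 (a heartbeat every 3 seconds, every 10th one carrying a block report) can be illustrated with a small schedule generator. This is a sketch only: `heartbeat_schedule` and its parameters are hypothetical names, and the defaults simply restate the slide's numbers.

```python
def heartbeat_schedule(duration_s: int, interval_s: int = 3,
                       report_every: int = 10) -> list[tuple[int, str]]:
    """List (time_in_seconds, message_kind) for one Data Node.

    Hypothetical helper; defaults mirror the slide (3 s, every 10th).
    """
    events = []
    for n, t in enumerate(range(interval_s, duration_s + 1, interval_s), start=1):
        # Every report_every-th heartbeat doubles as a block report.
        kind = "block_report" if n % report_every == 0 else "heartbeat"
        events.append((t, kind))
    return events


for t, kind in heartbeat_schedule(60):
    if kind == "block_report":
        print(f"t={t}s: block report")
```

Under these numbers a Data Node sends a block report every 30 seconds.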
  • 18. Re-replicating missing replicas • Missing Heartbeats signify lost Nodes • Name Node consults metadata, finds affected data • Name Node consults Rack Awareness script • Name Node tells a Data Node to re-replicate • Diagram: Name Node metadata DN1: A,C; DN2: A,C; DN3: A,C; rack awareness Rack1: DN1, DN2; Rack5: DN3; Rack9: DN8; the Name Node notices “Uh Oh! Missing replicas” and orders “Copy blocks A,C to Node 8”
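The check behind slide 18 can be sketched as a lookup over the block metadata: for every block, count the replicas that sit on nodes still sending heartbeats and report the shortfall. The function name `under_replicated` and the data layout are assumptions for illustration; choosing the target node (the rack awareness step) is left out.

```python
def under_replicated(block_locations: dict[str, list[str]],
                     live_nodes: set[str],
                     replication: int = 3) -> dict[str, int]:
    """Map each affected block to the number of replicas it is missing.

    Hypothetical sketch of the metadata check, not Name Node code.
    """
    shortfall = {}
    for block, nodes in block_locations.items():
        alive = [node for node in nodes if node in live_nodes]
        if len(alive) < replication:
            shortfall[block] = replication - len(alive)
    return shortfall


# Slide scenario: blocks A and C live on DN1, DN2, DN3 and DN3 stops responding.
locations = {"A": ["dn1", "dn2", "dn3"], "C": ["dn1", "dn2", "dn3"]}
print(under_replicated(locations, live_nodes={"dn1", "dn2", "dn8"}))
```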
  • 19. Secondary Name Node • Not a hot standby for the Name Node • Connects to Name Node every hour* • Housekeeping, backup of Name Node metadata • Saved metadata can rebuild a failed Name Node • Diagram: the Secondary Name Node tells the Name Node “It’s been an hour, give me your metadata”
  • 20. Client reading files from HDFS • Client receives Data Node list for each block • Client picks first Data Node for each block • Client reads blocks sequentially • Diagram: the Client asks the Name Node “Tell me the block locations of Results.txt”; the Name Node answers “Blk A = 1,5,6 Blk B = 8,1,2 Blk C = 5,8,9” from its metadata Results.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9
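The read path on slide 20 reduces to "take the first Data Node in each block's list, in block order". A minimal sketch under that reading, using the metadata list shown on the slide; `read_plan` is a hypothetical name and no actual data transfer is modelled.

```python
def read_plan(block_locations: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (block, data_node) pairs in the order the client would read them.

    Hypothetical helper: the client takes the first node offered per block.
    """
    return [(block, nodes[0]) for block, nodes in block_locations.items()]


results_txt = {
    "A": ["DN1", "DN5", "DN6"],
    "B": ["DN7", "DN1", "DN2"],
    "C": ["DN5", "DN8", "DN9"],
}
for block, node in read_plan(results_txt):
    print(f"read block {block} from {node}")
```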
  • 21. Data Processing: Map • Map: “Run this computation on your local data” • Job Tracker delivers Java code to Nodes with local data • Diagram: the Client asks how many times “Fraud” appears in File.txt; the Job Tracker starts Map Tasks on Data Nodes 1, 5 and 9, which hold Blocks A, B and C (for example, count “Fraud” in Block C); the tasks report Fraud = 3, Fraud = 0 and Fraud = 11
  • 22. What if data isn’t local? • Job Tracker tries to select Node in same rack as data • Name Node rack awareness • Diagram: the node holding Block A reports “no Map tasks left”, so the Map Task for Block A runs on Data Node 2 in Rack 1, which says “I need block A”; Blocks B and C are still processed locally (Fraud = 0, Fraud = 11)
  • 23. Data Node reading files from HDFS • Name Node provides rack local Nodes first • Leverage in-rack bandwidth, single hop • Diagram: a Data Node asks the Name Node “Tell me the locations of Block A of File.txt”; the rack-aware Name Node answers “Block A = 1,5,6” from its metadata File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9 and its rack data Rack 1: Data Node 1, Data Node 2, Data Node 3; Rack 5: Data Node 5
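"Rack local Nodes first" on slide 23 amounts to ordering a block's replica list so that nodes in the reader's own rack come before the rest. A minimal sketch, assuming a hypothetical `order_by_locality` helper and a node-to-rack lookup table; the real ordering logic lives inside the Name Node and may consider more than rack membership.

```python
def order_by_locality(replicas: list[str], node_rack: dict[str, str],
                      reader_rack: str) -> list[str]:
    """Return replicas with same-rack nodes first, otherwise keeping order.

    Hypothetical helper; sorted() is stable, so ties keep their original order.
    """
    # False sorts before True, so same-rack nodes (key False) come first.
    return sorted(replicas, key=lambda node: node_rack[node] != reader_rack)


node_rack = {"DN1": "rack1", "DN5": "rack5", "DN6": "rack5"}
print(order_by_locality(["DN1", "DN5", "DN6"], node_rack, reader_rack="rack5"))
```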
  • 24. Data Processing: Reduce • Reduce: “Run this computation across Map results” • Map Tasks deliver output data over the network • Reduce Task data output written to and read from HDFS • Diagram: the Map Tasks for Blocks A, B and C on Data Nodes 1, 5 and 9 send their counts to a Reduce Task that sums “Fraud” and writes Results.txt (Fraud = 14) to HDFS
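The "Fraud" example from slides 21 and 24 can be mimicked in plain Python: one map call per block counts the word locally, and a single reduce call sums the per-block counts. This is a toy sketch and not the Hadoop Map Reduce API; `map_count` and `reduce_sum` are hypothetical names, and matching whole whitespace-separated words is a simplification.

```python
def map_count(block_text: str, word: str = "Fraud") -> int:
    """Map step: count exact occurrences of word in one block of text."""
    return block_text.split().count(word)


def reduce_sum(counts: list[int]) -> int:
    """Reduce step: combine the per-block counts into one total."""
    return sum(counts)


blocks = ["Fraud alert Fraud again Fraud", "nothing to see", "possible Fraud"]
per_block = [map_count(block) for block in blocks]  # one Map Task per block
print(per_block, reduce_sum(per_block))
```

With the slide's own per-block results of 3, 0 and 11, the reduce step gives the 14 written to Results.txt.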
  • 25. Unbalanced Cluster • Hadoop prefers local processing if possible • New servers underutilized for Map Reduce, HDFS* • Might see more network bandwidth, slower job times** • *“I’m bored!” • **“I was assigned a Map Task but don’t have the block. Guess I need to get it.” • Diagram: two new racks of empty Data Nodes are added next to Racks 1 and 2, which hold File.txt
  • 26. Cluster Balancing • Balancer utility (if used) runs in the background • Does not interfere with Map Reduce or HDFS • Default speed limit 1 MB/s • brad@cloudera-1:~$ hadoop balancer • Diagram: blocks of File.txt move from Racks 1 and 2 onto the new racks
  • 27. Quiz • If you had written a file of size 1TB into HDFS with replication factor 2, what is the actual storage required by HDFS to store this file? • True/False? Even if the Name Node goes down, I will still be able to read files from HDFS.
  • 28. Quiz • True/False? In a Hadoop cluster, we can have a secondary Job Tracker to enhance the fault tolerance. • True/False? If the Job Tracker goes down, you will not be able to write any file into HDFS.
  • 29. Quiz • True/False? Name node stores the actual data itself. • True/False? Name node can be re-built using the secondary name node. • True/False? If a data node goes down, Hadoop takes care of re-replicating the affected data block.
  • 30. Quiz • In which scenario does one data node try to read data from another data node? • What are the benefits of the Name Node’s rack awareness? • True/False? HDFS is well suited for applications which write a huge number of small files.
  • 31. Quiz • True/False? Hadoop takes care of balancing the cluster automatically. • True/False? The output of Map tasks is written to an HDFS file. • True/False? The output of Reduce tasks is written to an HDFS file.
  • 32. Quiz • True/False? In a production cluster, commodity hardware can be used to set up the Name Node. • Thank You