1
Introduction to HDFS
By: Siddharth Mathur
Instructor: Dr. Shiyong Lu
2
Big Data
Wikipedia Definition:
In information technology, big data is a loosely defined
term used to describe data sets so large
and complex that they become awkward to work
with using on-hand database management tools.
3
How Big is Big Data?
2008: Google processed 20 PB a day
2009: Facebook had 2.5 PB user data + 15 TB/day
2009: eBay had 6.5 PB user data + 50 TB/day
2011: Yahoo! had 180-200 PB of data
2012: Facebook ingests 500 TB/day
4
HOW TO ANALYZE THIS DATA?
5
Divide and Conquer
Partition
Combine
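The partition-and-combine idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: word counting stands in for the analysis, threads stand in for worker machines, and every function name here is hypothetical rather than part of Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """Work done independently on one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def combine(partials):
    """Merge the per-partition results into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_word_count(lines, n_workers=4):
    """Partition the input, process the chunks concurrently, combine the results."""
    chunks = partition(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return combine(partials)
```

Each worker sees only its own chunk, so the combine step is the only place where results from different partitions meet.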
6
But Parallel Processing is complicated
How do we assign tasks to workers?
What if we have more tasks than slots?
What happens when tasks fail?
How do you handle distributed synchronization?
7
The Solution!
Google File System
MapReduce
BigTable
8
GFS to HDFS
It started when Google researchers published a
paper on a distributed file system designed to address
the storage and analysis problems of Big Data.
The researchers proposed a file system named
Google File System (GFS), which in turn gave birth to the
Hadoop Distributed File System (HDFS).
The paper on MapReduce resulted in the
MapReduce programming model.
The paper on BigTable led to Hadoop
HBase, a BigTable-style data store built over HDFS.
9
HADOOP DISTRIBUTED FILE SYSTEM
10
Key Features
Accessible
Hadoop runs on large clusters of commodity machines or on
cloud computing services such as Amazon's Elastic Compute
Cloud (EC2).
Robust
Because Hadoop is intended to run on commodity hardware, it is
architected with the assumption of frequent hardware
malfunctions. It can gracefully handle most such failures.
Scalable
Hadoop scales linearly to handle larger data by adding more
nodes to the cluster.
Simple
Hadoop allows users to quickly write efficient parallel code.
11
HDFS Scaling Out
Performs a task in 45 minutes
Performs a task in ~45/4 minutes
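The "~45/4" figure can be worked out directly. The sketch below assumes four nodes (implied by the divisor) and perfectly linear scaling; real jobs carry coordination overhead, so this is a lower bound rather than a prediction. The function name is illustrative.

```python
def ideal_runtime_minutes(single_node_minutes, n_nodes):
    """Runtime under ideal linear scaling: the work divides evenly across nodes."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return single_node_minutes / n_nodes


# 45 minutes of work spread over 4 nodes
print(ideal_runtime_minutes(45, 4))  # 11.25
```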
12
Basic Hadoop Stack
Hadoop Distributed File System
MapReduce
HBase
Higher Level Languages
13
Hadoop Platforms
Platforms: Unix and Windows.
Linux: the only supported production platform.
Other variants of Unix, like Mac OS X: run Hadoop for
development.
Windows + Cygwin: development platform (openssh)
Java 6
Java 1.6.x (aka 6.0.x aka 6) is recommended for
running Hadoop.
14
Hadoop Modes
• Standalone (or local) mode
– There are no daemons running and everything runs in
a single JVM. Standalone mode is suitable for running
MapReduce programs during development, since it is
easy to test and debug them.
• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, thus
simulating a cluster on a small scale.
• Fully distributed mode
– The Hadoop daemons run on a cluster of machines.
15
Master-Slave Architecture
Namenode
Jobtracker
Datanode
Tasktracker
Secondary Namenode
16
Master-Slave Architecture
HDFS has a master-slave architecture.
The master node, which runs the name node, governs the cluster.
It takes care of task and resource allocation.
It stores all the metadata related to file splitting, block
storage, block replication and task execution status.
The slave nodes, which run the data nodes, store
all the data blocks and execute tasks.
The Tasktracker is the program that runs on each individual
slave node and monitors task execution on that
node.
The Jobtracker runs on the master node and monitors
execution of the complete job.
17
HDFS File Distribution
File metadata
FILE-A -> 1,2,3 (split into 3 blocks)
FILE-B -> 4,5 (split into 2 blocks)
Replication factor = 3 (property "dfs.replication" in hdfs-site.xml)
[Diagram: blocks 1 to 5, each stored as three replicas spread across the data nodes]
18
HDFS File Distribution
Name node stores metadata related to:
File split
Block allocation
Task allocation
Each file is split into data blocks. The default block size is
64 MB.
Each data block is replicated on different data
nodes. The replication factor is configurable.
The default value is 3.
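A small worked example of the arithmetic, using the 64 MB default block size and replication factor of 3 stated above. The function names are illustrative and not part of any Hadoop API; the sketch also assumes that a final partial block occupies only its actual size on disk.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks a file is split into (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def raw_storage_bytes(file_size_bytes, replication=3):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size_bytes * replication


size = 200 * MB
print(block_count(size))              # 4 blocks: 64 + 64 + 64 + 8 MB
print(raw_storage_bytes(size) // MB)  # 600 MB of raw storage
```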
19
Block Placement
Current Strategy
-- One replica on local node
-- Second replica on a remote rack
-- Third replica on same remote rack
-- Additional replicas are randomly placed
Clients read from nearest replica
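The placement strategy and the nearest-replica read can be sketched as follows. This is a simplified model, not the NameNode's actual code: the topology dictionary, function names, and the three-level distance (same node, same rack, other rack) are assumptions made for illustration, and it requires a remote rack with at least two nodes.

```python
import random


def place_replicas(writer, topology, replication=3, rng=None):
    """Pick nodes for one block: writer's node, then two nodes on one remote rack,
    then random nodes for any additional replicas.

    topology maps a rack name to the list of nodes on that rack.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if writer not in rack_of:
        raise ValueError("writer is not in the topology")
    chosen = [writer]
    if replication >= 2:
        remote_racks = [r for r in topology
                        if r != rack_of[writer] and len(topology[r]) >= 2]
        if not remote_racks:
            raise ValueError("need a remote rack with at least two nodes")
        remote = rng.choice(remote_racks)
        chosen += rng.sample(topology[remote], min(2, replication - 1))
    remaining = [n for n in rack_of if n not in chosen]
    extra = replication - len(chosen)
    if extra > len(remaining):
        raise ValueError("not enough nodes for the requested replication")
    chosen += rng.sample(remaining, extra)
    return chosen


def nearest_replica(reader, replicas, topology):
    """Prefer a replica on the reader's node, then on its rack, then anywhere."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    def distance(node):
        if node == reader:
            return 0
        if rack_of[node] == rack_of[reader]:
            return 1
        return 2

    return min(replicas, key=distance)
```

With three racks of four data nodes each, a block written from DN 1 gets one replica on DN 1 and two on a single other rack, which matches the shape of the "DN 1, 5, 6" layout used in the rack-awareness slides.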
20
Rack awareness
[Diagram: FILE X is split into data block A and data block B; DN 1 to DN 12 sit on three racks, each rack behind its own switch]
NameNode metadata:
File X = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11
Rack 1 = DN 1, 2, 3, 4
Rack 2 = DN 5, 6, 7, 8
Rack 3 = DN 9, 10, 11, 12
21
Rack awareness
HDFS is aware of which rack each data
node is placed on.
To prevent data loss due to a complete rack
failure, Hadoop intelligently replicates each data
block onto other racks as well.
This helps HDFS recover the data even if
a complete rack of data nodes shuts down.
This information is stored in the name node.
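The benefit can be checked on the example layout from the rack-awareness diagram (block A on DN 1, 5, 6; block B on DN 7, 10, 11). The sketch below is illustrative only; the helper names are hypothetical.

```python
RACKS = {
    "Rack 1": {1, 2, 3, 4},
    "Rack 2": {5, 6, 7, 8},
    "Rack 3": {9, 10, 11, 12},
}
BLOCKS = {"A": {1, 5, 6}, "B": {7, 10, 11}}


def surviving_replicas(block_nodes, failed_rack, racks):
    """Replicas still reachable after every node on failed_rack goes down."""
    return block_nodes - racks[failed_rack]


def survives_any_single_rack_failure(block_nodes, racks):
    """True if at least one replica remains no matter which single rack fails."""
    return all(surviving_replicas(block_nodes, rack, racks) for rack in racks)


for name, nodes in BLOCKS.items():
    print(name, survives_any_single_rack_failure(nodes, RACKS))
```

Both blocks survive the loss of any one rack, whereas a block whose three replicas all sat on Rack 2 would not.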
22
File Write in Hadoop
[Diagram: a client, the NameNode, and DN 1 to DN 12 on Racks 1 to 3, each rack behind its own switch]
Client: File.txt is broken down into blocks [A, B, C] using the Hadoop client API
NameNode metadata: File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in…
Intelligent storage of data: first block in one rack, next blocks in a different rack
Arrows: Heartbeat, Request, Response, MetaData Creation, Block A Write
23
File Write in Hadoop
The HDFS client asks the name node to
write a file to HDFS.
It also provides the file size and other metadata
to the name node.
Meanwhile, each slave node sends a heartbeat
signal to the name node to report its status.
24
File Write in Hadoop
The name node tells the client where to
store the data blocks.
It also tells the data nodes to get ready for the
write.
After the write is complete, the
data node sends a success message to both the
client and the name node.
25
File Read in Hadoop
[Diagram: a client, the NameNode, and DN 1 to DN 12 on Racks 1 to 3, each rack behind its own switch]
NameNode metadata: File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in…
Labels: An ordered list of nodes; Heartbeat; Request; Response
26
Re-replicating missing replicas
27
Re-replication
Missing Heartbeats signify lost Nodes
Name Node consults metadata, finds affected
data
Name Node consults Rack Awareness script
Name Node tells the Data node to re-replicate
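The first two steps (detecting lost nodes from missing heartbeats, then finding the affected blocks) can be sketched as below. The timeout value, the data structures, and the function names are assumptions for illustration; the real name node also consults rack placement before choosing re-replication targets, which this sketch leaves out.

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds):
    """Nodes whose most recent heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen > timeout_seconds}


def under_replicated(block_locations, dead_nodes, replication=3):
    """Map each affected block to (live replicas, number of copies to recreate)."""
    affected = {}
    for block, nodes in block_locations.items():
        live = set(nodes) - dead_nodes
        if len(live) < replication:
            affected[block] = (live, replication - len(live))
    return affected


# Hypothetical state: DN 6 stopped reporting at t=40, the rest reported at t=100.
heartbeats = {1: 100, 5: 100, 6: 40, 7: 100, 10: 100, 11: 100}
blocks = {"A": {1, 5, 6}, "B": {7, 10, 11}}

dead = find_dead_nodes(heartbeats, now=105, timeout_seconds=30)
print(under_replicated(blocks, dead))  # {'A': ({1, 5}, 1)}
```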
28
3 main configuration files
core-site.xml
Contains configuration information that overrides the
default core Hadoop properties
mapred-site.xml
Contains configuration information that overrides the
default MapReduce properties
Also defines the host and port that the MapReduce job
tracker runs at
hdfs-site.xml
Mainly used to set the block replication factor
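Hadoop site files share one layout: a configuration element containing property entries, each with a name and a value. As a hedged illustration, the sketch below generates a minimal hdfs-site.xml body that sets dfs.replication to 3; the helper name build_site_xml is hypothetical, and a real file would normally be written by hand or by a deployment tool.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a dict of property names and values in the Hadoop site-file layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Override the default replication factor for the cluster.
hdfs_site = build_site_xml({"dfs.replication": 3})
print(hdfs_site)
```

Only properties that differ from the defaults need to appear in a site file.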
29
Anatomy of a Job Launch
30
Job Status updates
31
Limitations of Hadoop -1
Scalability
Maximum Cluster size – 4,000 nodes for best
performance
Maximum Concurrent tasks- 40,000
Name Node as a single point of failure
Failure kills all running and queued jobs
Jobs need to be re-submitted by the user
Re-Start ability
Restart is very tricky due to complex state
32
Who has the biggest cluster setups
Facebook 400
Microsoft 400
LinkedIn 4100
Yahoo 42,000
33
References
http://hadoop.apache.org/
http://research.google.com/archive/mapreduce.html
http://research.google.com/archive/gfs.html
http://research.google.com/archive/bigtable.html
http://hbase.apache.org/
http://wiki.apache.org/hadoop/FAQ
http://matt-wand.utsacademics.info/webUTSdiscns/HadoopNotes.pdf
34
THANK YOU
