SlideShare a Scribd company logo
1 of 29
Download to read offline
Hadoop Distributed File System

                Dhruba Borthakur
   Apache Hadoop Project Management Committee
              dhruba@apache.org
             dhruba@facebook.com
Hadoop, Why?
•  Need to process huge datasets on large clusters
   of computers
•  Nodes fail every day
   – Failure is expected, rather than exceptional.
   – The number of nodes in a cluster is not constant.
•  Very expensive to build reliability into each
   application.
•  Need common infrastructure
   – Efficient, reliable, easy to use
   – Open Source, Apache License
Hadoop History
•    Dec 2004 – Google paper published
•    July 2005 – Nutch uses new MapReduce implementation
•    Jan 2006 – Doug Cutting joins Yahoo!
•    Feb 2006 – Hadoop becomes a new Lucene subproject
•    Apr 2007 – Yahoo! running Hadoop on 1000-node cluster
•    Jan 2008 – An Apache Top Level Project
•    Feb 2008 – Yahoo! production search index with Hadoop
•    July 2008 – First reports of a 4000-node cluster
Who uses Hadoop?
•    Amazon/A9
•    Facebook
•    Google
•    IBM
•    Joost
•    Last.fm
•    New York Times
•    PowerSet
•    Veoh
•    Yahoo!
What is Hadoop used for?
•  Search
   –  Yahoo, Amazon, Zvents,
•  Log processing
   –  Facebook, Yahoo, ContextWeb. Joost, Last.fm
•  Recommendation Systems
   –  Facebook
•  Data Warehouse
   –  Facebook, AOL
•  Video and Image Analysis
   –  New York Times, Eyealike
Public Hadoop Clouds
•  Hadoop Map-reduce on Amazon EC2 instances
    –  http://wiki.apache.org/hadoop/AmazonEC2
•  IBM Blue Cloud
   –  Partnering with Google to offer web-scale infrastructure
•  Global Cloud Computing Testbed
   –  Joint effort by Yahoo, HP and Intel
Commodity Hardware



Typically in 2 level architecture
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit
Goals of HDFS
•  Very Large Distributed File System
   – 10K nodes, 100 million files, 10 PB
•  Assumes Commodity Hardware
   – Files are replicated to handle hardware failure
   – Detect failures and recovers from them
•  Optimized for Batch Processing
   – Data locations exposed so that computations can
   move to where data resides
   – Provides very high aggregate bandwidth
Distributed File System
•  Single Namespace for entire cluster
•  Data Coherency
   – Write-once-read-many access model
   – Client can only append to existing files
•  Files are broken up into blocks
   – Typically 128 MB block size
   – Each block replicated on multiple DataNodes
•  Intelligent Client
   – Client can find location of blocks
   – Client accesses data directly from DataNode
Functions of a NameNode
•  Manages File System Namespace
   – Maps a file name to a set of blocks
   – Maps a block to the DataNodes where
   it resides
•  Cluster Configuration Management
•  Replication Engine for Blocks
NameNode Metadata
•  Meta-data in Memory
   – The entire metadata is in main memory
   – No demand paging of FS meta-data
•  Types of Metadata
   – List of files
   – List of Blocks for each file
   – List of DataNodes for each block
   – File attributes, e.g access time, replication factor
•  A Transaction Log
   – Records file creations, file deletions. etc
DataNode
•  A Block Server
   – Stores data in the local file system (e.g. ext3)
   – Stores meta-data of a block (e.g. CRC)
   – Serves data and meta-data to Clients
•  Block Report
   – Periodically sends a report of all existing blocks to
   the NameNode
•  Facilitates Pipelining of Data
   – Forwards data to other specified DataNodes
Block Placement
•  Current Strategy
   -- One replica on random node on local rack
   -- Second replica on a random remote rack
   -- Third replica on same remote rack
   -- Additional replicas are randomly placed
•  Clients read from nearest replica
•  Would like to make this policy pluggable
Replication Engine
•  NameNode detects DataNode failures
   – Chooses new DataNodes for new
   replicas
   – Balances disk usage
   – Balances communication traffic to
   DataNodes
Data Correctness
•  Use Checksums to validate data
   – Use CRC32
•  File Creation
   – Client computes checksum per 512 byte
   – DataNode stores the checksum
•  File access
   – Client retrieves the data and checksum
   from DataNode
   – If Validation fails, Client tries other replicas
Namenode Failure
•  A single point of failure
•  Transaction Log stored in multiple
   directories
   – A directory on the local file system
   – A directory on a remote file system
   (NFS/CIFS)
•  Need to develop a real HA solution
Data Pipelining
•  Client retrieves a list of DataNodes on
   which to place replicas of a block
•  Client writes block to the first DataNode
•  The first DataNode forwards the data to
   the next DataNode in the Pipeline
•  When all replicas are written, the Client
   moves on to the next block in file
Secondary NameNode
•  Copies FsImage and Transaction Log from
   NameNode to a temporary directory
•  Merges FSImage and Transaction Log into a
   new FSImage in temporary directory
•  Uploads new FSImage to the NameNode
   – Transaction Log on NameNode is purged
User Interface
•  Command for HDFS User:
   – hadoop dfs -mkdir /foodir
   – hadoop dfs -cat /foodir/myfile.txt
   – hadoop dfs -rm /foodir myfile.txt
•  Command for HDFS Administrator
   – hadoop dfsadmin -report
   – hadoop dfsadmin -decommission datanodename
•  Web Interface
   – http://host:port/dfshealth.jsp
Hadoop Map/Reduce
•  Implementation of the Map-Reduce programming model
   – Framework for distributed processing of large data sets
   – Data handled as collections of key-value pairs
   – Pluggable user code runs in generic framework
•  Very common design pattern in data processing
   – Demonstrated by a unix pipeline example:
   cat * | grep | sort | unique -c | cat > file
   input | map | shuffle | reduce | output
•  Natural for:
    – Log processing
    – Web search indexing
    – Ad-hoc queries
Hadoop Subprojects
•  Hive
  –  A Data Warehouse with SQL support
•  HBase
  – table storage for semi-structured data
•  Zookeeper
  – coordinating distributed applications
•  Mahout
  –  Machine learning
Hadoop at Facebook
•  Hardware
  –  4800 cores, 600 machines, 16GB/8GB per
     machine – Nov 2008
  –  4 SATA disks of 1 TB each per machine
  –  2 level network hierarchy, 40 machines per
     rack
Hadoop at Facebook
•  Single HDFS cluster across all cores
   –  2 PB raw capacity
   –  Ingest rate is 2 TB compressed per day
          – 10 TB uncompressed
   –  12 Million files
Hadoop Growth at Facebook
Data Flow at Facebook



Web
Servers
           Scribe
Servers



                                                  Network

                                                  Storage





 Oracle
RAC
   Hadoop
Cluster
           MySQL

Hadoop Usage at Facebook
•  Statistics per day:
   –  55TB of compressed data scanned per day
   –  3200+ jobs on production cluster per day
   –  80M compute minutes per day
•  Barrier to entry is significantly reduced:
   –  SQL like language called Hive
   –  http://hadoop.apache.org/hive/
Useful Links
•  HDFS Design:
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html

•  Hadoop API:
   – http://lucene.apache.org/hadoop/api/
•  Hadoop Wiki
   –  http://wiki.apache.org/hadoop/

More Related Content

What's hot

9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented DatabasesFabio Fumarola
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationEyad Garelnabi
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 

What's hot (20)

9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Hadoop
HadoopHadoop
Hadoop
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 

Viewers also liked

Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3Sung-jae Park
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfsdatabloginfo
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapakapa rohit
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemAnshul Bhatnagar
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
presentation_Hadoop_File_System
presentation_Hadoop_File_Systempresentation_Hadoop_File_System
presentation_Hadoop_File_SystemBrett Keim
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesNitin Khattar
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemAnand Kulkarni
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoopcneudecker
 
Digital jewellery
Digital jewelleryDigital jewellery
Digital jewelleryChitradevi
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems Ioanna Tsalouchidou
 
Artificial Passenger Sulbha
Artificial Passenger   SulbhaArtificial Passenger   Sulbha
Artificial Passenger SulbhaSulbha Bakshi
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureAngelo Gino Varrati
 
Microsoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvemMicrosoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvemLucas Chies
 

Viewers also liked (20)

Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfs
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
presentation_Hadoop_File_System
presentation_Hadoop_File_Systempresentation_Hadoop_File_System
presentation_Hadoop_File_System
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenes
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
Digital jewellery
Digital jewelleryDigital jewellery
Digital jewellery
 
The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems The Chubby lock service for loosely- coupled distributed systems
The Chubby lock service for loosely- coupled distributed systems
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Artificial Passenger Sulbha
Artificial Passenger   SulbhaArtificial Passenger   Sulbha
Artificial Passenger Sulbha
 
Introduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azureIntroduzione al cloud computing e microsoft azure
Introduzione al cloud computing e microsoft azure
 
Amazon S3 Overview
Amazon S3 OverviewAmazon S3 Overview
Amazon S3 Overview
 
Microsoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvemMicrosoft Azure - O poder da nuvem
Microsoft Azure - O poder da nuvem
 

Similar to Hadoop Distributed File System

Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-researchsaintdevil163
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hivesrikanthhadoop
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopMedia Gorod
 

Similar to Hadoop Distributed File System (20)

Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Borthakur hadoop univ-research
Borthakur hadoop univ-researchBorthakur hadoop univ-research
Borthakur hadoop univ-research
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Big data and hadoop anupama
Big data and hadoop anupamaBig data and hadoop anupama
Big data and hadoop anupama
 
Apache hadoop and hive
Apache hadoop and hiveApache hadoop and hive
Apache hadoop and hive
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with HadoopКонстантин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
Константин Швачко, Yahoo!, - Scaling Storage and Computation with Hadoop
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 

More from elliando dias

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slideselliando dias
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScriptelliando dias
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structureselliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de containerelliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agilityelliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Librarieselliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Webelliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduinoelliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Designelliando dias
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makeselliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.elliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Studyelliando dias
 

More from elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 

Recently uploaded

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Hadoop Distributed File System

  • 1. Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
  • 2. Hadoop, Why? •  Need to process huge datasets on large clusters of computers •  Nodes fail every day – Failure is expected, rather than exceptional. – The number of nodes in a cluster is not constant. •  Very expensive to build reliability into each application. •  Need common infrastructure – Efficient, reliable, easy to use – Open Source, Apache License
  • 3. Hadoop History •  Dec 2004 – Google paper published •  July 2005 – Nutch uses new MapReduce implementation •  Jan 2006 – Doug Cutting joins Yahoo! •  Feb 2006 – Hadoop becomes a new Lucene subproject •  Apr 2007 – Yahoo! running Hadoop on 1000-node cluster •  Jan 2008 – An Apache Top Level Project •  Feb 2008 – Yahoo! production search index with Hadoop •  July 2008 – First reports of a 4000-node cluster
  • 4. Who uses Hadoop? •  Amazon/A9 •  Facebook •  Google •  IBM •  Joost •  Last.fm •  New York Times •  PowerSet •  Veoh •  Yahoo!
  • 5. What is Hadoop used for? •  Search –  Yahoo, Amazon, Zvents, •  Log processing –  Facebook, Yahoo, ContextWeb. Joost, Last.fm •  Recommendation Systems –  Facebook •  Data Warehouse –  Facebook, AOL •  Video and Image Analysis –  New York Times, Eyealike
  • 6. Public Hadoop Clouds •  Hadoop Map-reduce on Amazon EC2 instances –  http://wiki.apache.org/hadoop/AmazonEC2 •  IBM Blue Cloud –  Partnering with Google to offer web-scale infrastructure •  Global Cloud Computing Testbed –  Joint effort by Yahoo, HP and Intel
  • 7. Commodity Hardware Typically in 2 level architecture – Nodes are commodity PCs – 30-40 nodes/rack – Uplink from rack is 3-4 gigabit – Rack-internal is 1 gigabit
  • 8. Goals of HDFS •  Very Large Distributed File System – 10K nodes, 100 million files, 10 PB •  Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recovers from them •  Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth
  • 9. Distributed File System •  Single Namespace for entire cluster •  Data Coherency – Write-once-read-many access model – Client can only append to existing files •  Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes •  Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
  • 10.
  • 11. Functions of a NameNode •  Manages File System Namespace – Maps a file name to a set of blocks – Maps a block to the DataNodes where it resides •  Cluster Configuration Management •  Replication Engine for Blocks
  • 12. NameNode Metadata •  Meta-data in Memory – The entire metadata is in main memory – No demand paging of FS meta-data •  Types of Metadata – List of files – List of Blocks for each file – List of DataNodes for each block – File attributes, e.g access time, replication factor •  A Transaction Log – Records file creations, file deletions. etc
  • 13. DataNode •  A Block Server – Stores data in the local file system (e.g. ext3) – Stores meta-data of a block (e.g. CRC) – Serves data and meta-data to Clients •  Block Report – Periodically sends a report of all existing blocks to the NameNode •  Facilitates Pipelining of Data – Forwards data to other specified DataNodes
  • 14. Block Placement •  Current Strategy -- One replica on random node on local rack -- Second replica on a random remote rack -- Third replica on same remote rack -- Additional replicas are randomly placed •  Clients read from nearest replica •  Would like to make this policy pluggable
  • 15. Replication Engine •  NameNode detects DataNode failures – Chooses new DataNodes for new replicas – Balances disk usage – Balances communication traffic to DataNodes
  • 16. Data Correctness •  Use Checksums to validate data – Use CRC32 •  File Creation – Client computes checksum per 512 byte – DataNode stores the checksum •  File access – Client retrieves the data and checksum from DataNode – If Validation fails, Client tries other replicas
  • 17. Namenode Failure •  A single point of failure •  Transaction Log stored in multiple directories – A directory on the local file system – A directory on a remote file system (NFS/CIFS) •  Need to develop a real HA solution
  • 18. Data Pipelining •  Client retrieves a list of DataNodes on which to place replicas of a block •  Client writes block to the first DataNode •  The first DataNode forwards the data to the next DataNode in the Pipeline •  When all replicas are written, the Client moves on to the next block in file
  • 19. Secondary NameNode •  Copies FsImage and Transaction Log from NameNode to a temporary directory •  Merges FSImage and Transaction Log into a new FSImage in temporary directory •  Uploads new FSImage to the NameNode – Transaction Log on NameNode is purged
  • 20. User Interface •  Command for HDFS User: – hadoop dfs -mkdir /foodir – hadoop dfs -cat /foodir/myfile.txt – hadoop dfs -rm /foodir myfile.txt •  Command for HDFS Administrator – hadoop dfsadmin -report – hadoop dfsadmin -decommission datanodename •  Web Interface – http://host:port/dfshealth.jsp
  • 21. Hadoop Map/Reduce •  Implementation of the Map-Reduce programming model – Framework for distributed processing of large data sets – Data handled as collections of key-value pairs – Pluggable user code runs in generic framework •  Very common design pattern in data processing – Demonstrated by a unix pipeline example: cat * | grep | sort | unique -c | cat > file input | map | shuffle | reduce | output •  Natural for: – Log processing – Web search indexing – Ad-hoc queries
  • 22.
  • 23. Hadoop Subprojects •  Hive –  A Data Warehouse with SQL support •  HBase – table storage for semi-structured data •  Zookeeper – coordinating distributed applications •  Mahout –  Machine learning
  • 24. Hadoop at Facebook •  Hardware –  4800 cores, 600 machines, 16GB/8GB per machine – Nov 2008 –  4 SATA disks of 1 TB each per machine –  2 level network hierarchy, 40 machines per rack
  • 25. Hadoop at Facebook •  Single HDFS cluster across all cores –  2 PB raw capacity –  Ingest rate is 2 TB compressed per day – 10 TB uncompressed –  12 Million files
  • 26. Hadoop Growth at Facebook
  • 27. Data Flow at Facebook Web
Servers
 Scribe
Servers
 Network
 Storage
 Oracle
RAC
 Hadoop
Cluster
 MySQL

  • 28. Hadoop Usage at Facebook •  Statistics per day: –  55TB of compressed data scanned per day –  3200+ jobs on production cluster per day –  80M compute minutes per day •  Barrier to entry is significantly reduced: –  SQL like language called Hive –  http://hadoop.apache.org/hive/
  • 29. Useful Links •  HDFS Design: – http://hadoop.apache.org/core/docs/current/hdfs_design.html •  Hadoop API: – http://lucene.apache.org/hadoop/api/ •  Hadoop Wiki –  http://wiki.apache.org/hadoop/