Apache Hadoop HDFS Architecture and Resilience

Apache Hadoop HDFS
● What is it?
● What is it for?
● Architecture
● Resilience
● Administration
● Data access
● Future changes?

HDFS – What is it?
● HDFS = Hadoop Distributed File System
● It is a distributed file system
● Runs on low cost hardware
● It is open source
● Written in Java
● Fault tolerant
● Designed for very large data sets
● Tuned for high throughput

HDFS – What is it for?
● Designed for batch processing
● Streaming access to data
● Large data sizes, e.g. terabytes
● Highly reliable using data replication
● Supports very large node clusters
● Supports large files
● Supports millions of files

HDFS – Architecture
● Has a master / slave architecture
● A master NameNode
– Controls file system operations
– Maps data blocks to DataNodes
– Logs all changes
● Slave DataNodes
– Store file blocks
– Store replicated data
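
The division of labour above can be sketched with a toy model. This is only an illustration, not HDFS code: the class name ToyNameNode, the 128 MB block size, the replication factor of 3 and the round-robin placement are all assumptions made for the example (real HDFS placement is rack-aware and the real NameNode does far more).

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (not from the slides)
REPLICATION = 3                 # assumed replication factor (not from the slides)


class ToyNameNode:
    """Hypothetical in-memory stand-in for the NameNode's metadata."""

    def __init__(self, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
        self.datanodes = list(datanodes)
        if replication > len(self.datanodes):
            raise ValueError("need at least as many DataNodes as replicas")
        self.block_size = block_size
        self.replication = replication
        self.block_map = {}  # (path, block index) -> list of DataNode names
        self.edit_log = []   # append-only record of namespace changes
        self._next = itertools.cycle(range(len(self.datanodes)))

    def add_file(self, path, size_bytes):
        """Split a file into blocks and assign each block to DataNodes."""
        n_blocks = -(-size_bytes // self.block_size)  # ceiling division
        for index in range(n_blocks):
            start = next(self._next)
            # Simple round-robin placement, chosen only to keep the sketch short.
            targets = [self.datanodes[(start + k) % len(self.datanodes)]
                       for k in range(self.replication)]
            self.block_map[(path, index)] = targets
        self.edit_log.append(("add", path, n_blocks))
        return n_blocks

    def locate(self, path, index):
        """Return the DataNodes that hold one block of a file."""
        return self.block_map[(path, index)]


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
print(nn.add_file("/data/logs.txt", 300 * 1024 * 1024))  # 3
print(nn.locate("/data/logs.txt", 0))  # ['dn1', 'dn2', 'dn3']
```

The NameNode in this sketch holds only metadata (the block map and the change log). The file contents would live on the DataNodes.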

HDFS – Resilience
● Data is replicated across DataNodes
● Nodes may fail but data is still available
● DataNodes report their state via heartbeat messages
● The master NameNode is a single point of failure
● Data integrity is verified via checksums
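
How heartbeats, replication and checksums work together can be sketched as follows. This is a minimal illustration under stated assumptions, not the HDFS implementation: the function names, the 30 second timeout and the use of a single CRC32 per block are choices made for the example, and real HDFS differs in detail.

```python
import zlib

HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value, not an HDFS default


def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the DataNodes whose last heartbeat is recent enough."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout}


def read_block(replicas, expected_crc, alive):
    """Return the first replica that is on a live node and passes its checksum.

    replicas maps a DataNode name to the bytes it holds for one block.
    """
    for node, data in replicas.items():
        if node not in alive:
            continue  # node missed its heartbeats, try another replica
        if zlib.crc32(data) == expected_crc:
            return node, data
        # Checksum mismatch: this copy is corrupt, so try the next replica.
    raise IOError("no healthy replica available")


block = b"hello hdfs"
crc = zlib.crc32(block)
replicas = {"dn1": block, "dn2": b"hellX hdfs", "dn3": block}
heartbeats = {"dn1": 0.0, "dn2": 95.0, "dn3": 98.0}

alive = live_nodes(heartbeats, now=100.0)
print(sorted(alive))                        # ['dn2', 'dn3']
print(read_block(replicas, crc, alive)[0])  # dn3
```

Here dn1 is treated as failed because its heartbeat is stale, dn2 is rejected because its copy fails the checksum, and the read still succeeds from dn3.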

HDFS – Administration
● Access via Java API
● FS shell command language
● HTTP browser
● C wrapper for Java API
● Space reclamation
– Via control of replication factor
– Deleted files sent to trash folder
– Trash folder cleaned after configurable time
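
The two space reclamation mechanisms above can be sketched as a small model. The names raw_usage and ToyTrash and the numbers used are hypothetical and exist only for this example. In a real cluster the replication factor is typically changed with the FS shell (for example `hdfs dfs -setrep`) and the trash retention time is set by the `fs.trash.interval` property.

```python
def raw_usage(file_sizes, replication):
    """Raw cluster bytes consumed: each file is stored `replication` times."""
    return sum(file_sizes) * replication


class ToyTrash:
    """Deleted paths wait here until they are older than `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self.deleted_at = {}  # path -> deletion time

    def delete(self, path, now):
        """Move a path to the trash instead of removing it immediately."""
        self.deleted_at[path] = now

    def restore(self, path):
        """Undo a delete while the path is still in the trash."""
        return self.deleted_at.pop(path, None) is not None

    def expunge(self, now):
        """Permanently remove entries older than the interval; return them."""
        expired = sorted(p for p, t in self.deleted_at.items()
                         if now - t >= self.interval)
        for p in expired:
            del self.deleted_at[p]
        return expired


sizes = [10 * 2**30, 5 * 2**30]  # two files, 10 GiB and 5 GiB
saved = raw_usage(sizes, 3) - raw_usage(sizes, 2)
print(saved // 2**30)            # 15 (GiB reclaimed by lowering replication)

trash = ToyTrash(interval=3600)
trash.delete("/tmp/old.csv", now=0)
trash.delete("/tmp/new.csv", now=3000)
print(trash.expunge(now=3600))   # ['/tmp/old.csv']
```

Lowering the replication factor frees space at once, while a deleted file only frees space after it has aged out of the trash.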

HDFS – Future changes
Features that might be considered for HDFS
● File append
● User quotas
● File links
● Standby nodes

Other Areas
● Want to know about?
– Big Data
– Nutch
– Solr
● See my other presentations

Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems
