1. Apache Hadoop HDFS
● What is it ?
● What is it for ?
● Architecture
● Resilience
● Administration
● Data access
● Future changes ?
2. HDFS – What is it ?
● HDFS = Hadoop Distributed File System
● It is a distributed file system
● Runs on low cost hardware
● It is open source
● Written in Java
● Fault tolerant
● Designed for very large data sets
● Tuned for high throughput
3. HDFS – What is it for ?
● Designed for batch processing
● Streaming access to data
● Large data sizes, e.g. terabytes
● Highly reliable using data replication
● Supports very large node clusters
● Supports large files
● Supports millions of files
5. HDFS – Architecture
● Has a master / slave architecture
● A master NameNode
– Controls file system operations
– Maps data blocks to DataNodes
– Logs all changes
● Slave DataNodes
– Store file blocks
– Store replicated data
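The block-to-DataNode mapping above can be illustrated with a small toy model. This is only a sketch, not the real NameNode code: the class name BlockMapSketch, the round-robin placement policy, the 128 MB block size and the replication factor of 3 are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of NameNode block bookkeeping (hypothetical, not the HDFS implementation). */
public final class BlockMapSketch {
    // blockId -> names of the DataNodes that hold a replica of that block
    private final Map<Integer, List<String>> blockLocations = new LinkedHashMap<>();
    private final List<String> dataNodes;
    private final int replication;

    public BlockMapSketch(List<String> dataNodes, int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("replication must be between 1 and the node count");
        }
        this.dataNodes = new ArrayList<>(dataNodes);
        this.replication = replication;
    }

    /** Number of blocks needed to hold a file (ceiling division). */
    public static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Splits a file into blocks and places each block on distinct nodes, round-robin (assumed policy). */
    public void addFile(long fileBytes, long blockBytes) {
        long blocks = blockCount(fileBytes, blockBytes);
        for (long b = 0; b < blocks; b++) {
            int blockId = blockLocations.size();
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((blockId + r) % dataNodes.size()));
            }
            blockLocations.put(blockId, replicas);
        }
    }

    /** Returns the DataNodes holding the given block, or null if the block is unknown. */
    public List<String> locate(int blockId) {
        return blockLocations.get(blockId);
    }

    /** True if every block keeps at least one replica when the given node fails. */
    public boolean survivesFailureOf(String failedNode) {
        for (List<String> replicas : blockLocations.values()) {
            boolean alive = false;
            for (String node : replicas) {
                if (!node.equals(failedNode)) {
                    alive = true;
                }
            }
            if (!alive) {
                return false;
            }
        }
        return true;
    }
}
```

With four nodes and three replicas, a 300 MB file at the assumed 128 MB block size needs three blocks, and losing any single node still leaves every block readable. With a single replica that guarantee disappears.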
6. HDFS – Resilience
● Data is replicated across DataNodes
● Nodes may fail but data is still available
● DataNodes indicate state via heartbeat reports
● Single point of failure in master NameNode
● Data integrity via checksums
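The checksum idea above can be sketched with the JDK's own CRC32 class. This is a minimal illustration, assuming one checksum per fixed-size chunk; the class name ChecksumSketch, the method names and the chunk size are hypothetical and do not reflect how HDFS stores its checksums internally.

```java
import java.util.zip.CRC32;

/** Per-chunk checksum sketch (hypothetical helper, not the HDFS implementation). */
public final class ChecksumSketch {

    /** Computes one CRC32 value per chunk of chunkSize bytes; the last chunk may be shorter. */
    public static long[] checksums(byte[] data, int chunkSize) {
        int chunks = (data.length + chunkSize - 1) / chunkSize;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * chunkSize;
            int length = Math.min(chunkSize, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /**
     * Returns the index of the first chunk whose checksum differs from the
     * expected value, or -1 if all chunks match. A differing chunk count is
     * reported at the first missing or extra chunk.
     */
    public static int firstCorruptChunk(byte[] data, int chunkSize, long[] expected) {
        long[] actual = checksums(data, chunkSize);
        int common = Math.min(actual.length, expected.length);
        for (int i = 0; i < common; i++) {
            if (actual[i] != expected[i]) {
                return i;
            }
        }
        return actual.length == expected.length ? -1 : common;
    }
}
```

A reader that finds a corrupt chunk on one DataNode can fall back to another replica of the same block, which is why checksums and replication work together.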
7. HDFS – Administration
● Access via Java API
● FS Shell command language
● HTTP browser
● C wrapper for Java API
● Space reclamation
– Via control of replication factor
– Deleted files sent to trash folder
– Trash folder cleaned after configurable time
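The trash behaviour above can be modelled in a few lines. This is a sketch under stated assumptions: the class TrashSketch and its methods are hypothetical, time is passed in explicitly as milliseconds, and the retention period stands in for the configurable interval (in Hadoop this is, to my knowledge, set through the fs.trash.interval property).

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a trash folder with a configurable retention time (hypothetical). */
public final class TrashSketch {
    private final long retentionMillis;
    // path -> time (in milliseconds) at which the file was moved to trash
    private final Map<String, Long> deletedAt = new LinkedHashMap<>();

    public TrashSketch(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Moves a file to trash instead of removing it immediately. */
    public void delete(String path, long nowMillis) {
        deletedAt.put(path, nowMillis);
    }

    /** Restores a file from trash; returns false if it was already purged or never deleted. */
    public boolean restore(String path) {
        return deletedAt.remove(path) != null;
    }

    public boolean inTrash(String path) {
        return deletedAt.containsKey(path);
    }

    /** Permanently removes entries older than the retention time; returns how many were purged. */
    public int purgeExpired(long nowMillis) {
        int purged = 0;
        Iterator<Map.Entry<String, Long>> it = deletedAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= retentionMillis) {
                it.remove();
                purged++;
            }
        }
        return purged;
    }
}
```

Space is only reclaimed when an entry is purged, so a deleted file can still be restored until its retention time has passed.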
8. HDFS – Future changes
Things they might consider for HDFS
● File append
● User quotas
● File links
● Standby nodes
9. Other Areas
● Want to know about ?
– Big Data
– Nutch
– Solr
● See my other presentations
10. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems