HDFS presented by VIJAY
1. Hadoop Distributed File System (HDFS)
SEMINAR GUIDE
Mr. PRAMOD PAVITHRAN
HEAD OF DIVISION
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING, CUSAT
PRESENTED BY
VIJAY PRATAP SINGH
REG NO: 12110083
S7, CS-B
ROLL NO: 81
2. CONTENTS
WHAT IS HADOOP
PROJECT COMPONENTS IN HADOOP
MAP/REDUCE
HDFS
ARCHITECTURE
WRITE & READ IN HDFS
GOALS OF HADOOP
COMPARISON WITH OTHER SYSTEMS
CONCLUSION
REFERENCES
6. WHAT IS HADOOP?
o Hadoop is an open-source software framework.
o The Hadoop framework consists of two main layers:
● Distributed file system (HDFS)
● Execution engine (MapReduce)
o Supports data-intensive distributed applications.
o Licensed under the Apache v2 license.
o It enables applications to work with thousands of computation-independent computers and petabytes of data.
9. MAP/REDUCE
o Hadoop is a popular open-source implementation of MapReduce.
o MapReduce is a programming model for processing large data sets.
o MapReduce is typically used for distributed computing on clusters of computers.
o MapReduce can take advantage of data locality, processing data on or near the nodes that store it to reduce data transfer.
o The model is inspired by the map and reduce functions of functional programming.
o "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node.
o "Reduce" step: The master node then collects the answers to all the sub-problems and combines them to form the final output.
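The map and reduce steps can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the explicit shuffle stands in for the grouping a real framework performs between the two steps.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key before they are reduced."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk plays the role of one sub-problem handed to a worker node.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster each chunk would be mapped on a different node; here the loop over chunks only imitates that division of work.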
11. HDFS
Highly scalable file system
◦ 6K nodes and 120PB
◦ Add commodity servers and disks to scale storage and IO bandwidth
Supports parallel reading & processing of data
◦ Optimized for streaming reads/writes of large files
◦ Bandwidth scales linearly with the number of nodes and disks
Fault tolerant & easy management
◦ Built-in redundancy
◦ Tolerates disk and node failure
◦ Automatically manages addition/removal of nodes
◦ One operator per 3K nodes
Scalable, Reliable & Manageable
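25. PIPELINED WRITE
[Figure: pipelined write of File.txt, block A. Components shown: Client, Name Node, Core Switch, two rack switches, DataNode 1, DataNode 7, DataNode 9, and racks labeled Rack 1 and Rack 5. Rack awareness table: Rack 1: DN 1; Rack 2: DN 7, 9. A copy of block A appears on each of DataNodes 1, 7 and 9. Message labels: "Block Received", "Success". Name Node metadata: File.txt = Block A, DN: 1, 7, 9.]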
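The write path in the figure can be sketched as a small simulation. This is a simplified model under stated assumptions, not HDFS code: choose_pipeline and pipelined_write are hypothetical names, the writer is assumed to sit on Rack 1, and the placement rule (one replica on the writer's rack, the remaining two on one other rack) is a reduced version of a rack-aware policy that happens to reproduce the DN 1, 7, 9 layout shown.

```python
def choose_pipeline(racks, local_rack, replication=3):
    """Pick one DataNode on the writer's rack and the rest on one other rack."""
    first = racks[local_rack][0]
    remote = next(
        name for name in sorted(racks)
        if name != local_rack and len(racks[name]) >= replication - 1
    )
    return [first] + racks[remote][: replication - 1]


def pipelined_write(block_id, data, pipeline, storage):
    """Store the block on each DataNode in turn; acks come back in reverse order."""
    acks = []

    def send(position):
        if position == len(pipeline):
            return
        node = pipeline[position]
        storage.setdefault(node, {})[block_id] = data  # node stores its replica
        send(position + 1)                             # then forwards downstream
        acks.append(node)                              # acks after downstream acked

    send(0)
    return acks


# Rack layout taken from the rack awareness table in the figure.
racks = {"Rack 1": [1], "Rack 2": [7, 9]}
pipeline = choose_pipeline(racks, local_rack="Rack 1")
storage = {}
acks = pipelined_write("A", b"contents of block A", pipeline, storage)

# What the Name Node would record once the write succeeds.
metadata = {"File.txt": {"A": pipeline}}
print(pipeline, acks, metadata)
```

The point of the pipeline is that the client sends the block once; each DataNode forwards it to the next, so replication does not multiply the client's outgoing traffic.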
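26. HDFS READ
[Figure: HDFS read of File.txt. Components shown: Client, Name Node, Core Switch, two rack switches, DataNode 1, DataNode 7, DataNode 9, and racks labeled Rack 1 and Rack 5. Client request: "I want to read File.txt". Name Node reply: "Block A available at DataNode [1, 7, 9]". A copy of block A appears on each of the three DataNodes. Name Node metadata: File.txt = Block A, DN: 1, 7, 9.]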
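The read exchange in the figure can be sketched in the same spirit. This is an illustrative model, not the HDFS client API: get_block_locations and read_file are hypothetical names, the block contents are made up, and the fallback to the next replica when a DataNode is down is an assumption about how a client would use the replica list.

```python
# Name Node metadata from the figure: File.txt has one block, A, on DataNodes 1, 7, 9.
METADATA = {"File.txt": [("A", [1, 7, 9])]}

# Every listed DataNode holds a replica of block A (contents are illustrative).
DATANODES = {dn: {"A": b"hello hdfs"} for dn in (1, 7, 9)}


def get_block_locations(filename):
    """Name Node side: return (block id, DataNode list) pairs for a file."""
    return METADATA[filename]


def read_file(filename, datanodes, down=frozenset()):
    """Client side: fetch each block from the first listed DataNode that is up."""
    data = b""
    for block_id, locations in get_block_locations(filename):
        for dn in locations:
            if dn not in down:
                data += datanodes[dn][block_id]
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


print(read_file("File.txt", DATANODES))
print(read_file("File.txt", DATANODES, down={1}))  # falls back to DataNode 7
```

Note that the Name Node only answers the metadata query; the block bytes travel directly between the client and a DataNode.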
28. GOALS OF HDFS
Very Large Distributed File System
◦ 10K nodes, 100 million files, 10PB
Assumes Commodity Hardware
◦ Files are replicated to handle hardware failure
◦ Detect failures and recover from them
Optimized for Batch Processing
◦ Data locations exposed so that computations can move to where data resides
◦ Provides very high aggregate bandwidth
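Exposing data locations lets a scheduler place each task on a node that already stores the task's input block. The following is a minimal sketch of that idea, assuming a hypothetical schedule function and made-up slot counts; it is not how Hadoop's scheduler is implemented.

```python
def schedule(block_locations, free_slots):
    """Assign each block's task to a replica holder when one has a free slot.

    Returns {block: (node, is_local)}; falls back to any node with a free slot.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots in the cluster")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in replicas)
    return assignment


# Block A is on DataNodes 1, 7, 9; block B on 1, 2, 3 (illustrative layout).
plan = schedule({"A": [1, 7, 9], "B": [1, 2, 3]}, free_slots={1: 1, 4: 1})
print(plan)
```

Here the task for block A runs on node 1, where the data already is, while block B's task has to run remotely on node 4 because no replica holder has a free slot.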
33. TO LEARN MORE
Source code
◦ http://hadoop.apache.org/version_control.html
◦ http://svn.apache.org/viewvc/hadoop/common/trunk/
Hadoop releases
◦ http://hadoop.apache.org/releases.html
Contribute to it
◦ http://wiki.apache.org/hadoop/HowToContribute
34. CONCLUSION
HDFS provides a reliable, scalable and manageable solution for working with huge amounts of data
Future secure
HDFS has been deployed in clusters of 10 to 4K DataNodes
◦ Used in production at companies such as Yahoo!, Facebook, Twitter and eBay
◦ Many enterprises, including financial companies, use Hadoop
35. REFERENCES
[1] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 23–34, 2007.
[2] Tom White. Hadoop: The Definitive Guide, Third Edition. O'Reilly Media, May 2012.
[3] Jeffrey Shafer, Scott Rixner, and Alan L. Cox. The Hadoop Distributed Filesystem: Balancing Portability and Performance. Rice University, Houston, TX.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. Yahoo!, Sunnyvale, California, USA.
[5] Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz. Efficient Big Data Processing in Hadoop MapReduce. Information Systems Group, Saarland University.