5. Growth of Big Data
• 90% of the world's data was created in the last 2 years
• 50x growth from 2010 to 2020
6. Who Generates Big Data?
Have you ever wondered how Google, Facebook, or LinkedIn manages to store and utilize such huge volumes of data?
Today, managing such BIG DATA is becoming a problem for all of us…
9. What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
• It is designed to scale out from single servers to thousands of machines, each offering local computation and storage
• Rather than relying on a single server, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure
10. Hadoop and Its Characteristics
• It is an open-source data management technology with scale-out storage and distributed processing
13. What Is HDFS?
• Hadoop Distributed File System
• Stores files in blocks across many nodes in a cluster
• Replicates the blocks across nodes for durability
• Master/Slave architecture
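As a rough illustration of how those blocks and replicas are visible to clients, the Java sketch below asks Hadoop's FileSystem API for a file's block locations; the path /data/example.txt is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path, used only for illustration
    Path file = new Path("/data/example.txt");
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block; each lists the data nodes holding a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}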
14. HDFS Master
• Name Node
–Runs on a single node as a master process
–Holds file metadata (which blocks are where)
–Directs client access to files in HDFS
–Receives heartbeats and block reports from all data nodes
• Secondary Name Node
–Not a hot failover (it does not take over automatically when the Name Node fails)
–Maintains a checkpoint copy of the Name Node metadata
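A quick way to see what the Name Node has learned from those heartbeats and block reports is the dfsadmin report:

hdfs dfsadmin -report   # lists live/dead data nodes, capacity, and block counts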
15. HDFS Slaves
• Data Node
–Generally runs on all nodes in the cluster
–Handles block creation, replication, deletion, and reads
–Takes orders from the Name Node
20. Design of HDFS
• HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware
• Very Large Files
–Files that are hundreds of gigabytes (GB) or terabytes (TB) in size
• Streaming Data Access
–HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time
• Commodity Hardware
–Commonly available, cheap hardware (not enterprise grade)
21. When Not to Use HDFS
• When you need low-latency access to data
–Applications that require fast data access will not work well with HDFS; HDFS is optimized for high throughput of data
• When you have lots of small files
–The Name Node holds the filesystem metadata in memory, and every block, file, and directory occupies around 150 bytes. A large number of small files therefore places a heavy burden on the Name Node
• When random writes within a file are needed
–Files in HDFS may be written to by a single writer, and writes are always made at the end of the file in append-only fashion. There is no support for multiple writers or for modifications at random offsets (positions), as the sketch below illustrates
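A minimal Java sketch of that append-only write model, assuming the cluster supports append and an existing file at the hypothetical path /data/log.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Appends go at the end of the file; HDFS offers no seek-and-overwrite for writers
    try (FSDataOutputStream out = fs.append(new Path("/data/log.txt"))) {
      out.writeBytes("new record\n");
    }
    fs.close();
  }
}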
22. HDFS Shell
• Easy to use command line interface
• Create, copy, move, and delete files
• Administrative duties – chmod, chown, chgrp
• Set replication factor for a file
• Head, tail, cat to view files
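A few representative commands covering the operations above (paths and file names are hypothetical):

hdfs dfs -mkdir -p /data                  # create a directory
hdfs dfs -put sample.txt /data/           # copy a local file into HDFS
hdfs dfs -ls /data                        # list directory contents
hdfs dfs -cat /data/sample.txt            # print a file
hdfs dfs -tail /data/sample.txt           # show the last kilobyte of a file
hdfs dfs -chmod 644 /data/sample.txt      # administrative duties: chmod/chown/chgrp
hdfs dfs -setrep -w 2 /data/sample.txt    # set the replication factor for a file
hdfs dfs -mv /data/sample.txt /archive/   # move a file
hdfs dfs -rm /archive/sample.txt          # delete a file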
26. MapReduce Components
• Job Tracker
–Coordinates all the jobs run on the system by scheduling tasks
–Keeps a record of the overall progress of each job
–If a task fails, reschedules it on a different task tracker
• Task Tracker
–Slave daemon that accepts tasks to be run on a block of data
–Sends progress reports as heartbeat signals to the job tracker at regular intervals
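To make the programming model concrete, here is a minimal sketch of the classic WordCount job written against the org.apache.hadoop.mapreduce API; input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}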