2. Agenda
1 − Meet Hadoop
   − History
   − Data!
   − Data Storage and Analysis
   − What Hadoop is Not
2 − The Hadoop Distributed File System
   − HDFS Concept
   − Architecture
   − Goals
   − Command Line User Interface
3 − MapReduce
   − Overview
   − How MapReduce Works
4 − Practice
   − Demo
   − Discussion
www.exoplatform.com - Copyright 2012 eXo Platform 2
3. Meet Hadoop
- History
- Data!
- Data Storage and Analysis
- What Hadoop is Not
4. History
5. History
- Hadoop got its start in Nutch: a few developers were attempting to build an
open source web search engine and were having trouble managing
computations running on even a handful of computers.
- Once Google published its GFS and MapReduce papers, the route became
clear: Google had devised systems that solved precisely the problems they
were having with Nutch. So two of them started, half-time, to re-create
these systems as a part of Nutch.
- Around that time, Yahoo! got interested and quickly put together a team.
They split the distributed computing part off from Nutch, naming it Hadoop.
With the help of Yahoo!, Hadoop soon grew into a technology that could truly
scale to the Web.
6. Data! We live in the data age
8. Data Storage and Analysis
- While the storage capacities of hard drives have increased massively over
the years, access speeds (the rate at which data can be read from drives)
have not kept up. One typical drive from 1990 could store 1,370 MB of
data and had a transfer speed of 4.4 MB/s. Over 20 years later, one-terabyte
drives are the norm, but the transfer speed is only around 100 MB/s.
- At that speed it takes around two and a half hours to read all the data on a
single drive, and writing is even slower.
9. Data Storage and Analysis
The obvious way:
- Imagine we have 100 drives, each holding one hundredth of the data.
Working in parallel, we could read all of it in under two minutes.
- Using only one hundredth of each disk may seem wasteful, but we can store
one hundred datasets, each one terabyte in size, and provide shared access to
them.
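The two reading times claimed on these slides can be checked with a little back-of-the-envelope arithmetic (decimal units assumed, 100 MB/s per drive as on the previous slide):

```python
TB = 10**12              # bytes in one terabyte (decimal)
SPEED = 100 * 10**6      # per-drive transfer speed of ~100 MB/s

# One drive reading a full terabyte sequentially
single_drive_secs = TB // SPEED            # 10,000 s, roughly 2.8 hours

# 100 drives each reading one hundredth of the data in parallel
hundred_drives_secs = single_drive_secs // 100   # 100 s, under two minutes
```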
10. Data Storage and Analysis
The problems to solve:
- The first: as soon as you start using many pieces of hardware, the chance that
one will fail is fairly high. A common way of avoiding data loss is through
replication: redundant copies of the data are kept by the system so that in the
event of failure, another copy is available.
- The second: most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with data from
any of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources, but doing this correctly is notoriously challenging.
With Hadoop:
Hadoop provides a reliable shared storage and analysis system: the storage is
provided by HDFS and the analysis by MapReduce.
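The replication idea can be illustrated with a toy placement function. This is a hypothetical sketch, not HDFS's actual rack-aware placement policy; it only shows the invariant that each block gets copies on several distinct machines:

```python
import random

def place_replicas(datanodes, replication=3):
    # Toy sketch: pick `replication` distinct nodes to hold
    # copies of one block, so a single failure loses no data.
    return random.sample(datanodes, replication)

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
replicas = place_replicas(nodes)
```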
11. What Hadoop is Not
- It is not a substitute for a database. Hadoop stores data in files and does not
index them. If you want to find something, you have to run a MapReduce job
over all the data. This takes time, and means that you cannot use Hadoop
directly as a substitute for a database. Where Hadoop works well is where the
data is too big for a database: with very large datasets, the cost of regenerating
indexes is so high that you can't easily index changing data.
- MapReduce is not always the best algorithm. MapReduce is a profound idea:
taking a simple functional programming operation and applying it, in parallel, to
gigabytes or terabytes of data. But there is a price: for that parallelism, each
MapReduce operation must be independent of all the others. If you need to know
everything that has gone before, you have a problem.
- Hadoop and MapReduce are not a place to learn Java programming.
- Hadoop is not an ideal place to learn about networking error messages.
- Hadoop clusters are not a place to learn Unix/Linux system administration.
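The "simple functional programming operation" the slide refers to can be sketched in plain Python (this is an illustration of the idea only, not the Hadoop API):

```python
from functools import reduce

# Map: apply an independent, side-effect-free function to every record.
# Because each call depends on nothing else, this step parallelizes trivially.
lengths = list(map(len, ["hadoop", "hdfs", "mapreduce"]))

# Reduce: combine the mapped values into a single result.
# This is the point where the independent pieces must meet.
total = reduce(lambda acc, n: acc + n, lengths, 0)
```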
12. The Hadoop Distributed File System
- HDFS Concept
- Architecture
- Goals
- Command Line User Interface
13. HDFS concept
Block:
- A disk has a block size, which is the minimum amount of data that it can read or
write. Filesystems for a single disk build on this by dealing with data in blocks;
disk blocks are normally 512 bytes.
- HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by
default. As in a filesystem for a single disk, files in HDFS are broken into block-
sized chunks, which are stored as independent units. Unlike in a filesystem for a
single disk, a file in HDFS that is smaller than a single block does not occupy a
full block's worth of underlying storage.
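A small worked example of the chunking rule above, for a hypothetical 200 MB file under the 64 MB default block size:

```python
import math

BLOCK = 64 * 1024 * 1024           # HDFS default block size: 64 MB
file_size = 200 * 1024 * 1024      # a hypothetical 200 MB file

num_blocks = math.ceil(file_size / BLOCK)          # split into 4 chunks
# The final chunk holds only 8 MB and, unlike a disk block,
# occupies just 8 MB of underlying storage, not a padded 64 MB.
last_chunk = file_size - (num_blocks - 1) * BLOCK
```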
14. HDFS Concept
NameNode and DataNodes:
- A Hadoop cluster has two types of node operating in a master-worker pattern:
a NameNode (the master) and a number of DataNodes (the workers).
- The NameNode manages the filesystem namespace. It maintains the filesystem
tree and the metadata for all the files and directories in the tree. It executes
filesystem namespace operations such as opening, closing, and renaming files and
directories, and it determines the mapping of blocks to DataNodes.
- DataNodes are the workhorses of the filesystem. They store and retrieve blocks
when they are told to (by clients or the NameNode), and they report back to the
NameNode periodically with lists of the blocks that they are storing.
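The relationship between block reports and the NameNode's block map can be sketched as follows. This is a toy model, not the real HDFS protocol: it only shows how per-DataNode reports invert into a block-to-locations view:

```python
# Hypothetical block reports: each DataNode lists the blocks it stores.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_2", "blk_3"],
}

# The NameNode derives its block-to-DataNode mapping from these reports.
block_locations = {}
for datanode, blocks in reports.items():
    for blk in blocks:
        block_locations.setdefault(blk, []).append(datanode)
```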
17. HDFS Goals
- Hardware Failure: an HDFS instance may consist of hundreds or thousands of
server machines, each storing part of the filesystem's data. The fact that there are
a huge number of components, each with a non-trivial probability of failure,
means that some component of HDFS is always non-functional. Therefore,
detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
- Large Data Sets: applications that run on HDFS have large data sets. A typical
file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large
files. It should provide high aggregate data bandwidth, scale to hundreds of
nodes in a single cluster, and support tens of millions of files in a single instance.
- "Moving Computation is Cheaper than Moving Data": a computation
requested by an application is much more efficient if it is executed near the data it
operates on. This is especially true when the size of the data is huge: it minimizes
network congestion and increases the overall throughput of the system. The
assumption is that it is often better to migrate the computation closer to where the
data is located than to move the data to where the application is running. HDFS
provides interfaces for applications to move themselves closer to where the data is
located.
18. Command Line User Interface
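This slide's examples did not survive the transcript, so here are a few typical commands, assuming a configured Hadoop 1.x installation (paths such as /user/demo and data.txt are illustrative):

```shell
hadoop fs -mkdir /user/demo            # create a directory in HDFS
hadoop fs -put data.txt /user/demo/    # copy a local file into HDFS
hadoop fs -ls /user/demo               # list the directory's contents
hadoop fs -cat /user/demo/data.txt     # print a file stored in HDFS
hadoop fs -rm /user/demo/data.txt      # delete a file from HDFS
```

The commands mirror familiar Unix file utilities (mkdir, ls, cat, rm), which is why the shell is the quickest way to start exploring HDFS.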
19. MapReduce
- Overview
- How MapReduce Works
20. Overview
- Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte datasets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.
- A MapReduce job usually splits the input dataset into independent chunks which
are processed by the map tasks in a completely parallel manner. The framework
sorts the outputs of the maps, which are then input to the reduce tasks. Typically
both the input and the output of the job are stored in a filesystem. The framework
takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
- The MapReduce framework consists of a single master JobTracker and one
worker TaskTracker per cluster node. The master is responsible for scheduling
the job's component tasks on the workers, monitoring them, and re-executing
failed tasks. The workers execute the tasks as directed by the master.
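The data flow described above (map in parallel, sort/group by key, then reduce) can be sketched as a single-process word count in Python. This illustrates the flow only; it is not the Hadoop Java API, and the shuffle here is a plain dictionary rather than the framework's distributed sort:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input split
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: combine all the values that share a key
    return word, sum(counts)

lines = ["hadoop stores data", "hadoop processes data"]

# Map step: each line could be handled by a different map task
intermediate = [pair for line in lines for pair in map_phase(line)]

# Shuffle/sort: the framework groups intermediate values by key
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce step: one call per key, independent of the others
result = dict(reduce_phase(w, c) for w, c in sorted(groups.items()))
```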