1. A new way to store and analyze data
Sandesh Deshmane
2. • What is Hadoop?
• Why, Where, When?
• Benefits of Hadoop
• How Hadoop Works?
• Hadoop Architecture
• HDFS
• Hadoop MapReduce
• Installation &
Execution
• Demo
Topics Covered
3. In pioneer days they used oxen for heavy pulling, and when
one ox couldn’t budge a log, they didn’t try to grow a
larger ox. We shouldn’t be trying for bigger computers, but
more systems of computers.
—Grace Hopper
History
4. • The size of the digital universe was estimated at 0.18 zettabytes in 2006 and
3 zettabytes in 2012
(1 zettabyte = 10^21 bytes = 1,000 exabytes = 1 million petabytes = 1 billion terabytes)
• The New York Stock Exchange generates about 1 TB of data per day
• Facebook stores around 10 billion photos, roughly 1 petabyte
• The Internet Archive stores about 1 petabyte of data, and it is growing by
around 20 TB per month
Background
5. • Created by Doug Cutting
• Open source, under the Apache Software Foundation
• Consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System
(HDFS)
b. High-performance parallel data processing using a technique called
MapReduce
• Hadoop runs large-scale, high-performance processing jobs reliably, in spite
of system changes or failures
What is Hadoop?
6. • Need to process 100 TB datasets
• On 1 node:
– scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
– scanning @ 50 MB/s = 33 min
• Need an efficient, reliable, and usable framework (the arithmetic is sketched after this slide)
Hadoop, Why?
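Rough arithmetic behind the numbers on the previous slide: 100 TB is about 10^14 bytes. Scanning at 50 MB/s (5 × 10^7 bytes/s) on a single node takes 10^14 / (5 × 10^7) = 2 × 10^6 seconds, roughly 23 days; spreading the same scan over 1,000 nodes takes about 2 × 10^3 seconds, roughly 33 minutes.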
7. Where
• Batch data processing, not
real-time / user facing (e.g.
Document Analysis and
Indexing, Web Graphs and
Crawling)
• Highly parallel, data-intensive
distributed applications
• Very large production
deployments
When
• Process lots of unstructured
data
• When your processing can
easily be made parallel
• Running batch jobs is
acceptable
• When you have access to lots
of cheap hardware
Where and When Hadoop?
8. • Runs on cheap commodity hardware
• Automatically handles data replication and node
failure
• It does the hard work – you can focus on processing
data
• Cost-saving, efficient, and reliable data
processing
Benefits of Hadoop
9. • Hadoop implements a computational paradigm named
Map/Reduce, where the application is divided into many small
fragments of work, each of which may be executed or re-executed
on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that
stores data on the compute nodes, providing very high aggregate
bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so
that node failures are automatically handled by the framework.
How Hadoop Works?
10. The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing
Hadoop Consists of:
• Hadoop Common: The common utilities that support the other
Hadoop subprojects.
• HDFS: A distributed file system that provides high throughput
access to application data.
• MapReduce: A software framework for distributed processing of
large data sets on compute clusters.
Hadoop Architecture
13. • Known as the Hadoop Distributed File System
• Primary storage system for Hadoop applications
• Multiple replicas of data blocks are distributed across compute nodes for
reliability
• Files are stored on multiple machines for durability and high availability
HDFS
14. • Distributed File System = holds a large amount of data and
provides access to this data to many clients distributed
across a network, e.g. NFS
• HDFS is designed to store far larger amounts of information than a typical DFS
• HDFS stores data reliably
• HDFS provides fast, scalable access to this information for a
large number of clients in the cluster
DFS vs. HDFS
15. • Optimized for long sequential reads
• Data is written once and read multiple times; no appends are possible
• Large files and sequential reads, so no local caching of data
• Data replication
HDFS
17. • Block-structured file system
• A file is divided into blocks and stored
• Each individual machine in the cluster is a DataNode
• Default block size is 64 MB
• Information about the blocks is stored as metadata
• All this metadata is stored on a machine called the NameNode (a minimal client-side read sketch follows this slide)
HDFS Architecture
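As an illustration of the NameNode/DataNode split described above, here is a minimal client-side sketch (not part of the original slides) that reads a file from HDFS through the Hadoop FileSystem API; the class name HdfsCat and the path /user/demo/input.txt are assumptions made for the example.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings (e.g. fs.default.name) from the configuration files on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path; the NameNode resolves it to blocks, the DataNodes serve the bytes
    Path path = new Path("/user/demo/input.txt");
    InputStream in = null;
    try {
      in = fs.open(path);
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}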
25. • HDFS handles the Distributed File System layer
• MapReduce is how we process the data
• MapReduce Daemons
JobTracker
TaskTracker
• Goals
Distribute the reading and processing of data
Localize the processing when possible
Share as little data as possible while processing
MapReduce
27. • One per cluster “master node”
• Takes jobs from clients
• Splits work into “tasks”
• Distributes “tasks” to TaskTrackers
• Monitors progress, deals with failures
Job Tracker
28. • Many per cluster “slave nodes”
• Does the actual work, executes the code for the job
• Talks regularly with JobTracker
• Launches child process when given a task
• Reports progress of running “task” back to JobTracker
Task Tracker
29. • Client submits a job:
"I want to count the occurrences of each word"
We will assume that the data to process is already in HDFS
• The JobTracker receives the job
• Queries the NameNode for the number of blocks in the file
• The job is split into tasks
• One map task per block
• As many reduce tasks as specified in the job
• TaskTrackers check in regularly with the JobTracker:
"Is there any work for me?"
• If the JobTracker has a map task for which the TaskTracker holds a local block of
the file being processed, the TaskTracker is given that “task”
Anatomy of Map Reduce Job
36. • Read text files and count how often words occur.
o The input is text files
o The output is a text file:
each line contains a word, a tab, and its count
• Map: produce pairs of (word, 1)
• Reduce: for each word, sum up the counts (a small worked example follows this slide)
Example of MapReduce - Word Count
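To make the data flow concrete, here is a small worked example (the input line is invented for illustration): given the line "the cat sat on the mat", the map phase emits (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); the framework groups these pairs by word, and the reduce phase sums each group, producing (the, 2), (cat, 1), (sat, 1), (on, 1), (mat, 1).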
37. public static class MapClass extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map (LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
Map Class
38. public static class ReduceClass extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException
{
int sum = 0;
while (values.hasNext())
{
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Reduce Class
39. public void run(String inputPath, String outputPath) throws Exception
{
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class); // the keys are words (strings)
conf.setOutputValueClass(IntWritable.class); // the values are counts (ints)
conf.setMapperClass(MapClass.class);
conf.setReducerClass(ReduceClass.class);
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);
}
Driver Class
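A minimal sketch of how the driver above might be invoked from a program entry point; the enclosing class name WordCount and the argument handling are assumptions, not part of the original slides:

public static void main(String[] args) throws Exception {
// args[0] = HDFS input path, args[1] = HDFS output path (must not exist yet)
new WordCount().run(args[0], args[1]);
}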
40. import static org.mockito.Mockito.*;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.*;
public class WordCountMapperTest {
@Test
public void processesValidRecord() throws IOException {
MapClass mapper = new MapClass();
Text value = new Text("test test");
OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
mapper.map(null, value, output, null);
// the mapper emits ("test", 1) once per token, so expect two identical calls
verify(output, times(2)).collect(new Text("test"), new IntWritable(1));
}
}
Junit For Mapper
41. Junit for Reducer
import static org.mockito.Mockito.*;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.*;
public class WordCountReducerTest {
@Test
public void sumsCountsForAKey() throws IOException {
ReduceClass reducer = new ReduceClass();
Text key = new Text("test");
Iterator<IntWritable> values = Arrays.asList(
new IntWritable(1), new IntWritable(1)).iterator();
OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
reducer.reduce(key, values, output, null);
verify(output).collect(key, new IntWritable(2));
}
}
42. Installation:
• Requirements: Linux, Java 1.6,
sshd
• Configure SSH for password-free
authentication
• Unpack the Hadoop distribution
• Edit a few configuration files
• Format the DFS on the NameNode
• Start all the daemon processes
Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with the
relevant arguments
• Monitor tasks via the web interface
(optional)
• Examine the output when the job is
complete
Let’s Go…