1. A new way to store and analyze data
Sandesh Deshmane
2. • What is Hadoop?
• Why, Where, When?
• Benefits of Hadoop
• How Hadoop Works?
• Hadoop Architecture
• HDFS
• Hadoop MapReduce
• Installation &
Execution
• Demo
Topics Covered
3. In pioneer days they used oxen for heavy pulling, and when
one ox couldn’t budge a log, they didn’t try to grow a
larger ox. We shouldn’t be trying for bigger computers, but
more systems of computers.
—Grace Hopper
History
4. • The size of the digital universe was estimated at 0.18 zettabytes in 2006 and
3 zettabytes in 2012
(1 zettabyte = 10^21 bytes = 1,000 exabytes = 1 million petabytes = 1 billion terabytes)
• The New York Stock Exchange generates about 1 TB of data per day
• Facebook stores around 10 billion photos, roughly 1 petabyte
• The Internet Archive stores about 1 petabyte of data, and it is growing by
around 20 TB per month
Background
5. • Created by Doug Cutting
• Open source, under the Apache Software Foundation
• Consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System
(HDFS)
b. High-performance parallel data processing using a technique called
MapReduce
• Hadoop runs large-scale, high-performance processing jobs reliably, in spite
of system changes or failures
What is Hadoop?
6. • Need to process 100 TB datasets
• On 1 node:
– scanning @ 50 MB/s = 23 days
• On a 1000-node cluster:
– scanning @ 50 MB/s = 33 min
• Need an efficient, reliable, and usable framework (the arithmetic is sketched after this slide)
Hadoop, Why?
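Rough arithmetic behind the numbers on the previous slide: 100 TB is about 10^14 bytes. Scanning at 50 MB/s (5 × 10^7 bytes/s) on a single node takes 10^14 / (5 × 10^7) = 2 × 10^6 seconds, roughly 23 days; spreading the same scan over 1,000 nodes takes about 2 × 10^3 seconds, roughly 33 minutes.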
7. Where
• Batch data processing, not
real-time / user facing (e.g.
Document Analysis and
Indexing, Web Graphs and
Crawling)
• Highly parallel, data-intensive
distributed applications
• Very large production
deployments
When
• Process lots of unstructured
data
• When your processing can
easily be made parallel
• Running batch jobs is
acceptable
• When you have access to lots
of cheap hardware
Where and When Hadoop?
8. • Runs on cheap commodity hardware
• Automatically handles data replication and node
failure
• It does the hard work – you can focus on processing
data
• Cost-saving, efficient, and reliable data
processing
Benefits of Hadoop
9. • Hadoop implements a computational paradigm named
Map/Reduce, where the application is divided into many small
fragments of work, each of which may be executed or re-executed
on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that
stores data on the compute nodes, providing very high aggregate
bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so
that node failures are automatically handled by the framework.
How Hadoop Works?
10. The Apache Hadoop project develops open-source software for reliable,
scalable, distributed computing
Hadoop Consists of:
• Hadoop Common: The common utilities that support the other
Hadoop subprojects.
• HDFS: A distributed file system that provides high throughput
access to application data.
• MapReduce: A software framework for distributed processing of
large data sets on compute clusters.
Hadoop Architecture
13. • Known as the Hadoop Distributed File System
• Primary storage system for Hadoop applications
• Multiple replicas of data blocks are distributed across compute nodes for
reliability
• Files are stored on multiple machines for durability and high availability
HDFS
14. • Distributed File System = holds a large amount of data and
provides access to this data to many clients distributed
across a network, e.g. NFS
• HDFS is designed to store far larger amounts of information than a typical DFS
• HDFS stores data reliably
• HDFS provides fast, scalable access to this information for a
large number of clients in the cluster
DFS vs. HDFS
15. • Optimized for long sequential reads
• Data is written once and read multiple times; no appends are possible
• Large files and sequential reads, so no local caching of data
• Data replication
HDFS
17. • Block-structured file system
• A file is divided into blocks and stored
• Each individual machine in the cluster is a DataNode
• Default block size is 64 MB
• Information about the blocks is stored as metadata
• All this metadata is stored on a machine called the NameNode (a minimal client-side read sketch follows this slide)
HDFS Architecture
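As an illustration of the NameNode/DataNode split described above, here is a minimal client-side sketch (not part of the original slides) that reads a file from HDFS through the Hadoop FileSystem API; the class name HdfsCat and the path /user/demo/input.txt are assumptions made for the example.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings (e.g. fs.default.name) from the configuration files on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path; the NameNode resolves it to blocks, the DataNodes serve the bytes
    Path path = new Path("/user/demo/input.txt");
    InputStream in = null;
    try {
      in = fs.open(path);
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}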
25. • HDFS handles the Distributed File System layer
• MapReduce is how we process the data
• MapReduce Daemons
JobTracker
TaskTracker
• Goals
Distribute the reading and processing of data
Localize the processing when possible
Share as little data as possible while processing
MapReduce
27. • One per cluster “master node”
• Takes jobs from clients
• Splits work into “tasks”
• Distributes “tasks” to TaskTrackers
• Monitors progress, deals with failures
Job Tracker
28. • Many per cluster “slave nodes”
• Does the actual work, executes the code for the job
• Talks regularly with JobTracker
• Launches child process when given a task
• Reports progress of running “task” back to JobTracker
Task Tracker
29. • Client submits a job:
"I want to count the occurrences of each word"
We will assume that the data to process is already in HDFS
• The JobTracker receives the job
• Queries the NameNode for the number of blocks in the file
• The job is split into tasks
• One map task per block
• As many reduce tasks as specified in the job
• TaskTrackers check in regularly with the JobTracker:
"Is there any work for me?"
• If the JobTracker has a map task for which the TaskTracker holds a local block of
the file being processed, the TaskTracker is given that “task”
Anatomy of Map Reduce Job
36. • Read text files and count how often words occur.
o The input is text files
o The output is a text file:
each line contains a word, a tab, and its count
• Map: produce pairs of (word, 1)
• Reduce: for each word, sum up the counts (a small worked example follows this slide)
Example of MapReduce - Word Count
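To make the data flow concrete, here is a small worked example (the input line is invented for illustration): given the line "the cat sat on the mat", the map phase emits (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); the framework groups these pairs by word, and the reduce phase sums each group, producing (the, 2), (cat, 1), (sat, 1), (on, 1), (mat, 1).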
37. public static class MapClass extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map (LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
Map Class
38. public static class ReduceClass extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException
{
int sum = 0;
while (values.hasNext())
{
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Reduce Class
39. public void run(String inputPath, String outputPath) throws Exception
{
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class); // the keys are words (strings)
conf.setOutputValueClass(IntWritable.class); // the values are counts (ints)
conf.setMapperClass(MapClass.class);
conf.setReducerClass(ReduceClass.class);
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);
}
Driver Class
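A minimal sketch of how the driver above might be invoked from a program entry point; the enclosing class name WordCount and the argument handling are assumptions, not part of the original slides:

public static void main(String[] args) throws Exception {
// args[0] = HDFS input path, args[1] = HDFS output path (must not exist yet)
new WordCount().run(args[0], args[1]);
}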
40. import static org.mockito.Mockito.*;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.*;
public class WordCountMapperTest {
@Test
public void processesValidRecord() throws IOException {
MapClass mapper = new MapClass();
Text value = new Text("test test");
OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
mapper.map(null, value, output, null);
// the mapper emits ("test", 1) once per token, so expect two identical calls
verify(output, times(2)).collect(new Text("test"), new IntWritable(1));
}
}
Junit For Mapper
41. Junit for Reducer
import static org.mockito.Mockito.*;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.OutputCollector;
import org.junit.*;
public class WordCountReducerTest {
@Test
public void sumsCountsForAKey() throws IOException {
ReduceClass reducer = new ReduceClass();
Text key = new Text("test");
Iterator<IntWritable> values = Arrays.asList(
new IntWritable(1), new IntWritable(1)).iterator();
OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
reducer.reduce(key, values, output, null);
verify(output).collect(key, new IntWritable(2));
}
}
42. Installation:
• Requirements: Linux, Java 1.6,
sshd
• Configure SSH for password-free
authentication
• Unpack the Hadoop distribution
• Edit a few configuration files
• Format the DFS on the NameNode
• Start all the daemon processes
Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with the
relevant arguments
• Monitor tasks via the web interface
(optional)
• Examine the output when the job is
complete
Let’s Go…