Introduction to Hadoop and MapReduce

Csaba Toth
GDG Fresno Meeting
Date: February 6th, 2014
Location: The Hashtag, Fresno

Agenda
• Big Data
• A little history
• Hadoop
• MapReduce
• Demo: Hadoop with Google Compute Engine and Google Cloud Storage

Big Data
• Wikipedia: “collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing applications”
• Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
– 100 Terabytes of data is uploaded to Facebook every day
– Facebook stores, processes, and analyzes more than 30 Petabytes of
user-generated data
– Twitter generates 12 Terabytes of data every day
– LinkedIn processes and mines Petabytes of user data to power the
"People You May Know" feature
– YouTube users upload 48 hours of new video content every minute of
the day
– Decoding of the human genome used to take 10 years. Now it can be
done in 7 days

Big Data characteristics
• Three Vs: Volume, Velocity, Variety
• Sources:
– Science, Sensors, Social networks, Log files
– Public Data Stores, Data warehouse appliances
– Network and in-stream monitoring technologies
– Legacy documents

• Main problems:
– Storage Problem
– Money Problem
– Consuming and processing the data

A Little History
Two seminal papers:
• “The Google File System” - October 2003
http://labs.google.com/papers/gfs.html
– Describes a scalable, distributed, fault-tolerant file system
tailored for data-intensive applications that runs on inexpensive
commodity hardware and delivers high aggregate performance

• “MapReduce: Simplified Data Processing on Large Clusters”
- April 2004 http://queue.acm.org/detail.cfm?id=988408
– Describes a programming model and an implementation for
processing large data sets:
1. A map function that processes a key/value pair to generate a set of
intermediate key/value pairs
2. A reduce function that merges all intermediate values associated with
the same intermediate key
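
The two functions can be sketched with word count. This is a minimal single-machine illustration, not Hadoop's API: the names map_fn, reduce_fn and run_mapreduce are hypothetical, and the driver only imitates sequentially what the framework does across a cluster.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (document name, document text) -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Hypothetical sample input: (document name, text) pairs.
documents = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Here word_counts maps every distinct word to the number of times it occurs across both documents.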

Hadoop
• Hadoop is an open-source software framework
that supports data-intensive distributed
applications.
• It is written in Java and runs on JVMs
• Named after the toy elephant of its creator’s (Doug Cutting, Yahoo)
son
• Hadoop manages a cluster of commodity
hardware computers. The cluster is composed of
a single master node and multiple worker nodes

Hadoop vs RDBMS
                          Hadoop / MapReduce                     RDBMS
Size of data              Petabytes                              Gigabytes
Integrity of data         Low                                    High (referential, typed)
Data schema               Dynamic                                Static
Access method             Batch                                  Interactive and batch
Scaling                   Linear                                 Nonlinear (worse than linear)
Data structure            Unstructured                           Structured
Normalization of data     Not required                           Required
Query response time       Has latency (due to batch processing)  Can be near immediate

MapReduce
• Hadoop uses the MapReduce programming model,
which is optimized for processing large data sets.
• MapReduce is an essential technique for distributed
computing on clusters of computers (nodes).
• The goal of MapReduce is to break huge data sets into
smaller pieces, distribute those pieces to various
worker nodes, and process the data in parallel.
• Hadoop uses a distributed file system to store the
data on the various nodes.

MapReduce
• It is built around two functions: map and reduce
1. Map step:
– Divides the problem into smaller sub-problems. A master
node distributes the work to worker nodes. Each worker
node processes its own piece and returns the result to
the master node.

2. Reduce step:
– Once the master has the results from the worker
nodes, the reduce step takes over and combines them.
The combined results form the final answer, which
becomes the output.

MapReduce – Map step
• There is a master node and many slave nodes.
• The master node takes the input, divides it into
smaller sub-problems, and distributes them
to worker (slave) nodes. A worker node may do
this again in turn, leading to a multi-level tree
structure.
• Each worker/slave node processes its smaller
problem and passes the answer back to
its master node.
• Each mapping operation is independent of the
others, so all maps can be performed in parallel.
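
Because map operations do not depend on each other, they can be run side by side. The sketch below imitates that on one machine: a thread pool stands in for the worker nodes, and split_input and map_split are hypothetical helper names, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor

def split_input(lines, num_splits):
    """Divide the input into roughly equal, independent splits."""
    return [lines[i::num_splits] for i in range(num_splits)]

def map_split(split):
    """Map task: emit a (word, 1) pair for every word in one split."""
    return [(word, 1) for line in split for word in line.split()]

# Hypothetical sample input.
lines = ["a b a", "c a", "b b", "c"]
splits = split_input(lines, num_splits=2)

# Map tasks share no state, so they can run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_split, splits))

# Flatten the per-split outputs into one list of intermediate pairs.
intermediate = [pair for part in mapped for pair in part]
```

Python threads only illustrate the independence of the tasks; in a real cluster each split is processed on a separate machine.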

MapReduce – Reduce step
• The master node then collects the answers
from the worker or slave nodes. It then
aggregates the answers and creates the
needed output, which is the answer to the
problem it was originally trying to solve.
• Reducers can also perform the reduction
phase in parallel. That is how the system can
process petabytes in a matter of hours.
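
Parallel reduction works when every key is sent to exactly one reducer. The sketch below assumes a simple hash partitioning scheme (CRC32 modulo the number of reducers); the helper names are hypothetical, and a thread pool again stands in for the reducer nodes.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(key, num_reducers):
    """Pick a reducer for a key; all pairs with that key go to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_partition(pairs):
    """Reduce task: sum the values for each key in one partition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical intermediate output of the map step.
intermediate = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
num_reducers = 2

partitions = [[] for _ in range(num_reducers)]
for key, value in intermediate:
    partitions[partition(key, num_reducers)].append((key, value))

with ThreadPoolExecutor(max_workers=num_reducers) as pool:
    partial_results = list(pool.map(reduce_partition, partitions))

# Partitions hold disjoint keys, so merging is a plain union.
result = {}
for partial in partial_results:
    result.update(partial)
```

Since no key appears in two partitions, the reducers never need to coordinate with each other.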

Map, Shuffle, and Reduce

https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
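
The shuffle between map and reduce can be sketched as sorting the intermediate pairs by key and grouping the values per key. This is a minimal in-memory illustration with a hypothetical shuffle helper; in Hadoop the same step moves data between nodes.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort intermediate pairs by key and group the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# Hypothetical map output for a word count.
map_output = [("fox", 1), ("the", 1), ("dog", 1), ("the", 1)]
grouped = shuffle(map_output)

# Each reducer call then sees one key with all of its values.
reduced = {key: sum(values) for key, values in grouped}
```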

Word count

http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png

Hadoop architecture
• Job Tracker
• Task Tracker
• Name Node
• Data Node

Figures
• Following: some figures from the book
Hadoop: The Definitive Guide, 3rd Edition

A client reading data from HDFS

MapReduce data flow with a single
reduce task

MapReduce data flow with multiple
reduce tasks

Hadoop architecture
Layers, from top to bottom:
• Presentation Layer: Web Browser (JS)
• Data Mining (Pegasus, Mahout); Index, Searches (Lucene); DB drivers (Hive driver)
• Advanced Query Engine (Hive, Pig)
• Computing Layer (MapReduce)
• Storage Layer (HDFS)
• Data Integration Layer: Flume (log data), Sqoop (RDBMS)

Demo
• Google Compute Engine + Google Cloud Storage
• Using Ubuntu as a remote control host
• Following the tutorial of:
– https://github.com/GoogleCloudPlatform/solutions-google-computeengine-cluster-for-hadoop
– Hadoop on Google Compute Engine for Processing Big Data:
https://www.youtube.com/watch?v=se9vV8eIZME

• The example Hadoop job is an advanced version of
word count in Perl or Python: the words are sorted by
length and then alphabetically
• Also showing the Google Developer Tool web interface
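
The ordering used by the example job (by word length, then alphabetically) can be sketched with a compound sort key. This is not the tutorial's actual script; sort_words and the sample counts are hypothetical.

```python
def sort_words(counts):
    """Order (word, count) pairs by word length, then alphabetically."""
    return sorted(counts.items(), key=lambda item: (len(item[0]), item[0]))

# Hypothetical word count result.
counts = {"hadoop": 2, "map": 3, "a": 5, "big": 1, "reduce": 3}
ordered = sort_words(counts)
```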

References
• Google’s tutorial (see github and YouTube link of the
Demo)
• Tom White: Hadoop: The Definitive Guide, 3rd Edition,
Yahoo Press
• Lynn Langit’s various presentations and YouTube videos
• Dattatrey Sindol: Big Data Basics - Part 1 - Introduction
to Big Data
• Bruno Terkaly’s presentations (for example Hadoop on
Azure: Introduction)
• Daniel Jebaraj: Ignore HDInsight at Your Own Peril:
Everything You Need to Know
