Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also covers the Hadoop ecosystem and real business use cases built around Hadoop and that ecosystem.
EclipseCon Keynote: Apache Hadoop - An Introduction
1. Apache Hadoop
an introduction
Todd Lipcon
todd@cloudera.com
@tlipcon @cloudera
March 24, 2011
2. Introductions
Software Engineer at Cloudera
Committer on Apache Hadoop, HBase, Thrift
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
3. Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
The Hadoop Ecosystem
Use Cases
Experiences as a developer
7. Photo by C.C Chapman (CC BY-NC-ND)
http://www.flickr.com/photos/cc_chapman/3342268874/
11. “Every two days we create as
much information as we did
from the dawn of civilization
up until 2003.”
Eric Schmidt
12. “I keep saying that the sexy
job in the next 10 years will be
statisticians. And I’m not
kidding.”
Hal Varian
(Google’s chief economist)
13. Are you throwing away data?
Data comes in many shapes and sizes:
relational tuples, log files,
semistructured textual data (e.g., e-mail), …
Are you throwing it away because it
doesn’t ‘fit’?
14. So, what’s Hadoop?
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
16. Apache Hadoop is an
open-source system
to reliably store and process
GOBS of data
across many commodity computers.
17. Two Core Components
Store: HDFS, self-healing, high-bandwidth clustered storage.
Process: MapReduce, fault-tolerant distributed processing.
22. Hadoop lets you interact
with a cluster, not a
bunch of machines.
Image: Yahoo! Hadoop cluster [OSCON ’07]
23. Falsehood #3: Your analysis fits on one machine…
Image: Matthew J. Stinson CC-BY-NC
24. Hadoop scales linearly
with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image data
Simple SQL queries on >100TB of clickstream data
Hadoop works for both applications!
25. Hadoop sounds like magic.
Coincidentally, today is Houdini’s birthday,
though he was not a Hadoop committer.
How is it possible?
26. A Typical Look...
5-4000 commodity servers
(8-core, 24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
33. You specify map()
and reduce()
functions.
The framework
does the rest.
34. map()
map: (K₁, V₁) → list(K₂, V₂)
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
Key: userimage
Value: 2326 bytes
The map function runs on the same node where the
data is stored!
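As a rough plain-Java sketch of what this map step does (not the actual Hadoop Mapper API, which takes a Context and runs inside the framework; the regex and the `LogMapSketch` class name are assumed here for illustration):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogMapSketch {
    // Pulls the request path's first segment and the response size
    // out of a common-log-format line like the one on this slide.
    private static final Pattern LOG =
        Pattern.compile("\"GET /([a-z]+)/\\d+ [^\"]*\" \\d+ (\\d+)");

    // map: (byte offset, log line) -> (page type, bytes served)
    static Map.Entry<String, Long> map(long offset, String line) {
        Matcher m = LOG.matcher(line);
        if (!m.find()) return null; // skip unparseable lines
        return new SimpleEntry<>(m.group(1), Long.parseLong(m.group(2)));
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                    + "\"GET /userimage/123 HTTP/1.0\" 200 2326";
        Map.Entry<String, Long> out = map(193284L, line);
        System.out.println(out.getKey() + " -> " + out.getValue());
        // prints: userimage -> 2326
    }
}
```

In real Hadoop the key/value types and the emit mechanism come from the framework; only the parsing logic in the middle is yours to write.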
35. Input Format
• Wait! HDFS is not a Key-Value store!
• InputFormat interprets bytes as a Key
and Value
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]
"GET /userimage/123 HTTP/1.0" 200 2326
Key: log offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
36. The Shuffle
Each map output is assigned to a reducer based on its key.
Map output is grouped and sorted by key.
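A minimal plain-Java simulation of the shuffle plus a summing reduce() can make this concrete (the real framework does the grouping and sorting across machines; the `ShuffleSketch` class, the pair type, and the choice to sum values are assumed for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Shuffle: group map outputs by key. A TreeMap keeps keys sorted,
    // mirroring the sort the framework performs before reduce() runs.
    static TreeMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> mapOutputs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutputs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    // reduce: (K₂, list(V₂)) -> V₃; here, total bytes per page type.
    static long reduce(String key, List<Long> values) {
        return values.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> outputs = List.of(
            Map.entry("userimage", 2326L),
            Map.entry("profile", 512L),
            Map.entry("userimage", 1024L));
        shuffle(outputs).forEach((k, vs) ->
            System.out.println(k + " -> " + reduce(k, vs)));
        // prints: profile -> 512
        //         userimage -> 3350
    }
}
```

Each reducer receives one key at a time with all of that key's values already grouped, which is why a reduce function can be as simple as a sum.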
39. Hadoop is
not just MapReduce
(NoNoSQL!)
Hive project adds SQL
support to Hadoop
HiveQL (SQL dialect)
compiles to a query plan
Query plan executes as
MapReduce jobs
40. Hive Example
CREATE TABLE movie_rating_data (
  userid INT, movieid INT, rating INT, unixtime STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;

CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating) AS avg_rating
FROM movie_rating_data
GROUP BY movieid;
42. Hadoop in the Wild
(yes, it’s used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC ’09)
Facebook: 15TB new data per day;
1200 machines, 21PB in one cluster
Twitter: >1TB per day, ~120 nodes
Lots of 5-40 node clusters at companies without
petabytes of data (web, retail, finance, telecom,
research)
44. Product Recommendations
• Naïve approach: Users who bought toothpaste bought
toothbrushes.
• Hadoop approach
• What products did a user browse, hover over, rate, or
add to cart (but not buy) in the last 2 months?
• What are the attributes of the user?
• What are our margins, promotions, inventory, etc?
45. Product Recommendations
• A lot of data!
• Activity: ~20GB/day x ~60 days = 1.2TB
• User Data: 2GB
• Purchase Data: ~5GB
• Pre-aggregating loses fidelity for individual users.
52. Hadoop and Java
(the good)
Integration, integration, integration!
Tooling: IDEs, JCarder, AspectJ,
Maven/Ivy
Developer accessibility
53. Hadoop and Java
(the bad)
Java is great for applications. Hadoop is
systems programming.
JNI is our hammer
Compression, Security, FS access
C++ wrapper for setuid task execution
54. Hadoop and Java
(the ugly)
JVM bugs!
Garbage Collection pauses on 50GB
heaps
WORA (“write once, run anywhere”) is a giant
lie for systems – worst of both worlds?
55. Ok, fine, what next?
Get Hadoop!
CDH - Cloudera’s Distribution
including Apache Hadoop
http://cloudera.com/
http://hadoop.apache.org/
Try it out! (Locally, VM, or EC2)
Watch free training videos on
http://cloudera.com/