Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also covers the Hadoop ecosystem and real business use cases built around Hadoop and that ecosystem.
EclipseCon Keynote: Apache Hadoop - An Introduction
1. Apache Hadoop
an introduction
Todd Lipcon
todd@cloudera.com
@tlipcon @cloudera
March 24, 2011
2. Introductions
Software Engineer at Cloudera
Committer on Apache Hadoop, HBase, Thrift
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
3. Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
The Hadoop Ecosystem
Use Cases
Experiences as a developer
7. Photo by C.C Chapman (CC BY-NC-ND)
http://www.flickr.com/photos/cc_chapman/3342268874/
11. “Every two days we create as
much information as we did
from the dawn of civilization
up until 2003.”
Eric Schmidt
12. “I keep saying that the sexy
job in the next 10 years will be
statisticians. And I’m not
kidding.”
Hal Varian
(Google’s chief economist)
13. Are you throwing away data?
Data comes in many shapes and sizes:
relational tuples, log files,
semistructured textual data (e.g., e-mail), …
Are you throwing it away because it
doesn’t ‘fit’?
14. So, what’s Hadoop?
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
16. Apache Hadoop is an
open-source system
to reliably store and process
GOBS of data
across many commodity computers.
17. Two Core Components
Store: HDFS, self-healing, high-bandwidth clustered storage.
Process: MapReduce, fault-tolerant distributed processing.
22. Hadoop lets you interact
with a cluster, not a
bunch of machines.
Image: Yahoo! Hadoop cluster [OSCON ’07]
23. Falsehood #3: Your analysis fits on one machine…
Image: Matthew J. Stinson CC-BY-NC
24. Hadoop scales linearly
with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image data
Simple SQL queries on >100TB of clickstream data
Hadoop works for both applications!
25. Hadoop sounds like magic.
Coincidentally, today is Houdini’s birthday,
though he was not a Hadoop committer.
How is it possible?
26. A Typical Look...
5-4000 commodity servers
(8-core, 24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
33. You specify map()
and reduce()
functions.
The framework
does the rest.
34. map()
map: (K₁, V₁) → list(K₂, V₂)
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
Key: userimage
Value: 2326 bytes
The map function runs on the same node where the
data is stored!
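As a rough plain-Java sketch of what this map step does (not the actual Hadoop Mapper API, which takes a Context and runs inside the framework; the regex and the `LogMapSketch` class name are assumed here for illustration):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogMapSketch {
    // Pulls the request path's first segment and the response size
    // out of a common-log-format line like the one on this slide.
    private static final Pattern LOG =
        Pattern.compile("\"GET /([a-z]+)/\\d+ [^\"]*\" \\d+ (\\d+)");

    // map: (byte offset, log line) -> (page type, bytes served)
    static Map.Entry<String, Long> map(long offset, String line) {
        Matcher m = LOG.matcher(line);
        if (!m.find()) return null; // skip unparseable lines
        return new SimpleEntry<>(m.group(1), Long.parseLong(m.group(2)));
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                    + "\"GET /userimage/123 HTTP/1.0\" 200 2326";
        Map.Entry<String, Long> out = map(193284L, line);
        System.out.println(out.getKey() + " -> " + out.getValue());
        // prints: userimage -> 2326
    }
}
```

In real Hadoop the key/value types and the emit mechanism come from the framework; only the parsing logic in the middle is yours to write.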
35. Input Format
• Wait! HDFS is not a Key-Value store!
• InputFormat interprets bytes as a Key
and Value
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]
"GET /userimage/123 HTTP/1.0" 200 2326
Key: log offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
36. The Shuffle
Each map output is assigned to a reducer based on its key.
Map output is grouped and sorted by key.
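A minimal plain-Java simulation of the shuffle plus a summing reduce() can make this concrete (the real framework does the grouping and sorting across machines; the `ShuffleSketch` class, the pair type, and the choice to sum values are assumed for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Shuffle: group map outputs by key. A TreeMap keeps keys sorted,
    // mirroring the sort the framework performs before reduce() runs.
    static TreeMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> mapOutputs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutputs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    // reduce: (K₂, list(V₂)) -> V₃; here, total bytes per page type.
    static long reduce(String key, List<Long> values) {
        return values.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> outputs = List.of(
            Map.entry("userimage", 2326L),
            Map.entry("profile", 512L),
            Map.entry("userimage", 1024L));
        shuffle(outputs).forEach((k, vs) ->
            System.out.println(k + " -> " + reduce(k, vs)));
        // prints: profile -> 512
        //         userimage -> 3350
    }
}
```

Each reducer receives one key at a time with all of that key's values already grouped, which is why a reduce function can be as simple as a sum.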
39. Hadoop is
not just MapReduce
(NoNoSQL!)
Hive project adds SQL
support to Hadoop
HiveQL (SQL dialect)
compiles to a query plan
Query plan executes as
MapReduce jobs
40. Hive Example
CREATE TABLE movie_rating_data (
  userid INT, movieid INT, rating INT, unixtime STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;

CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating) AS avg_rating
FROM movie_rating_data
GROUP BY movieid;
42. Hadoop in the Wild
(yes, it’s used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC ’09)
Facebook: 15TB new data per day;
1200 machines, 21PB in one cluster
Twitter: >1TB per day, ~120 nodes
Lots of 5-40 node clusters at companies without
petabytes of data (web, retail, finance, telecom,
research)
44. Product Recommendations
• Naïve approach: Users who bought toothpaste bought
toothbrushes.
• Hadoop approach
• What products did a user browse, hover over, rate, or
add to cart (but not buy) in the last 2 months?
• What are the attributes of the user?
• What are our margins, promotions, inventory, etc?
45. Product Recommendations
• A lot of data!
• Activity: ~20GB/day x ~60 days = 1.2TB
• User Data: 2GB
• Purchase Data: ~5GB
• Pre-aggregating loses fidelity for individual users.
52. Hadoop and Java
(the good)
Integration, integration, integration!
Tooling: IDEs, JCarder, AspectJ,
Maven/Ivy
Developer accessibility
53. Hadoop and Java
(the bad)
Java is great for applications. Hadoop is
systems programming.
JNI is our hammer
Compression, Security, FS access
C++ wrapper for setuid task execution
54. Hadoop and Java
(the ugly)
JVM bugs!
Garbage Collection pauses on 50GB
heaps
WORA (“write once, run anywhere”) is a giant
lie for systems – worst of both worlds?
55. Ok, fine, what next?
Get Hadoop!
CDH - Cloudera’s Distribution
including Apache Hadoop
http://cloudera.com/
http://hadoop.apache.org/
Try it out! (Locally, VM, or EC2)
Watch free training videos on
http://cloudera.com/