An Introduction to Apache Hadoop and its Core Components HDFS and MapReduce
1. Apache Hadoop
an introduction
Todd Lipcon
todd@cloudera.com
@tlipcon @cloudera
May 27, 2010
Thursday, May 27, 2010
2. Hi there!
Software Engineer at Cloudera
Hadoop contributor, HBase committer
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
3. Outline
Why should you care? (Intro)
What is Hadoop?
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions
9. “I keep saying that the sexy job
in the next 10 years will be
statisticians, and I’m not kidding.”
Hal Varian
(Google’s chief economist)
10. Are you throwing
away data?
Data comes in many shapes and sizes:
relational tuples, log files, semi-structured
textual data (e.g., e-mail), …
Are you throwing it away because it
doesn’t ‘fit’?
11. So, what’s Hadoop?
The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
12. Apache Hadoop is an
open-source system
to reliably store and process
gobs of information
across many commodity computers.
13. Two Core Components
HDFS: self-healing, high-bandwidth clustered storage.
Map/Reduce: fault-tolerant distributed computing.
14. What makes
Hadoop special?
15. Assumption 1: Machines can be reliable...
Image: MadMan the Mighty CC BY-NC-SA
16. Hadoop separates distributed
system fault-tolerance code
from application logic.
[Slide graphic labels: Unicorns, Systems Programmers, Statisticians]
17. Assumption 2: Machines have identities...
Image: Laughing Squid CC BY-NC-SA
18. Hadoop lets you interact
with a cluster, not a bunch
of machines.
Image: Yahoo! Hadoop cluster [OSCON ’07]
19. Assumption 3: Your analysis fits on one machine
Image: Matthew J. Stinson CC-BY-NC
20. Hadoop scales linearly
with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image data
Simple SQL-style queries on >100TB of clickstream data
One Hadoop works for both applications!
21. A Typical Look...
5-4000 commodity servers
(8-core, 8-24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
22. Image: Josh Hough CC BY-NC-SA
STOP!
REAL METAL?
Isn’t this some kind of
“Cloud Computing” conference?
Hadoop runs as a cloud (a cluster)
and maybe in a cloud (e.g., EC2).
24. dramatis personae
Starring...
NameNode (metadata server and database)
SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)
The Chorus…
DataNodes (block storage)
TaskTrackers (task execution)
Thanks to Zak Stone for earmuff image!
25. HDFS
[Diagram: the NameNode holds filesystem metadata; DataNodes across two racks ("One Rack", "A Different Rack") hold the blocks: a 3x64MB file and a 4x64MB file with 3 replicas per block, and a small file with 7 replicas.]
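The arithmetic behind the diagram is just ceiling division plus a multiply: a file is split into fixed-size blocks, and each block is stored replication-factor times. A plain-Java sketch (class name and constant are illustrative; in HDFS the block size and replication factor are per-file settings):

```java
// Sketch: how many blocks and block replicas HDFS stores for a file.
// Plain Java, no Hadoop dependency; the 64MB block size mirrors the slide.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB

    static long numBlocks(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    static long numReplicas(long fileBytes, int replication) {
        return numBlocks(fileBytes) * replication;
    }

    public static void main(String[] args) {
        long threeBlockFile = 3 * BLOCK_SIZE;               // the "3x64MB file"
        System.out.println(numBlocks(threeBlockFile));      // 3 blocks
        System.out.println(numReplicas(threeBlockFile, 3)); // 9 block replicas
    }
}
```

So the slide's 3x64MB file at replication 3 occupies 9 block replicas spread across the DataNodes.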
27. HDFS Failures?
Datanode crash?
Clients read another copy
Background rebalance/rereplicate
Namenode crash?
uh-oh
not responsible for the majority of downtime!
28. The M/R
Programming Model
29. You specify map()
and reduce()
functions.
The framework does
the rest.
31. map()
map: K₁,V₁→list K₂,V₂
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException,
InterruptedException {
// context.write() can be called many times
// this is the default "identity mapper" implementation
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
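A more useful override than the identity mapper is the classic word count, where map() tokenizes each input line and emits (word, 1). A minimal stand-alone sketch in plain Java (the class name and the emit-into-a-list shape are mine; a real job would subclass Hadoop's Mapper and call context.write()):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a word-count map(): for each input line, emit (word, "1").
// Plain-Java stand-in for a Hadoop Mapper subclass, so it runs without
// the Hadoop jars; the returned list plays the role of context.write().
public class WordCountMap {
    static List<String[]> map(String line) {
        List<String[]> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new String[] { word, "1" }); // emit (word, 1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : map("the quick brown the")) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```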
32. (the shuffle)
map output is assigned to a “reducer”
map output is sorted by key
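The first bullet's assignment is done by a partitioner; Hadoop's default HashPartitioner takes the key's hashCode, masked to be non-negative, modulo the number of reduce tasks. A standalone sketch of that scheme (class and method names are mine, no Hadoop dependency):

```java
// Sketch of how the shuffle assigns a map-output key to a reducer,
// mimicking Hadoop's default HashPartitioner: every occurrence of the
// same key hashes to the same reducer, so reduce() sees all its values.
public class Shuffle {
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Same key, same reducer, every time.
        System.out.println(partition("hadoop", 4) == partition("hadoop", 4)); // true
    }
}
```

This is what makes the model work: the shuffle guarantees every value for a given key lands on one reducer, already sorted by key.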
33. reduce()
K₂, iter(V₂)→list(K₃,V₃)
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
/**
* This method is called once for each key. Most applications will define
* their reduce class by overriding this method. The default implementation
* is an identity function.
*/
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
}
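For word count, the matching reduce() simply sums the 1s emitted for each word. Again a plain-Java sketch rather than a real Reducer subclass (names are mine; the shuffle has already grouped all values for a key before reduce() is called):

```java
// Sketch of a word-count reduce(): total the counts emitted for one key.
// Plain-Java stand-in for a Hadoop Reducer subclass; the return value
// plays the role of context.write(key, sum).
public class WordCountReduce {
    static long reduce(String key, Iterable<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum; // emit (key, sum)
    }

    public static void main(String[] args) {
        // Three mappers emitted ("hadoop", 1); the reducer totals them.
        System.out.println(reduce("hadoop", java.util.Arrays.asList(1L, 1L, 1L))); // 3
    }
}
```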
35. Some samples...
Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCRing many images.
Learning ML models. (e.g., Naive Bayes
for text classification)
Augment traditional BI/DWH
technologies (by archiving raw data).
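The first sample, an inverted index, fits the model naturally: map() emits (word, docId) for each word in a document, and reduce() collects the doc ids per word. A plain-Java, single-process sketch of the same idea (all names mine; a real job would split this into Mapper and Reducer classes):

```java
import java.util.*;

// Sketch of an inverted index: for each document, emit (word, docId);
// then group doc ids by word. Both M/R phases are collapsed into one
// in-memory pass here purely for illustration.
public class InvertedIndex {
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String word : docs.get(docId).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> idx =
            build(Arrays.asList("hadoop stores data", "hadoop processes data"));
        System.out.println(idx.get("hadoop")); // [0, 1]
        System.out.println(idx.get("stores")); // [0]
    }
}
```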
36. M/R
[Diagram: TaskTrackers run on the same machines as DataNodes, across two racks ("One Rack", "A Different Rack"); tasks from one job (marked with stars) and a different job run side by side, with some task slots idle.]
38. M/R Failures
Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence
39. There’s more than the Java API
Streaming: perl, python, ruby, whatever (via stdin/stdout/stderr).
Pig: higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
Hive: SQL interface. Great for analysts. Developed at Facebook.
Many tasks actually require a series of M/R jobs; that’s ok!
41. Hadoop in the Wild
(yes, it’s used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC ’09)
Facebook: 15TB new data per day;
10000+ cores, 12+ PB
Twitter: ~1TB per day, ~80 nodes
Lots of 5-40 node clusters at companies without petabytes
of data (web, retail, finance, telecom, research)
42. Ok, fine, what next?
Get Hadoop!
Cloudera’s Distribution for Hadoop
http://hadoop.apache.org/
Try it out! (Locally, or on EC2)
Watch free training videos on
http://cloudera.com/