1. Respect for the elephant – Hadoop
Aditya Sakhuja
aditya@sakhuja.us
2. Whoami
• Software Engineer @ Yahoo Inc.
• Web Search -> Cloud Platforms -> Display Ads Serving
• http://linkedin.com/in/adityasakhuja
9/24/11 PyCon UK 2011
3. Agenda
• Motivation
• History
• Ecosystem
• Daemon processes / High Level View
• Map Reduce Data Flow
• HDFS Architecture / Replication
• Can / Cannot
• Getting started yourself
• Demo
• Companies Involved
• Q&A
4. Motivation
• ‘Traditional’ large-scale computing systems - problems
• Desired features in an improved system
• How Hadoop addresses them
5. ‘Traditional’ large-scale computing systems - problems
• CPU intensive over data intensive
• MPI, PVM, RPCs – parallel computation frameworks
• Programming for traditional distributed systems is complex
– Data exchange requires synchronization
– Temporal dependencies are complicated
– It is difficult to deal with partial failures of the system
• Data typically stored on SAN
• Data brought to compute nodes at runtime
6. Desired Features in a Large-Scale Data System
• Data driven
– A new, improved system should avoid data bottlenecks
• Scalable
• Consistent
• Recoverable (data / processor)
• Partial failure support
7. What Hadoop offers
• Provides a high-level programming model
– No need to worry about locking, temporal dependencies, sockets, ...
• And all the features on the desired list (previous slide) :)
8. History
• Hadoop is based on work done by Google in the late 1990s/early 2000s
• Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
• Hadoop MapReduce NextGeneration – 2011
– http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
9. Apache Hadoop Ecosystem
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
• Cassandra™: A scalable multi-master database with no single points of failure.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
Source: http://hadoop.apache.org/
15. Map Reduce Program Components
• MapReduce programs generally consist of three portions
– The Mapper
– The Reducer
– The driver code
• Additional components:
– Combiner (often the same code as the Reducer)
– Custom Partitioner
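To make the roles of these components concrete, here is a minimal in-process sketch of a word count job in plain Python. It does not use Hadoop's API: the names `mapper`, `reducer`, `combiner`, `partitioner`, and `run_job` are hypothetical, and the "cluster" is simulated with ordinary lists and dictionaries.

```python
from collections import defaultdict
from zlib import crc32


def mapper(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reducer: sum all counts seen for one key."""
    return key, sum(values)


# Summing is associative and commutative, so the reducer can double
# as the combiner (the "often the same code" case).
combiner = reducer


def partitioner(key, num_reducers):
    """Custom partitioner: choose which reduce task receives a key."""
    return crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    """Driver: wire the pieces together and run the job in-process."""
    # One {key: [values]} bucket per simulated reduce task.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # each split stands in for one map task
        local = defaultdict(list)
        for line in split:
            for key, value in mapper(line):
                local[key].append(value)
        # Combine on the map side, then route each pair to its partition.
        for key, values in local.items():
            key, value = combiner(key, values)
            partitions[partitioner(key, num_reducers)][key].append(value)
    # Reduce phase: partitions are independent; keys are visited in sorted order.
    results = {}
    for bucket in partitions:
        for key in sorted(bucket):
            out_key, out_value = reducer(key, bucket[key])
            results[out_key] = out_value
    return results
```

Calling `run_job([["the elephant", "the hadoop"], ["the elephant"]])` treats each inner list as the input of one map task. Because the partitioner sends every occurrence of a key to the same bucket, each word is totalled by exactly one simulated reducer.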
16. Hadoop Is / Is Not
• High bandwidth, high latency system
• Not a substitute for a DBMS, at least not on its own
• HDFS is not yet a highly available FS. NameNode is a SPOF
• Is a “shared nothing” architecture
– Mappers do not talk to each other, and neither do Reducers
17. Getting started yourself
Requirements:
• Java SE SDK (download JDK 6 or higher)
• Download and install:
– Hadoop Common: 0.20.203.X - current stable version
– Hadoop HDFS: 0.21 – stable version
– Hadoop MapReduce: 0.21 – stable version
• Subscribe to mailing lists for Hadoop subprojects, depending on your role
• Additionally/alternatively, one can set up VMs from Cloudera / Yahoo
• Details:
– http://wiki.apache.org/hadoop/GettingStartedWithHadoop
– http://developer.yahoo.com/hadoop/tutorial/module7.html#basic
18. Simple Demo
• Using
– Pig
– Map/Reduce
19. Streaming Jobs
• Any language that can read from stdin and write to stdout
• hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myMapScript.py \
    -reducer myReduceScript.py \
    -file myMapScript.py \
    -file myReduceScript.py
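As an illustration of what such scripts might contain, the sketch below shows a word count mapper and reducer in Python. It relies on two Streaming conventions: records are tab-separated `key<TAB>value` lines, and the reducer receives its input sorted by key. The function names `map_lines`, `reduce_lines`, and `main` are hypothetical, and the two roles are shown in one block only for brevity.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper side: turn each input line into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer side: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def main(role, stdin=sys.stdin, stdout=sys.stdout):
    """Run one role ('map' or 'reduce') as a stdin-to-stdout filter."""
    step = map_lines if role == "map" else reduce_lines
    for record in step(stdin):
        stdout.write(record + "\n")
```

Split into two files, one ending with `main("map")` and the other with `main("reduce")`, these could play the parts of myMapScript.py and myReduceScript.py. The same pipeline can be tried locally without a cluster: `list(reduce_lines(sorted(map_lines(["a b", "a"]))))`, where `sorted` stands in for the shuffle between the two phases.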
20. Companies involved
• Yahoo - 4500-node cluster (2*4 cores, 4*1 TB disk, 16 GB RAM) – (AdServer, Search)
• HortonWorks, Cloudera
• Facebook
• A9 (Amazon Product Search)
• eBay - 532-node cluster – (8 * 532 cores, 5.3 PB)
• Last.fm, Twitter ...
• ... a lot more can be found at the link below:
http://wiki.apache.org/hadoop/PoweredBy