Hadoop pycon2011uk

Respect for the elephant – Hadoop

Aditya Sakhuja

aditya@sakhuja.us

Whoami

•  Software Engineer @ Yahoo Inc.

•  Web Search -> Cloud Platforms -> Display Ads Serving

•  http://linkedin.com/in/adityasakhuja

9/24/11 PyCon UK 2011

Agenda

•  Motivation

•  History

•  Ecosystem

•  Daemon processes / High Level View

•  MapReduce Data Flow

•  HDFS Architecture / Replication

•  Can / Cannot

•  Getting started yourself

•  Demo

•  Companies Involved

•  Q&A

Motivation

•  ‘Traditional’ large-scale computing systems: problems

•  Desired features in an improved system

•  How Hadoop addresses them

‘Traditional’ large-scale computing systems: problems

•  CPU intensive rather than data intensive

•  MPI, PVM, RPCs – parallel computation frameworks

•  Programming for traditional distributed systems is complex

–  Data exchange requires synchronization

–  Temporal dependencies are complicated

–  It is difficult to deal with partial failures of the system

•  Data typically stored on a SAN

•  Data brought to compute nodes at runtime

Desired Features in a Large-Scale Data System

•  Data Driven

–  A new, improved system should avoid data bottlenecks

•  Scalable

•  Consistent

•  Recoverable (Data / Processor)

•  Partial Failure Support

What Hadoop offers

•  Provides a high-level programming model

–  No need to worry about locking, temporal dependencies, sockets, ...

•  Plus the features on the desired list :) (previous slide)

History

•  Hadoop is based on work done by Google in the late 1990s/early 2000s

•  Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004

•  Hadoop MapReduce NextGeneration – 2011

–  http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

Apache Hadoop Ecosystem

•  Hadoop Common: The common utilities that support the other Hadoop subprojects.

•  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

•  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

•  Cassandra™: A scalable multi-master database with no single points of failure.

•  HBase™: A scalable, distributed database that supports structured data storage for large tables.

•  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

•  Mahout™: A scalable machine learning and data mining library.

•  Pig™: A high-level data-flow language and execution framework for parallel computation.

Source: http://hadoop.apache.org/

Hadoop Key Daemon Processes

•  NameNode

•  Secondary
NameNode

•  DataNode

•  JobTracker

•  TaskTracker

High-level Hadoop cluster view

MapReduce Data Flow

HDFS Architecture

HDFS Replication

MapReduce Program Components

•  MapReduce programs generally consist of three portions

–  The Mapper

–  The Reducer

–  The driver code

•  Additional components:

–  Combiner (often the same code as the Reducer)

–  Custom Partitioner
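
To make these five roles concrete, here is a minimal in-memory sketch in Python. It is not the Hadoop API (which is Java): the function names `mapper`, `reducer`, `combiner`, `partitioner` and `run_job` are hypothetical, word counting is an assumed example job, and the CRC32-based partitioner is an illustrative choice rather than Hadoop's own default.

```python
import zlib
from collections import defaultdict


def mapper(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(key, values):
    """Reducer: collapse all values seen for one key into a single total."""
    return key, sum(values)


# Summing is associative and commutative, so the reducer code can be
# reused unchanged as the combiner (the "often the same code" case).
combiner = reducer


def partitioner(key, num_reducers):
    """Custom partitioner (illustrative): stable hash of the key, modulo reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    """Driver: wires the pieces together, standing in for the framework."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one simulated map task per input split
        local = defaultdict(list)
        for line in split:
            for key, value in mapper(line):
                local[key].append(value)
        # Combine locally so less data would cross the network in the shuffle.
        for key, values in local.items():
            key, partial = combiner(key, values)
            partitions[partitioner(key, num_reducers)][key].append(partial)
    result = {}
    for part in partitions:  # one simulated reduce task per partition
        for key in sorted(part):
            key, total = reducer(key, part[key])
            result[key] = total
    return result


if __name__ == "__main__":
    splits = [["the elephant", "the hadoop elephant"], ["respect the elephant"]]
    print(run_job(splits))
```

Because the combiner only pre-aggregates and the partitioner only decides placement, changing `num_reducers` should not change the final counts, only which simulated reducer handles each word.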

Hadoop Is / Is Not

•  High Bandwidth, High Latency System

•  Not a substitute for a DBMS, at least not on its own

•  HDFS is not yet a Highly Available FS. NameNode is a SPOF

•  Is a “Share nothing” Architecture

–  Mappers do not talk to each other, and neither do Reducers

Getting started yourself

Requirements:

•  Java SE SDK (download JDK 6 or higher)

•  Download and Install

–  Hadoop Common: 0.20.203.X - current stable version

–  Hadoop HDFS: 0.21 – stable version

–  Hadoop MapReduce: 0.21 – stable version

•  Subscribe to mailing lists for Hadoop subprojects, depending on your role

•  Additionally/Alternatively, one can set up VMs from Cloudera / Yahoo

•  Details:

•  http://wiki.apache.org/hadoop/GettingStartedWithHadoop

•  http://developer.yahoo.com/hadoop/tutorial/module7.html#basic

Simple Demo

•  Using

–  Pig

–  Map/Reduce

Streaming Jobs

•  Any language that can read from stdin and write to stdout

•  hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
       -input myInputDirs \
       -output myOutputDir \
       -mapper myMapScript.py \
       -reducer myReduceScript.py \
       -file myMapScript.py \
       -file myReduceScript.py
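
As an illustration of what the two scripts might contain, here is a hedged word-count sketch. It assumes the usual streaming convention of one tab-separated `key<TAB>value` pair per line, with the reducer receiving its input sorted by key. For brevity both roles live in one file and are selected by a command-line argument; that switch and the names `map_lines` and `reduce_lines` are hypothetical, and on a real job each function would go in its own script (`myMapScript.py`, `myReduceScript.py`).

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper body: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer body: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: first argument picks the role, default "map".
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = map_lines if role == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

A quick local check without a cluster is to pipe a text file through the mapper, then `sort`, then the reducer, which approximates the shuffle step between the two phases.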

Companies involved

•  Yahoo - 4500-node cluster (2*4 cores, 4*1 TB disk, 16 GB RAM) – (AdServer, Search)

•  Hortonworks, Cloudera

•  Facebook

•  A9 (Amazon Product Search)

•  eBay - 532-node cluster – (8 * 532 cores, 5.3 PB)

•  Last.fm, Twitter ...

•  ... a lot more can be found at the link below:

http://wiki.apache.org/hadoop/PoweredBy

Useful Links

•  http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started

•  http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup

•  http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce

•  http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - Pig

•  http://hadoop.apache.org/common/docs/current/api/index.html - APIs

•  http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop

Q&C

Contact Information:

Aditya Sakhuja

aditya@sakhuja.us

http://twitter.com/sakhuja

http://linkedin.com/in/adityasakhuja
