Hadoop operations
- 2. What we’ll cover?
¡ Understand Hadoop in detail
¡ See how Hadoop works operationally
¡ Be able to start asking the right questions of your data
- 5. Hadoop Components
¡ HDFS
§ Hadoop Distributed File System
§ Everything sits on top of it
§ Keeps 3 copies of every block by default
¡ Hbase
¡ MapRed
¡ YARN
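As a quick illustration of that default replication, a minimal shell sketch for inspecting and overriding the replica count per file; the HDFS paths are illustrative assumptions, not from the deck.

# show each block of a file and how many replicas it has
hadoop fsck /user/alice/input -files -blocks
# lower one file's replication from the default of 3 to 2 and wait for it to apply
hadoop fs -setrep -w 2 /user/alice/input/access.log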
- 6. Hadoop Components
¡ HDFS
¡ Hbase
§ Hadoop’s schemaless database
§ Key-value store
§ Sits on top of HDFS
¡ MapRed
¡ YARN
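A minimal sketch of that key-value model from the hbase shell; the table, column family and row names are illustrative assumptions.

# create a table with one column family, write one cell, read it back
hbase shell <<'EOF'
create 'weblogs', 'cf'
put 'weblogs', 'row1', 'cf:url', '/index.html'
get 'weblogs', 'row1'
EOF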
- 7. Hadoop Components
¡ HDFS
¡ Hbase
¡ MapRed
§ Hadoop Map/Reduce
§ Non-pluggable, archaic
§ Requires HDFS for temp storage
¡ YARN
- 8. Hadoop Components
¡ HDFS
¡ Hbase
¡ MapRed
¡ YARN
§ Hadoop Map/Reduce version 2.0
§ Pluggable, you can add your own frameworks
§ Fast and much less memory-hungry
- 9. Hadoop Component Breakdown
¡ All these components divide into client/server and master/slave scenarios
¡ We will now go through each individual component’s breakdown
- 10. Hadoop Components Breakdown
¡ HDFS
§ Master Namenode
▪ Keeps track of all file allocation on Datanodes
▪ Rebalances data if one of the datanodes goes down
▪ Is rack aware
§ Secondary Namenode
▪ Does cleanup services for the namenode
▪ Not necessarily two different servers
§ Datanode
▪ Stores the data
▪ Best run without RAID, for extra I/O speed
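To see that master/slave split from the command line, two era-appropriate read-only checks; hostnames and output are illustrative.

# the namenode's view of its datanodes: live/dead nodes and capacity
hadoop dfsadmin -report
# filesystem health, including the rack each replica landed on
hadoop fsck / -racks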
- 11. Hadoop Components Breakdown
¡ HDFS
§ How to access
▪ Clients connect with the hadoop client to hdfs://namenode:8020
▪ Supports all basic Unix commands
§ Configuration files (see the sketch after this slide)
▪ /etc/hadoop/conf/core-site.xml
▪ Defines major configuration, such as the hdfs namenode address and default parameters
▪ /etc/hadoop/conf/hdfs-site.xml
▪ Defines configuration specific to the namenode or datanode, such as file locations
▪ /etc/hadoop/conf/slaves
▪ Defines the list of servers that are available in this cluster
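A hedged sketch of both halves of that: the one core-site.xml property that points clients at the namenode, then everyday Unix-style access; the hostname and paths are illustrative assumptions.

# point every hadoop client at the namenode (era-appropriate property name)
cat <<'EOF' > /etc/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
EOF
# the usual Unix-flavoured commands then just work
hadoop fs -ls /
hadoop fs -put access.log /user/alice/
hadoop fs -cat /user/alice/access.log | head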
- 12. Hadoop Components Breakdown
¡ Hbase
§ Master
▪ Controls the Hbase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
§ Regionserver
▪ Hbase node; stores some of the information in one of the regions, roughly the equivalent of sharding
§ Thrift / REST
▪ Interfaces for connecting to HBase
- 13. Hadoop Components Breakdown
¡ Hbase
§ How to access
▪ Through the Hbase client (using Thrift)
▪ Through the RESTful API
§ Configuration files
▪ /etc/hbase/conf/hbase-site.xml
▪ Defines all the basic configuration for accessing hbase
▪ /etc/hbase/conf/hbase-policy.xml
▪ Defines all the security (ACL) and all the hbase memory tweaks
▪ /etc/hbase/conf/regionservers
▪ Lists all the regionservers available to this cluster
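A minimal sketch of the REST route; the gateway host, its customary 8080 port, and the table and row names are all assumptions.

# ask the REST gateway for its version, then fetch one row as JSON
curl http://hbase-rest:8080/version
curl -H 'Accept: application/json' http://hbase-rest:8080/weblogs/row1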
- 14. Hadoop Components Breakdown
¡ MapRed
§ JobTracker
▪ Creates the Map/Reduce jobs
▪ Stores all the intermediate data
▪ Keeps track of all the previous results through the HistoryServer
§ TaskTracker
▪ Executes tasks related to the Map/Reduce job
▪ Very CPU and memory intensive
▪ Stores intermediate results, which are then pushed to the JobTracker
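Two era-appropriate client calls for watching that machinery; the output path is an illustrative assumption.

# jobs the JobTracker currently knows about
hadoop job -list
# what the HistoryServer kept for a finished job's output directory
hadoop job -history /user/alice/output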
- 15. Hadoop Components Breakdown
¡ MapRed
§ How to access
▪ Through the Hadoop client
▪ Through any MapRed client, such as Pig or Hive
▪ Through your own Java code
§ Configuration files
▪ /etc/hadoop/conf/mapred-site.xml
▪ Defines how to contact this MapRed cluster
▪ /etc/hadoop/conf/mapred-queue-acls.xml
▪ Defines the ACL structure for accessing MapRed; normally not necessary
▪ /etc/hadoop/conf/slaves
▪ Defines the list of TaskTrackers in this cluster
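One hedged sketch per access path named above; the jar location, script name and table name are illustrative assumptions.

# the hadoop client, running the bundled wordcount example
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/alice/input /user/alice/output
# Pig compiles the script into Map/Reduce jobs for you
pig -f wordcount.pig
# Hive does the same for SQL-like queries
hive -e 'SELECT COUNT(*) FROM weblogs;'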
- 16. Hadoop Components Breakdown
¡ YARN
§ Same structure as MapRed (MapRed v2 jobs run on top of it)
§ Configuration files
▪ /etc/hadoop/conf/yarn-site.xml
▪ All required configuration for YARN
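A minimal, assumed sketch of what yarn-site.xml might carry; the hostname is illustrative, and the shuffle-service value changed name across early 2.x releases.

cat <<'EOF' > /etc/hadoop/conf/yarn-site.xml
<configuration>
  <property>
    <!-- where the ResourceManager runs -->
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode</value>
  </property>
  <property>
    <!-- auxiliary shuffle service that MapRed v2 needs -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF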
- 17. Hadoop Cluster Breakdown
¡ Namenode Server
§ HDFS Namenode
§ Hbase Master
¡ Secondary Namenode Server
§ HDFS Secondary Namenode
¡ JobTracker Server
§ MapRed JobTracker
§ MapRed History Server
- 19. Hadoop Hardware Requirements
¡ Namenode Server
§ Redundant power supplies
§ RAID1 drives
§ Enough memory (16 GB)
¡ Secondary Namenode Server
§ Almost none
- 20. Hadoop Hardware Requirements
¡ Jobtracker Server
§ Redundant power supplies
§ RAID1 drives
§ Enough memory (16 GB)
¡ Datanode Server
§ Lots of cheap disk (no RAID)
§ Lots of memory (32 GB)
§ Lots of CPU
- 21. Hadoop Default Ports
¡ HDFS
§ 8020: HDFS Namenode
§ 50010: HDFS Datanode FS transfer
¡ MapRed
§ No defaults
¡ Hbase
§ 60010: Master (web UI)
§ 60020: Regionserver
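A quick liveness sketch against those defaults; all hostnames are illustrative assumptions.

# check that each daemon is listening on its default port
nc -z namenode 8020 && echo 'namenode RPC up'
nc -z datanode1 50010 && echo 'datanode transfer port up'
nc -z hbase-master 60010 && echo 'hbase master UI up'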
- 25. Flume
¡ Transports streams of data from point A to point B
¡ Source
§ Where the data is read from
¡ Channel
§ How the data is buffered
¡ Sink
§ Where the data is written
- 26. Flume
¡ Flume is fault tolerant
¡ Sources keep a pointer to their position
§ With some exceptions, but most sources are in a known state
¡ Channels can be fault tolerant
§ A channel written to disk can recover from where it left off
¡ Sinks can be redundant
§ More than one sink for the same data
§ Data is serialised and deduplicated using Avro
- 28. Flume
¡ Configuration files
§ /etc/flume-ng/conf/flume.conf
▪ Defines the agent configuration with source, channel and sink (see the sketch below)
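A hedged single-agent sketch wiring the three pieces together; the agent name, the tailed log file and the HDFS path are illustrative assumptions.

cat <<'EOF' > /etc/flume-ng/conf/flume.conf
# one agent, one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: tail a log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
# channel: disk-backed, so it recovers from where it left off
a1.channels.c1.type = file
# sink: land the events in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
# wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
flume-ng agent --name a1 --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf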
- 31. Hadoop References
¡ Hadoop
§ http://hadoop.apache.org/docs/stable/cluster_setup.html
§ http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
§ http://pig.apache.org/docs/r0.7.0/setup.html
§ http://wiki.apache.org/hadoop/NameNodeFailover
¡ Hbase
§ http://hbase.apache.org/book/book.html
¡ Flume
§ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html