Hadoop operations
- 2. What we’ll cover?
¡ Understand Hadoop in detail
¡ See how Hadoop works operationally
¡ Be able to start asking the right questions of your data
- 5. Hadoop Components
¡ HDFS
§ Hadoop Distributed File System
§ Everything sits on top of it
§ Keeps 3 copies of every block by default
¡ Hbase
¡ MapRed
¡ YARN
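As a quick illustration of that default replication, a minimal shell sketch for inspecting and overriding the replica count per file; the HDFS paths are illustrative assumptions, not from the deck.

# show each block of a file and how many replicas it has
hadoop fsck /user/alice/input -files -blocks
# lower one file's replication from the default of 3 to 2 and wait for it to apply
hadoop fs -setrep -w 2 /user/alice/input/access.log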
- 6. Hadoop Components
¡ HDFS
¡ Hbase
§ Hadoop’s schemaless database
§ Key-value store
§ Sits on top of HDFS
¡ MapRed
¡ YARN
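A minimal sketch of that key-value model from the hbase shell; the table, column family and row names are illustrative assumptions.

# create a table with one column family, write one cell, read it back
hbase shell <<'EOF'
create 'weblogs', 'cf'
put 'weblogs', 'row1', 'cf:url', '/index.html'
get 'weblogs', 'row1'
EOF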
- 7. Hadoop Components
¡ HDFS
¡ Hbase
¡ MapRed
§ Hadoop Map/Reduce
§ Non-pluggable, archaic
§ Requires HDFS for temp storage
¡ YARN
- 8. Hadoop Components
¡ HDFS
¡ Hbase
¡ MapRed
¡ YARN
§ Hadoop Map/Reduce version 2.0
§ Pluggable, you can add your own frameworks
§ Fast and much less memory-hungry
- 9. Hadoop Component Breakdown
¡ All these components divide into client/server and master/slave scenarios
¡ We will now go through each individual component’s breakdown
- 10. Hadoop Components Breakdown
¡ HDFS
§ Master Namenode
▪ Keeps track of all file allocation on Datanodes
▪ Rebalances data if one of the datanodes goes down
▪ Is rack aware
§ Secondary Namenode
▪ Does cleanup services for the namenode
▪ Not necessarily two different servers
§ Datanode
▪ Stores the data
▪ Best run without RAID, for extra I/O speed
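To see that master/slave split from the command line, two era-appropriate read-only checks; hostnames and output are illustrative.

# the namenode's view of its datanodes: live/dead nodes and capacity
hadoop dfsadmin -report
# filesystem health, including the rack each replica landed on
hadoop fsck / -racks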
- 11. Hadoop Components Breakdown
¡ HDFS
§ How to access
▪ Clients connect with the hadoop client to hdfs://namenode:8020
▪ Supports all basic Unix commands
§ Configuration files (see the sketch after this slide)
▪ /etc/hadoop/conf/core-site.xml
▪ Defines major configuration, such as the hdfs namenode address and default parameters
▪ /etc/hadoop/conf/hdfs-site.xml
▪ Defines configuration specific to the namenode or datanode, such as file locations
▪ /etc/hadoop/conf/slaves
▪ Defines the list of servers that are available in this cluster
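A hedged sketch of both halves of that: the one core-site.xml property that points clients at the namenode, then everyday Unix-style access; the hostname and paths are illustrative assumptions.

# point every hadoop client at the namenode (era-appropriate property name)
cat <<'EOF' > /etc/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
EOF
# the usual Unix-flavoured commands then just work
hadoop fs -ls /
hadoop fs -put access.log /user/alice/
hadoop fs -cat /user/alice/access.log | head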
- 12. Hadoop Components Breakdown
¡ Hbase
§ Master
▪ Controls the Hbase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
§ Regionserver
▪ Hbase node; stores some of the information in one of the regions, roughly the equivalent of sharding
§ Thrift / REST
▪ Interfaces for connecting to HBase
- 13. Hadoop Components Breakdown
¡ Hbase
§ How to access
▪ Through the Hbase client (using Thrift)
▪ Through the RESTful API
§ Configuration files
▪ /etc/hbase/conf/hbase-site.xml
▪ Defines all the basic configuration for accessing hbase
▪ /etc/hbase/conf/hbase-policy.xml
▪ Defines all the security (ACL) and all the hbase memory tweaks
▪ /etc/hbase/conf/regionservers
▪ Lists all the regionservers available to this cluster
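A minimal sketch of the REST route; the gateway host, its customary 8080 port, and the table and row names are all assumptions.

# ask the REST gateway for its version, then fetch one row as JSON
curl http://hbase-rest:8080/version
curl -H 'Accept: application/json' http://hbase-rest:8080/weblogs/row1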
- 14. Hadoop Components Breakdown
¡ MapRed
§ JobTracker
▪ Creates the Map/Reduce jobs
▪ Stores all the intermediate data
▪ Keeps track of all the previous results through the HistoryServer
§ TaskTracker
▪ Executes tasks related to the Map/Reduce job
▪ Very CPU and memory intensive
▪ Stores intermediate results, which are then pushed to the JobTracker
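Two era-appropriate client calls for watching that machinery; the output path is an illustrative assumption.

# jobs the JobTracker currently knows about
hadoop job -list
# what the HistoryServer kept for a finished job's output directory
hadoop job -history /user/alice/output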
- 15. Hadoop Components Breakdown
¡ MapRed
§ How to access
▪ Through the Hadoop client
▪ Through any MapRed client, such as Pig or Hive
▪ Through your own Java code
§ Configuration files
▪ /etc/hadoop/conf/mapred-site.xml
▪ Defines how to contact this MapRed cluster
▪ /etc/hadoop/conf/mapred-queue-acls.xml
▪ Defines the ACL structure for accessing MapRed; normally not necessary
▪ /etc/hadoop/conf/slaves
▪ Defines the list of TaskTrackers in this cluster
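One hedged sketch per access path named above; the jar location, script name and table name are illustrative assumptions.

# the hadoop client, running the bundled wordcount example
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/alice/input /user/alice/output
# Pig compiles the script into Map/Reduce jobs for you
pig -f wordcount.pig
# Hive does the same for SQL-like queries
hive -e 'SELECT COUNT(*) FROM weblogs;'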
- 16. Hadoop Components Breakdown
¡ YARN
§ Same structure as MapRed (MapRed v2 jobs run on top of it)
§ Configuration files
▪ /etc/hadoop/conf/yarn-site.xml
▪ All required configuration for YARN
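A minimal, assumed sketch of what yarn-site.xml might carry; the hostname is illustrative, and the shuffle-service value changed name across early 2.x releases.

cat <<'EOF' > /etc/hadoop/conf/yarn-site.xml
<configuration>
  <property>
    <!-- where the ResourceManager runs -->
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode</value>
  </property>
  <property>
    <!-- auxiliary shuffle service that MapRed v2 needs -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF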
- 17. Hadoop Cluster Breakdown
¡ Namenode Server
§ HDFS Namenode
§ Hbase Master
¡ Secondary Namenode Server
§ HDFS Secondary Namenode
¡ JobTracker Server
§ MapRed JobTracker
§ MapRed History Server
- 19. Hadoop Hardware Requirements
¡ Namenode Server
§ Redundant power supplies
§ RAID1 drives
§ Enough memory (16 GB)
¡ Secondary Namenode Server
§ Almost none
- 20. Hadoop Hardware Requirements
¡ Jobtracker Server
§ Redundant power supplies
§ RAID1 drives
§ Enough memory (16 GB)
¡ Datanode Server
§ Lots of cheap disk (no RAID)
§ Lots of memory (32 GB)
§ Lots of CPU
- 21. Hadoop Default Ports
¡ HDFS
§ 8020: HDFS Namenode
§ 50010: HDFS Datanode FS transfer
¡ MapRed
§ No defaults
¡ Hbase
§ 60010: Master (web UI)
§ 60020: Regionserver
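A quick liveness sketch against those defaults; all hostnames are illustrative assumptions.

# check that each daemon is listening on its default port
nc -z namenode 8020 && echo 'namenode RPC up'
nc -z datanode1 50010 && echo 'datanode transfer port up'
nc -z hbase-master 60010 && echo 'hbase master UI up'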
- 25. Flume
¡ Transports streams of data from point A to point B
¡ Source
§ Where the data is read from
¡ Channel
§ How the data is buffered
¡ Sink
§ Where the data is written
- 26. Flume
¡ Flume is fault tolerant
¡ Sources keep a pointer to their position
§ With some exceptions, but most sources are in a known state
¡ Channels can be fault tolerant
§ A channel written to disk can recover from where it left off
¡ Sinks can be redundant
§ More than one sink for the same data
§ Data is serialised and deduplicated using Avro
- 28. Flume
¡ Configuration files
§ /etc/flume-ng/conf/flume.conf
▪ Defines the agent configuration with source, channel and sink (see the sketch below)
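A hedged single-agent sketch wiring the three pieces together; the agent name, the tailed log file and the HDFS path are illustrative assumptions.

cat <<'EOF' > /etc/flume-ng/conf/flume.conf
# one agent, one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# source: tail a log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
# channel: disk-backed, so it recovers from where it left off
a1.channels.c1.type = file
# sink: land the events in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
# wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOF
flume-ng agent --name a1 --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf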
- 31. Hadoop References
¡ Hadoop
§ http://hadoop.apache.org/docs/stable/cluster_setup.html
§ http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
§ http://pig.apache.org/docs/r0.7.0/setup.html
§ http://wiki.apache.org/hadoop/NameNodeFailover
¡ Hbase
§ http://hbase.apache.org/book/book.html
¡ Flume
§ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html