Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
- 1. © 2010 – 2015 Cloudera, Inc. All Rights Reserved
Introduction to Apache Hadoop and its Ecosystem
Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV
github.com/markgrover/hadoop-intro-fast
- 2. About Me
• Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
• Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume.
• Software developer at Cloudera
• @mark_grover
• www.linkedin.com/in/grovermark
- 3. Co-author of an O'Reilly book
• @hadooparchbook
• hadooparchitecturebook.com
• To be released early 2015
- 4. About the Presentation…
• What's ahead
  • Fundamental Concepts
  • HDFS: The Hadoop Distributed File System
  • Data Processing with MapReduce
  • Demo
  • Conclusion + Q&A
- 5. Fundamental Concepts
Why the World Needs Hadoop
- 6. What's the craze about Hadoop?
• Volume
  • More and more data being generated
  • Machine-generated data increasing
• Velocity
  • Data coming in at higher speed
• Variety
  • Audio, video, images, log files, web pages, social network connections, etc.
- 7. We Need a System that Scales
• Too much data for traditional tools
• Two key problems
  • How to reliably store this data at a reasonable cost
  • How to process all the data we've stored
- 8. What is Apache Hadoop?
• Scalable data storage and processing
• Distributed and fault-tolerant
• Runs on standard hardware
• Two main components
  • Storage: Hadoop Distributed File System (HDFS)
  • Processing: MapReduce
• Hadoop clusters are composed of computers called nodes
  • Clusters range from a single node up to several thousand nodes
- 9. How Did Apache Hadoop Originate?
• Heavily influenced by Google's architecture
  • Notably, the Google Filesystem and MapReduce papers
• Other Web companies quickly saw the benefits
  • Early adoption by Yahoo, Facebook and others

Timeline:
  2002: Nutch spun off from Lucene
  2003: Google publishes GFS paper
  2004: Google publishes MapReduce paper
  2005: Nutch rewritten for MapReduce
  2006: Hadoop becomes Lucene subproject
- 10. Comparing Hadoop to Other Systems
• Monolithic systems don't scale
• Modern high-performance computing systems are distributed
  • They spread computations across many machines in parallel
  • Widely used for scientific applications
• Let's examine how a typical HPC system works
- 11. Architecture of a Typical HPC System
[diagram: compute nodes connected to a storage system over a fast network]

- 12. Architecture of a Typical HPC System
Step 1: Copy input data from the storage system to the compute nodes

- 13. Architecture of a Typical HPC System
Step 2: Process the data on the compute nodes

- 14. Architecture of a Typical HPC System
Step 3: Copy output data back to the storage system
- 15. You Don't Just Need Speed…
• The problem is that we have way more data than code

$ du -ks code/
1,087
$ du -ks data/
854,632,947,314
- 16. You Need Speed At Scale
[diagram: the fast network between the compute nodes and the storage system becomes the bottleneck]
- 17. Hadoop Design Fundamental: Data Locality
• This is a hallmark of Hadoop's design
  • Don't bring the data to the computation
  • Bring the computation to the data
• Hadoop uses the same machines for storage and processing
  • Significantly reduces need to transfer data across the network
- 18. Other Hadoop Design Fundamentals
• Machine failure is unavoidable – embrace it
  • Build reliability into the system
• "More" is usually better than "faster"
  • Throughput matters more than latency
- 19. The Hadoop Distributed Filesystem
HDFS
- 20. HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Reliable, low-cost storage for massive amounts of data
• Similar to a UNIX filesystem in some ways
  • Hierarchical
  • UNIX-style paths (e.g., /sales/alice.txt)
  • UNIX-style file ownership and permissions
- 21. HDFS: Hadoop Distributed File System
• There are also some major deviations from UNIX filesystems
• Highly optimized for processing data with MapReduce
  • Designed for sequential access to large files
  • Cannot modify file content once written
• It's actually a user-space Java process
  • Accessed using special commands or APIs
  • No concept of a current working directory
- 22. Copying Local Data To and From HDFS
• Remember that HDFS is distinct from your local filesystem
• hadoop fs -put copies local files to HDFS
• hadoop fs -get fetches a local copy of a file from HDFS

From the client machine to the Hadoop cluster, and back:

$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt
- 23. HDFS Demo
• I will now demonstrate the following
  1. How to list the contents of a directory
  2. How to create a directory in HDFS
  3. How to copy a local file to HDFS
  4. How to display the contents of a file in HDFS
  5. How to remove a file from HDFS
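The demo itself isn't captured in the transcript, but the five steps map directly onto standard `hadoop fs` subcommands. A sketch of the session, assuming a hypothetical /demo directory and input.txt file:

```shell
# 1. List the contents of a directory (here, the HDFS root)
hadoop fs -ls /

# 2. Create a directory in HDFS
hadoop fs -mkdir /demo

# 3. Copy a local file to HDFS
hadoop fs -put input.txt /demo/

# 4. Display the contents of a file in HDFS
hadoop fs -cat /demo/input.txt

# 5. Remove a file from HDFS
hadoop fs -rm /demo/input.txt
```

Note that these commands talk to the cluster's NameNode, so they require a running HDFS installation.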
- 24. A Scalable Data Processing Framework
Data Processing with MapReduce
- 25. What is MapReduce?
• MapReduce is a programming model
  • It's a way of processing data
• You can implement MapReduce in any language
- 26. Understanding Map and Reduce
• You supply two functions to process data: Map and Reduce
  • Map: typically used to transform, parse, or filter data
  • Reduce: typically used to summarize results
• The Map function always runs first
  • The Reduce function runs afterwards, but is optional
• Each piece is simple, but can be powerful when combined
- 27. MapReduce Benefits
• Scalability
  • Hadoop divides the processing job into individual tasks
  • Tasks execute in parallel (independently) across the cluster
• Simplicity
  • Processes one record at a time
• Ease of use
  • Hadoop provides job scheduling and other infrastructure
  • Far simpler for developers than typical distributed computing
- 28. MapReduce in Hadoop
• MapReduce processing in Hadoop is batch-oriented
• A MapReduce job is broken down into smaller tasks
  • Tasks run concurrently
  • Each processes a small amount of the overall input
• MapReduce code for Hadoop is usually written in Java
  • This uses Hadoop's API directly
• You can do basic MapReduce in other languages
  • Using the Hadoop Streaming wrapper program
  • Some advanced features require Java code
- 29. MapReduce Example in Python
• The following example uses Python
  • Via Hadoop Streaming
• It processes log files and summarizes events by type
• I'll explain both the data flow and the code
- 30. Job Input
• Here's the job input
• Each map task gets a chunk of this data to process
  • Typically corresponds to a single block in HDFS
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
- 31. Python Code for Map Function

#!/usr/bin/env python
import sys

# Define list of known log levels
levels = ['TRACE', 'DEBUG', 'INFO',
          'WARN', 'ERROR', 'FATAL']

# Read records from standard input
for line in sys.stdin:
    # Use whitespace to split into fields
    fields = line.split()
    # Extract "level" field and convert to uppercase for consistency
    level = fields[3].upper()
    # If it matches a known level, print it, a tab separator, and the
    # literal value 1 (since the level can only occur once per line)
    if level in levels:
        print "%s\t1" % level
- 32. Output of Map Function
• The map function produces key/value pairs as output
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
- 33. The "Shuffle and Sort"
• Hadoop automatically merges, sorts, and groups map output
• The result is passed as input to the reduce function
• More on this later…

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

After the shuffle and sort, reduce input:
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
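When testing Streaming scripts locally, the shuffle and sort can be imitated with an ordinary sort, since grouping identical keys is just a matter of ordering the map output lines. A minimal Python 3 sketch of that idea:

```python
# The map output from the previous slide, as tab-separated lines.
map_output = [
    "INFO\t1", "INFO\t1", "WARN\t1", "INFO\t1",
    "WARN\t1", "INFO\t1", "ERROR\t1",
]

# Sorting the lines groups all records with the same key together,
# which is exactly the guarantee the reduce function relies on.
reduce_input = sorted(map_output)
for line in reduce_input:
    print(line)
```

This is why `cat input | ./map.py | sort | ./reduce.py` is a faithful local stand-in for a Streaming job.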
- 34. Input to Reduce Function
• Reduce function receives a key and all values for that key
  • Keys are always passed to reducers in sorted order
  • Although not obvious here, values are unordered
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
- 35. Python Code for Reduce Function

#!/usr/bin/env python
import sys

# Initialize loop variables
previous_key = None
sum = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    # If key unchanged, increment the count
    if key == previous_key:
        sum = sum + int(value)
# continued on next slide
- 36. Python Code for Reduce Function

# continued from previous slide
    else:
        # If key changed, print data for the old level
        if previous_key:
            print '%s\t%i' % (previous_key, sum)
        # Start tracking data for the new record
        previous_key = key
        sum = 1

# Print data for the final key
print '%s\t%i' % (previous_key, sum)
- 37. Output of Reduce Function
• Its output is a sum for each level
ERROR 1
INFO 4
WARN 2
- 38. Recap of Data Flow

Map input:
2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

(shuffle and sort)

Reduce input:
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Reduce output:
ERROR 1
INFO 4
WARN 2
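The whole recap can be reproduced in one self-contained Python 3 script. This is a local simulation of the data flow above, not how Hadoop itself executes a job:

```python
# End-to-end local simulation: map -> shuffle/sort -> reduce.
log_lines = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]

levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

# Map phase: emit one (level, 1) pair per log line.
map_output = []
for line in log_lines:
    level = line.split()[3].upper()
    if level in levels:
        map_output.append((level, 1))

# Shuffle and sort: group pairs for the same key together.
map_output.sort()

# Reduce phase: sum the values seen for each key.
counts = {}
for key, value in map_output:
    counts[key] = counts.get(key, 0) + value

for key in sorted(counts):
    print("%s\t%i" % (key, counts[key]))
# Prints:
# ERROR   1
# INFO    4
# WARN    2
```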
- 39. How to Run a Hadoop Streaming Job
• I'll demonstrate this now…
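The submission command isn't on the slide; a typical invocation looks roughly like the following, where the streaming jar location and the HDFS input/output paths are placeholders that vary by installation:

```shell
# Sanity-check the scripts locally first; this pipeline stands in
# for map -> shuffle/sort -> reduce.
cat events.log | ./map.py | sort | ./reduce.py

# Submit the job to the cluster via Hadoop Streaming.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /logs/events \
    -output /logs/level-counts \
    -mapper map.py \
    -reducer reduce.py \
    -file map.py \
    -file reduce.py
```

The -file options ship the two scripts to every node in the cluster so the tasks can execute them.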
- 40. Open Source Tools that Complement Hadoop
The Hadoop Ecosystem
- 41. The Hadoop Ecosystem
• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform
• Hadoop has many related projects
  • Some help you integrate Hadoop with other systems
  • Others help you analyze your data
• These are not considered "core Hadoop"
  • Rather, they're part of the Hadoop ecosystem
  • Many are also open source Apache projects
- 42. Visual Overview of a Complete Workflow
[diagram: a Hadoop cluster with Impala at the center of the following workflow]
• Import transaction data from RDBMS
• Sessionize Web log data with Pig
• Sentiment analysis on social media with Hive
• Analyst uses Impala for business intelligence
• Generate nightly reports using Pig, Hive, or Impala
• Build product recommendations for Web site
- 43. Key Points
• We're generating massive volumes of data
  • This data can be extremely valuable
  • Companies can now analyze what they previously discarded
• Hadoop supports large-scale data storage and processing
  • Heavily influenced by Google's architecture
  • Already in production by thousands of organizations
• HDFS is Hadoop's storage layer
• MapReduce is Hadoop's processing framework
• Many ecosystem projects complement Hadoop
  • Some help you to integrate Hadoop with existing systems
  • Others help you analyze the data you've stored
- 44. Highly Recommended Books
• Hadoop: The Definitive Guide, by Tom White (ISBN: 1-449-31152-0)
• Hadoop Operations, by Eric Sammer (ISBN: 1-449-32705-2)
- 45. Questions?
• Thank you for attending!
• I'll be happy to answer any additional questions now…
• Demo and slides at github.com/markgrover/hadoop-intro-fast
• Twitter: mark_grover
• Survey page: tiny.cloudera.com/mark