5. Hadoop as an Architecture

The Old Way: $30,000+ per TB. Expensive & Unattainable
• Hard to scale
• Network is a bottleneck
• Only handles relational data
• Difficult to add new fields & data types
Built on expensive, special-purpose, "reliable" servers; expensive licensed software; networked data storage (SAN, NAS); and separate compute (RDBMS, EDW).

The Hadoop Way: $300-$1,000 per TB. Affordable & Attainable
• Scales out forever
• No bottlenecks
• Easy to ingest any data
• Agile data access
Built on commodity "unreliable" servers running hybrid open source software, with compute (CPU), memory, and storage (disk) co-located on each node.
6. CDH: the App Store for Hadoop

[Platform diagram: integration, storage, resource management, and metadata layers; components including a NoSQL DBMS, an analytic MPP DBMS, a search engine, in-memory and batch processing (MapReduce), and machine learning; surrounded by system management, data management, support, and security.]
8. Can we improve on MR?
• Problems with MR:
  • Very low-level: requires a lot of code to do simple things
  • Very constrained: everything must be described as "map" and "reduce". Powerful, but sometimes difficult to think in these terms.
9. Can we improve on MR?
• Two approaches to improving on MapReduce:
  1. Special-purpose systems that solve one problem domain well.
     • Giraph / GraphLab (graph processing)
     • Storm (stream processing)
  2. Generalize the capabilities of MapReduce to provide a richer foundation for solving problems.
     • Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!
10. What is Apache Spark?
Spark is a general-purpose computational framework.
It retains the advantages of MapReduce:
• Linear scalability
• Fault tolerance
• Data-locality-based computations
…but offers much more:
• Leverages distributed memory for better performance
• Supports iterative algorithms that are not feasible in MR
• Improved developer experience
• Full directed-graph expressions for data-parallel computations
• Comes with libraries for machine learning, graph analysis, etc.
11. Getting started with Spark
• Java API
• Interactive shells:
  • Scala (spark-shell)
  • Python (pyspark)
12. Execution modes
• Standalone Mode
  • Dedicated master and worker daemons: a dedicated Spark runtime with static resource limits
• YARN Client Mode
  • Launches a YARN application with the driver program running locally
• YARN Cluster Mode
  • Launches a YARN application with the driver program running in the YARN ApplicationMaster
The YARN modes allow dynamic resource management between Spark, MR, Impala, and other frameworks.
14. Parallelized Collections

scala> val data = 1 to 5
data: Range.Inclusive = Range(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]

Now I can apply parallel operations to this collection:

scala> distData.reduce(_ + _)
[… Adding task set 0.0 with 56 tasks …]
res0: Int = 15

What just happened?!
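What happened is a distributed reduction. As a conceptual sketch in plain Python (no cluster, and with a hypothetical three-way split standing in for Spark's partitioning), reduce applies an associative function within each partition and then combines the partial results:

```python
from functools import reduce

# Minimal sketch of what distData.reduce(_ + _) does: each partition is
# reduced locally (on its worker), then the partial results are combined
# with the same associative function on the driver.
data = list(range(1, 6))                      # mirrors `val data = 1 to 5`
partitions = [data[:2], data[2:4], data[4:]]  # hypothetical split across workers
add = lambda a, b: a + b
partials = [reduce(add, p) for p in partitions]  # local reductions (parallel on a cluster)
total = reduce(add, partials)                    # driver combines the partials
print(total)  # 15
```

Because addition is associative, the split into partitions does not change the answer, which is why Spark can parallelize it freely.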
15. RDD - Resilient Distributed Dataset
• Collections of objects partitioned across a cluster
  • Stored in RAM or on disk
  • You can control persistence and partitioning
• Created by:
  • Distributing local collection objects
  • Transforming data in storage
  • Transforming other RDDs
• Automatically rebuilt on failure (resilient)
  • Contains the lineage needed to recompute from storage
• Lazy materialization
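Lazy materialization can be sketched in a few lines of toy Python (this is an illustration, not Spark's API): transformations only record a plan, and nothing executes until an action like collect() replays the recorded lineage.

```python
# Toy sketch of lazy materialization. Transformations return a new plan
# without touching the data; the action (collect) materializes the result
# by replaying the lineage from the source.
class ToyRDD:
    def __init__(self, source, ops=()):
        self.source = source      # lineage root: where the data comes from
        self.ops = ops            # recorded transformations, not yet applied

    def map(self, f):             # transformation: records work, does none
        return ToyRDD(self.source, self.ops + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.source, self.ops + (("filter", f),))

    def collect(self):            # action: replays the lineage
        data = list(self.source)
        for kind, f in self.ops:
            data = [f(x) for x in data] if kind == "map" else [x for x in data if f(x)]
        return data

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 2)
print(rdd.collect())  # [4, 6, 8]
```

Keeping the plan (rather than intermediate results) is also what makes RDDs resilient: the same replay rebuilds a lost partition.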
20. Word Count in MapReduce

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
21. Word Count in Spark

sc.textFile("words")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
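To see what each stage of that pipeline contributes, here is a plain-Python sketch of the same three steps (the input lines are a made-up stand-in for the "words" file): flatMap splits lines into words, map emits (word, 1) pairs, and reduceByKey sums the counts per word.

```python
from collections import defaultdict

lines = ["to be or", "not to be"]   # stand-in for sc.textFile("words")

# flatMap: one line in, many words out
words = [w for line in lines for w in line.split(" ")]

# map: each word becomes a (word, 1) pair
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the values for each key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On a cluster, Spark performs the reduceByKey step distributed by key, but the logic per key is exactly this running sum.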
22. Logistic Regression
• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
  • Start with a random W
  • On each iteration, sum a function of W over the data
  • Move W in a direction that improves it
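The loop above can be sketched on a single machine; the toy 1-D data, learning rate, and iteration count here are illustrative assumptions, not Spark code. In Spark, the inner per-point sum would be a distributed reduce over the dataset; everything else stays on the driver.

```python
import math
import random

# Gradient descent for logistic regression, as described above.
# Toy data: positive x labeled +1, negative x labeled -1 (hypothetical).
random.seed(0)
points = [((x,), 1 if x > 0 else -1)
          for x in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]]
w = [random.random()]                       # start with a random W

for _ in range(100):
    grad = [0.0]
    for x, y in points:                     # sum a function of W over the data
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        coeff = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        grad[0] += coeff * x[0]
    w[0] -= 0.1 * grad[0]                   # move W in an improving direction

# sign(w . x) now recovers the labels of both point sets
print(w[0] > 0)  # True
```

Because each iteration needs the full pass over the data, this is exactly the iterative workload where Spark's in-memory caching beats MapReduce: the points are read once and reused every iteration.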
30. [Platform diagram: integration, storage, resource management, and metadata layers; components including HBase, Impala, Solr, Spark, and MapReduce; surrounded by system management, data management, support, and security.]
31. Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
  • Fault-tolerant like RDDs
  • Transformable like RDDs
• Adds new "rolling window" operations
  • Rolling averages, etc.
• But keeps everything else!
  • Regular Spark code works in Spark Streaming
  • Can still access HDFS data, etc.
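A "rolling window" operation is easy to picture with a plain-Python sketch (the batch values and window length are made up for illustration): each arriving batch enters the window, the oldest falls out, and an aggregate is computed over whatever the window currently holds.

```python
from collections import deque

# Rolling average over the last 3 batches, as a single-machine sketch.
window = deque(maxlen=3)                   # hypothetical window of 3 batches
rolling_averages = []
for batch_value in [10, 20, 30, 40, 50]:   # stand-in for a stream of batches
    window.append(batch_value)             # newest enters, oldest is evicted
    rolling_averages.append(sum(window) / len(window))

print(rolling_averages)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

In Spark Streaming the "batches" are RDDs arriving on a schedule and the windowed aggregate is itself computed distributedly, but the sliding-window shape is the same.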
34. Fault Recovery
• RDDs store their dependency graph
• Because RDDs are deterministic, missing RDDs can be rebuilt in parallel on other nodes
• Stateful RDDs can have infinite lineage
  • Periodic checkpoints to disk clear the lineage
  • Faster recovery times
  • Better handling of stragglers vs. row-by-row streaming
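Lineage-based recovery can be sketched in toy Python (names here are illustrative, not Spark's API): a lost partition is not restored from a replica, it is recomputed by replaying the recorded transformations against the source, and only the missing pieces are rebuilt.

```python
# Toy sketch of rebuilding a lost partition from its lineage.
source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}     # partitioned input data
lineage = [lambda x: x * 10, lambda x: x + 1]  # recorded transformations

def compute(partition_id):
    """Recompute one partition by replaying the (deterministic) lineage."""
    data = source[partition_id]
    for f in lineage:
        data = [f(x) for x in data]
    return data

cached = {pid: compute(pid) for pid in source}
del cached[1]                                  # the node holding partition 1 fails

missing = set(source) - set(cached)
for pid in missing:                            # rebuild only what was lost
    cached[pid] = compute(pid)

print(cached[1])  # [31, 41]
```

Determinism is what makes this safe: replaying the same lineage over the same source always yields the same partition, so the rebuilt data is identical to what was lost.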
39. A Brief History
2002: Doug Cutting launches the Nutch project
2003: Google releases the GFS paper
2004: Google releases the MapReduce paper; Nutch adds a distributed file system
2005: MapReduce implemented in Nutch
2006: Hadoop spun out of the Nutch project
2008: Hadoop breaks the Terasort world record; Cloudera founded
2009: CDH and CDH2 released
2010: CDH3 released; Cloudera Manager released; HBase, ZooKeeper, Flume, and more added to CDH
2012: CDH4 released, adding HA; Impala (SQL on Hadoop) launched
2013: Sentry and Search launched
2014: CDH5 released
40. What is Apache Hadoop?
• An open-source implementation of Google's GFS and MapReduce papers
• An Apache Software Foundation top-level project
• Good at storing and processing all kinds of data
• Reliable storage at terabyte/petabyte scale on unreliable (cheap) hardware
• A distributed system for counting words :)
41. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
✓ Scalable
✓ Fault tolerant
✓ Distributed

Has the Flexibility to Store and Mine Any Type of Data
§ Ask questions across structured and unstructured data that were previously impossible to ask or solve
§ Not bound by a single schema

Excels at Processing Complex Data
§ Scale-out architecture divides workloads across multiple nodes
§ Flexible file system eliminates ETL bottlenecks

Scales Economically
§ Can be deployed on industry-standard hardware
§ Open source platform guards against vendor lock-in

CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework