2. About Me
• Diverse roles/languages and platforms.
• Middleware space in recent years.
• Worked for IBM/Grid Dynamics/GigaSpaces.
• Working as a Systems Engineer for Cloudera since last July.
• Work with and educate clients/prospects.
6. A brief review of MapReduce
[Diagram: many parallel Map tasks feeding into a smaller number of Reduce tasks]
Key advances by MapReduce:
• Data locality: automatic splitting of the computation and launch of mappers close to the data
• Fault tolerance: writing intermediate results to disk and restartable mappers mean it can run on commodity hardware
• Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems
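Purely as an illustration of that programming model (this is plain Scala, not the Hadoop API; the input data and helper names are made up), a job boils down to two functions, while the framework supplies the splitting, shuffling, and restarts:

import scala.collection.immutable.Seq

object WordCountModel {
  // "map": one input record -> zero or more (key, value) pairs
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // "reduce": one key plus all of its values -> an aggregated result
  def reduce(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("spark builds on mapreduce ideas", "mapreduce ideas scale")
    val mapped  = input.flatMap(line => map(line))            // map phase
    val grouped = mapped.groupBy { case (word, _) => word }   // shuffle / group by key
    val reduced = grouped.map { case (word, pairs) => reduce(word, pairs.map(_._2)) }
    reduced.foreach(println)                                  // reduce phase results
  }
}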
7. MapReduce sufficient for many classes of problems
[Diagram: Hive, Pig, Mahout, Crunch and Solr built on MapReduce]
A bit like haiku:
• Limited expressivity
• But can be used to approach diverse problem domains
8. BUT… can we do better?
Areas ripe for improvement:
• Launching mappers/reducers takes time
• Having to write to disk (replicated) between each step
• Reading data back from disk in the next step
• Each Map/Reduce step has to go back into the queue and get its resources
• Not in memory
• Cannot iterate fast
9. What is Spark?
Spark is a general-purpose computational framework that offers more flexibility than MapReduce. It is an implementation of a 2010 Berkeley paper [1].
Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet it retains linear scalability, fault tolerance, and data-locality-based computation.
[1] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
10. Spark: Easy and Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell
  (2–5× less code)
• Fast to run
  – General execution graphs
  – In-memory storage
  (Up to 10× faster on disk, 100× in memory)
11. Easy: Get Started Immediately
• Multi-language support
• Interactive shell
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
18. Driver and Workers
[Diagram: a Driver coordinating three Workers, each holding Data in RAM]
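A rough sketch of that split (the master URL, app name, and input path are illustrative, and it assumes the SparkConf-based API): the driver is simply the program that creates the SparkContext, and the context schedules tasks onto the workers, which hold the data partitions in their RAM.

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver: builds the SparkContext, which connects to the cluster
    // manager and schedules tasks onto executors running on the workers.
    val conf = new SparkConf()
      .setAppName("driver-and-workers-sketch")
      .setMaster("spark://master:7077")          // illustrative master URL
    val sc = new SparkContext(conf)

    // RDD partitions live in the workers' memory; the driver only receives
    // the (small) result of an action.
    val records = sc.textFile("hdfs://...").count()
    println(s"records: $records")

    sc.stop()
  }
}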
19. RDD – Resilient Distributed Dataset
• Read-only, partitioned collection of records
• Created through:
  – Transformation of data in storage
  – Transformation of other RDDs
• Contains the lineage needed to compute it from storage
• Lazy materialization
• Users control persistence and partitioning
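A small sketch of those two creation paths and of the user-controlled knobs (the input path and partition counts are illustrative):

// Created from data in storage (second argument is a hint for the number of partitions):
val raw = sc.textFile("hdfs://.../events", 8)

// Created by transforming another RDD; nothing is computed yet (lazy materialization):
val errors = raw.filter(_.contains("ERROR"))

// Users control partitioning and persistence explicitly:
val byKey = errors.map(line => (line.split("\t")(0), line)).groupByKey(16)
byKey.persist()     // keep the partitions around once they are materialized

byKey.count()       // the first action materializes (and then caches) the RDD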
21. Operations
• Transformations create a new RDD from an existing one
• Actions run a computation on an RDD and return a value
• Transformations are lazy.
• Actions materialize RDDs by computing their transformations.
• RDDs can be cached to avoid re-computing them.
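For illustration (the log path is hypothetical), the boundary between the two kinds of operations looks like this:

val lines      = sc.textFile("hdfs://.../app.log")   // transformation: nothing runs yet
val errors     = lines.filter(_.contains("ERROR"))   // transformation: still lazy
val firstWords = errors.map(_.split(" ")(0))         // transformation: still lazy

errors.cache()                                       // mark for caching (also lazy)

val howMany = errors.count()                         // action: runs the whole chain, caches 'errors'
val sample  = firstWords.take(5)                     // action: reuses the cached 'errors'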
22. Fault Tolerance
• RDDs contain lineage.
• Lineage = the source location plus the list of transformations applied
• Lost partitions can be re-computed from the source data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Lineage diagram: HDFS file → filter(func = startswith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
23. Caching
• persist() and cache() mark data for caching
• The RDD is cached after the first action
• Fault tolerant – lost partitions will be re-computed
• If there is not enough memory, some partitions will not be cached
• Future actions are performed on the cached partitions, so they are much faster
Use caching for iterative algorithms.
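A minimal sketch of the difference (the dataset and storage level are illustrative; cache() is shorthand for persist() with the default memory-only level):

import org.apache.spark.storage.StorageLevel

val points = sc.textFile("hdfs://.../points")    // illustrative input

points.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
// points.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill to disk when RAM is short

points.count()   // first action computes and caches the partitions
points.count()   // later actions reuse the cached partitions (much faster)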
31. Logistic Regression
• Read two sets of points
• Looks for a plane W that separates them
• Perform gradient descent:
  – Start with random W
  – On each iteration, sum a function of W over the data
  – Move W in a direction that improves it
33. Logistic Regression
val points = spark.textFile(...).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}

println("Final separating plane: " + w)
34. Conviva Use Case [1]
• Monitor online video consumption
• Analyze trends
Need to run tens of queries like this a day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
[1] http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/
35. Conviva With Spark
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)

val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache

val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
40. Spark Streaming
– Run continuous processing of data using Spark's core API.
– Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can re-use existing code for batch/offline processing.
– Adds "rolling window" operations, e.g. computing rolling averages or counts for data over the last five minutes (see the windowed sketch after the next slide).
– Example use cases:
  • "On-the-fly" ETL as data is ingested into Hadoop/HDFS.
  • Detecting anomalous behavior and triggering alerts.
  • Continuous reporting of summary metrics for incoming data.
41.
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Diagram: the tweets DStream flows through flatMap into the hashTags DStream and is saved, once per batch (t, t+1, t+2)]
Stream composed of small (1–10s) batch computations – "micro-batch" architecture
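To illustrate the "rolling window" operations mentioned on slide 40, a hedged sketch building on the hashTags DStream above (the window and slide durations are illustrative):

import org.apache.spark.streaming.{Minutes, Seconds}

// Rolling counts: how often each hash tag appeared over the last 5 minutes,
// recomputed every 10 seconds.
val addCounts: (Int, Int) => Int = _ + _
val rollingCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow(addCounts, Minutes(5), Seconds(10))

rollingCounts.print()   // emit the current window's counts for each batch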
45. Shark vs. Impala
• Shark inherits Hive's limitations, while Impala is purpose-built for SQL.
• Impala is significantly faster per our tests.
• Shark lacks security, audit/lineage, support for high concurrency, and operational tooling for configuration/monitoring/reporting/debugging.
• Interactive SQL is needed for connecting BI tools; Shark is not certified by any BI vendor.
48. Why Spark?
• Flexible like MapReduce
• High performance
• Machine learning and iterative algorithms
• Interactive data exploration
• Developer productivity
49. How Spark Works
• RDDs – resilient distributed datasets
• Lazy transformations
• Caching
• Fault tolerance by storing lineage
• Streams – micro-batches of RDDs
• Shark – Hive + Spark
Editor's notes
* MapReduce struggles with performance optimization for individual systems because of its design.
* Google has used both techniques in-house quite a bit, and the future will contain both.
Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
* If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
* Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
* If you want to define your own storage level (say, with a replication factor of 3 instead of 2), use the apply() factory method of the StorageLevel singleton object.
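A hedged sketch of what that selection looks like in code (the RDD is illustrative, only one persist call may be active per RDD, and the exact StorageLevel.apply signature has varied slightly across Spark versions):

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://.../sessions")    // illustrative input

// Default: deserialized objects in memory (same as cache()).
sessions.persist(StorageLevel.MEMORY_ONLY)

// If that doesn't fit, store serialized objects to save space.
// sessions.persist(StorageLevel.MEMORY_ONLY_SER)

// Replicated level (factor 2) for fast recovery without recomputation.
// sessions.persist(StorageLevel.MEMORY_ONLY_2)

// A custom level, e.g. memory-only, deserialized, replication factor 3, using the
// StorageLevel(useDisk, useMemory, deserialized, replication) factory as in Spark 1.x
// (newer versions add a useOffHeap flag).
// sessions.persist(StorageLevel(false, true, true, 3))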