Spark is a fast, general-purpose engine for large-scale data processing. It improves on MapReduce by caching data in memory, which makes iterative algorithms and interactive queries practical. Key features include in-memory caching, general execution graphs, APIs in Scala, Java, and Python, and integration with Hadoop.
3. MapReduce has been around for a while
It made distributed computing easier
But can we do better?
4. MapReduce Issues
• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to the job queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs either
6. Spark Features
• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates but does not depend on Hadoop
7. Why is it better?
• (Much) Faster than MR
• Iterative programming – Must have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph (see the sketch after this list):
• Map, map, reduce, reduce, reduce, map
• High productivity compared to MapReduce
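To make the "flexible execution graph" bullet concrete, here is a minimal sketch of a multi-stage pipeline running as one Spark job; the input path and record layout are made up for illustration, and sc is an existing SparkContext:

// Hypothetical multi-stage pipeline: map, map, reduce, map, sort in a single job,
// where classic MapReduce would need several jobs with disk writes in between.
val lines  = sc.textFile("hdfs://.../events.tsv")           // assumed input path
val fields = lines.map(_.split("\t"))                        // map
val pairs  = fields.map(f => (f(0), f(1).toLong))            // map: (key, amount), layout assumed
val totals = pairs.reduceByKey(_ + _)                        // reduce
val ranked = totals.map { case (k, v) => (v, k) }            // map
               .sortByKey(ascending = false)                 // another shuffle stage
ranked.take(10).foreach(println)                             // action triggers the whole graph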
13. How Spark runs on a Cluster
[Diagram: a Driver program coordinating multiple Worker nodes, each holding a partition of the Data in RAM]
14. Workflow
• SparkContext in driver connects to Master
• Master allocates resources for app on cluster
• SC acquires executors on worker nodes
• SC sends the app code (JAR) to executors
• SC sends tasks to executors
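As a sketch of that workflow in code, the driver might create its SparkContext like this (master URL, Spark home, and JAR path are placeholders):

import org.apache.spark.SparkContext

// Driver side: the SparkContext connects to the cluster master, which allocates
// resources; the listed application JAR is shipped to the acquired executors.
val sc = new SparkContext(
  "spark://master:7077",         // cluster master (assumed URL)
  "MyApp",                       // application name
  "/opt/spark",                  // Spark home on the workers (assumed)
  Seq("target/myapp.jar"))       // application code sent to executors
// From here on, sc sends tasks to those executors.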
19. Lazy transformation
• Find all lines that mention “MySQL”
• Keep only the timestamp portion of each line
• Set the date and hour as key, 1 as value
• Now reduce by key and sum the values
• Return the result as an Array so it can be printed
None of these steps run when they are declared; only the final step is an action, and only then does the executor “finally have something to do” and evaluate the whole chain (see the sketch below).
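A sketch of that chain as RDD code; the log path, field layout, and timestamp format are assumptions:

// Each transformation only records lineage; nothing is computed yet.
val lines      = sc.textFile("hdfs://.../server.log")        // assumed log location
val mysqlLines = lines.filter(_.contains("MySQL"))            // lines that mention "MySQL"
val timestamps = mysqlLines.map(_.split(" ")(0))              // assumed: timestamp is the first field
val keyed      = timestamps.map(ts => (ts.take(13), 1))       // (date+hour, 1), assuming "YYYY-MM-DDTHH..." format
val counts     = keyed.reduceByKey(_ + _)                     // sum the 1s per date+hour
val result     = counts.collect()                             // the action: only now does work happen
result.foreach(println)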
20. Persistence / Caching
• Store RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It is cached the first time an action computes it
• Use for iterative algorithms (see the sketch below)
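A minimal sketch of the pattern (the input path is a placeholder); the logistic regression example later in the deck uses the same idea for its iterations:

// Parse once, mark for caching, then reuse across several actions.
val nums = sc.textFile("hdfs://.../numbers.txt")   // assumed: one number per line
             .map(_.toDouble)
             .persist()                            // marked for caching; nothing computed yet

val count = nums.count()                           // first action: computes and caches the partitions
val sum   = nums.reduce(_ + _)                     // served from the in-memory copy, no re-read of HDFS
println("mean = " + sum / count)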
27. Logistic Regression
• Read two sets of points
• Look for a plane w that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
29. Logistic Regression
val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D)  // start with a random separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
30. Conviva Use-Case
• Monitor online video consumption
• Analyze trends
• Need to run tens of queries like this per day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
31. Conviva With Spark
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)
val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache()
val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap()
36. DStreams
• Stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can be applied to a stream
• Including MLlib, graph processing, etc.
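As a sketch of why that matters: each batch arrives as an RDD, so ordinary RDD code can be applied per interval via transform and foreachRDD (ssc, the host/port, and the per-line logic below are placeholders):

// Each micro-batch is an RDD, so plain RDD operations apply directly to it.
val events = ssc.socketTextStream("host", 9999)        // assumed TCP text source

val enriched = events.transform { rdd =>               // rdd: the RDD for one batch interval
  rdd.map(line => (line, line.length))                 // any RDD transformation works here
}

enriched.foreachRDD { rdd =>                           // per-batch RDD handed to arbitrary code
  println("batch size = " + rdd.count())
}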
40. Dstream Operators
• Transformations: produce a DStream from one or more parent streams
  • Stateless (independent per interval): map, reduce
  • Stateful (share data across intervals): window, incremental aggregation, time-skewed join (see the sketch after this list)
• Output: write data to an external system (e.g. save an RDD to HDFS): save, foreach
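A sketch of a stateful, windowed aggregation; the window and slide durations are arbitrary, and lines is assumed to be a DStream[String] like the one in the streaming word-count example below:

import org.apache.spark.streaming.Seconds

// Word counts over a sliding 30-second window, recomputed every 10 seconds.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedCounts.print()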
41. Fault Recovery
• Input from TCP, Flume, or Kafka is replicated on 2 nodes
• In case of failure, missing RDDs are re-computed from the surviving copy
• RDDs are deterministic, so any re-computation leads to the same result
• Transformations can therefore guarantee exactly-once semantics, even through failures
43. Example – Streaming WordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...
// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)
// Split the lines into words, count them
// print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
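To try this, args(0) is the Spark master URL and args(1)/args(2) name the host and port of a plain TCP text source; a simple way to provide one while testing is a netcat listener such as "nc -lk <port>" on that host.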
49. Spark + Shark Integration
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}
val trainedVector = logRegress(features.cache())
Transformations create new datasets from existing ones. Actions return a value to the driver program after running a computation on the dataset.
Transformations applied to an RDD don’t execute right away. Instead, they are remembered and computed only when an action requires a result.
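A minimal illustration of the distinction (the input path is assumed):

// Transformations: lazily build up lineage; nothing executes yet.
val errors = sc.textFile("hdfs://.../app.log")        // assumed path
               .filter(_.contains("ERROR"))           // transformation
               .map(_.toUpperCase)                    // transformation

// Actions: trigger execution and return a value to the driver.
val howMany = errors.count()                          // action -> Long
val sample  = errors.take(5)                          // action -> Array[String]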
Spark’s storage levels provide different trade-offs between memory usage and CPU efficiency. A reasonable selection process:
• If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option and lets operations on the RDDs run as fast as possible.
• If not, try MEMORY_ONLY_SER with a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
• Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
• Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you keep running tasks on the RDD without waiting to recompute a lost partition.
• To define your own storage level (say, with a replication factor of 3 instead of 2), use the apply() factory method of the StorageLevel singleton object.
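A sketch of picking a level explicitly; the dataset and the particular levels shown are illustrative, and the exact parameter list of StorageLevel.apply() varies slightly across Spark versions:

import org.apache.spark.storage.StorageLevel

// An RDD's storage level can only be set once; alternatives are shown as comments.
val parsed = sc.textFile("hdfs://.../big.txt").map(_.split(","))   // assumed input

// Custom level via the StorageLevel apply() factory: in-memory, deserialized,
// replicated 3 ways (parameter names/order depend on the Spark version).
val memoryOnly3 = StorageLevel(useDisk = false, useMemory = true,
                               deserialized = true, replication = 3)

parsed.persist(memoryOnly3)
// parsed.persist(StorageLevel.MEMORY_ONLY)       // default: deserialized objects in memory
// parsed.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized: more compact, slightly more CPU
// parsed.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that don't fit to disk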
Count the number of words received from a network server every second. socketTextStream returns a DStream of lines received from a TCP socket source. The lines DStream is transformed into a words DStream with the flatMap operation, which splits each line into words. The words DStream is then mapped to a DStream of (word, 1) pairs, which is finally reduced by key to get the word counts. wordCounts.print() prints the first ten of the counts generated every second.