Spark is a fast, general-purpose engine for large-scale data processing. It improves on MapReduce by caching data in memory, which makes iterative algorithms and interactive queries practical. Key features include in-memory caching, general execution graphs, APIs in Scala, Java, and Python, and integration with Hadoop.
3. MapReduce has been around for a while
It made distributed computing easier
But can we do better?
4. MapReduce Issues
• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to the job queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs either
6. Spark Features
• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates but does not depend on Hadoop
7. Why is it better?
• (Much) Faster than MR
• Iterative programming – Must have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph (see the sketch after this list):
• Map, map, reduce, reduce, reduce, map
• High productivity compared to MapReduce
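To make the "flexible execution graph" bullet concrete, here is a minimal sketch of a multi-stage pipeline running as one Spark job; the input path and record layout are made up for illustration, and sc is an existing SparkContext:

// Hypothetical multi-stage pipeline: map, map, reduce, map, sort in a single job,
// where classic MapReduce would need several jobs with disk writes in between.
val lines  = sc.textFile("hdfs://.../events.tsv")           // assumed input path
val fields = lines.map(_.split("\t"))                        // map
val pairs  = fields.map(f => (f(0), f(1).toLong))            // map: (key, amount), layout assumed
val totals = pairs.reduceByKey(_ + _)                        // reduce
val ranked = totals.map { case (k, v) => (v, k) }            // map
               .sortByKey(ascending = false)                 // another shuffle stage
ranked.take(10).foreach(println)                             // action triggers the whole graph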
13. How Spark runs on a Cluster
[Diagram: a Driver program coordinating multiple Worker nodes, each holding a partition of the Data in RAM]
14. Workflow
• SparkContext in driver connects to Master
• Master allocates resources for app on cluster
• SC acquires executors on worker nodes
• SC sends the app code (JAR) to executors
• SC sends tasks to executors
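As a sketch of that workflow in code, the driver might create its SparkContext like this (master URL, Spark home, and JAR path are placeholders):

import org.apache.spark.SparkContext

// Driver side: the SparkContext connects to the cluster master, which allocates
// resources; the listed application JAR is shipped to the acquired executors.
val sc = new SparkContext(
  "spark://master:7077",         // cluster master (assumed URL)
  "MyApp",                       // application name
  "/opt/spark",                  // Spark home on the workers (assumed)
  Seq("target/myapp.jar"))       // application code sent to executors
// From here on, sc sends tasks to those executors.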
19. Lazy transformation
• Find all lines that mention “MySQL”
• Keep only the timestamp portion of each line
• Set the date and hour as key, 1 as value
• Now reduce by key and sum the values
• Return the result as an Array so it can be printed
None of these steps run when they are declared; only the final step is an action, and only then does the executor “finally have something to do” and evaluate the whole chain (see the sketch below).
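A sketch of that chain as RDD code; the log path, field layout, and timestamp format are assumptions:

// Each transformation only records lineage; nothing is computed yet.
val lines      = sc.textFile("hdfs://.../server.log")        // assumed log location
val mysqlLines = lines.filter(_.contains("MySQL"))            // lines that mention "MySQL"
val timestamps = mysqlLines.map(_.split(" ")(0))              // assumed: timestamp is the first field
val keyed      = timestamps.map(ts => (ts.take(13), 1))       // (date+hour, 1), assuming "YYYY-MM-DDTHH..." format
val counts     = keyed.reduceByKey(_ + _)                     // sum the 1s per date+hour
val result     = counts.collect()                             // the action: only now does work happen
result.foreach(println)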
20. Persistence / Caching
• Store RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It is cached the first time an action computes it
• Use for iterative algorithms (see the sketch below)
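A minimal sketch of the pattern (the input path is a placeholder); the logistic regression example later in the deck uses the same idea for its iterations:

// Parse once, mark for caching, then reuse across several actions.
val nums = sc.textFile("hdfs://.../numbers.txt")   // assumed: one number per line
             .map(_.toDouble)
             .persist()                            // marked for caching; nothing computed yet

val count = nums.count()                           // first action: computes and caches the partitions
val sum   = nums.reduce(_ + _)                     // served from the in-memory copy, no re-read of HDFS
println("mean = " + sum / count)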
27. Logistic Regression
• Read two sets of points
• Look for a plane w that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
29. Logistic Regression
val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D)  // start with a random separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
30. Conviva Use-Case
• Monitor online video consumption
• Analyze trends
• Need to run tens of queries like this per day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
31. Conviva With Spark
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)
val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache()
val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap()
36. DStreams
• Stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can be applied to a stream
• Including MLlib, graph processing, etc.
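As a sketch of why that matters: each batch arrives as an RDD, so ordinary RDD code can be applied per interval via transform and foreachRDD (ssc, the host/port, and the per-line logic below are placeholders):

// Each micro-batch is an RDD, so plain RDD operations apply directly to it.
val events = ssc.socketTextStream("host", 9999)        // assumed TCP text source

val enriched = events.transform { rdd =>               // rdd: the RDD for one batch interval
  rdd.map(line => (line, line.length))                 // any RDD transformation works here
}

enriched.foreachRDD { rdd =>                           // per-batch RDD handed to arbitrary code
  println("batch size = " + rdd.count())
}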
40. Dstream Operators
• Transformations: produce a DStream from one or more parent streams
  • Stateless (independent per interval): map, reduce
  • Stateful (share data across intervals): window, incremental aggregation, time-skewed join (see the sketch after this list)
• Output: write data to an external system (e.g. save an RDD to HDFS): save, foreach
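A sketch of a stateful, windowed aggregation; the window and slide durations are arbitrary, and lines is assumed to be a DStream[String] like the one in the streaming word-count example below:

import org.apache.spark.streaming.Seconds

// Word counts over a sliding 30-second window, recomputed every 10 seconds.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedCounts.print()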
41. Fault Recovery
• Input from TCP, Flume, or Kafka is replicated on 2 nodes
• In case of failure, missing RDDs are re-computed from the surviving copy
• RDDs are deterministic, so any re-computation leads to the same result
• Transformations can therefore guarantee exactly-once semantics, even through failures
43. Example – Streaming WordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...
// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)
// Split the lines into words, count them
// print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
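To try this, args(0) is the Spark master URL and args(1)/args(2) name the host and port of a plain TCP text source; a simple way to provide one while testing is a netcat listener such as "nc -lk <port>" on that host.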
49. Spark + Shark Integration
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}
val trainedVector = logRegress(features.cache())
Transformations create new datasets from existing ones. Actions return a value to the driver program after running a computation on the dataset.
Transformations applied to an RDD don’t execute right away. Instead, they are remembered and computed only when an action requires a result.
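A minimal illustration of the distinction (the input path is assumed):

// Transformations: lazily build up lineage; nothing executes yet.
val errors = sc.textFile("hdfs://.../app.log")        // assumed path
               .filter(_.contains("ERROR"))           // transformation
               .map(_.toUpperCase)                    // transformation

// Actions: trigger execution and return a value to the driver.
val howMany = errors.count()                          // action -> Long
val sample  = errors.take(5)                          // action -> Array[String]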
Spark’s storage levels provide different trade-offs between memory usage and CPU efficiency. A reasonable selection process:
• If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option and lets operations on the RDDs run as fast as possible.
• If not, try MEMORY_ONLY_SER with a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
• Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
• Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you keep running tasks on the RDD without waiting to recompute a lost partition.
• To define your own storage level (say, with a replication factor of 3 instead of 2), use the apply() factory method of the StorageLevel singleton object.
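A sketch of picking a level explicitly; the dataset and the particular levels shown are illustrative, and the exact parameter list of StorageLevel.apply() varies slightly across Spark versions:

import org.apache.spark.storage.StorageLevel

// An RDD's storage level can only be set once; alternatives are shown as comments.
val parsed = sc.textFile("hdfs://.../big.txt").map(_.split(","))   // assumed input

// Custom level via the StorageLevel apply() factory: in-memory, deserialized,
// replicated 3 ways (parameter names/order depend on the Spark version).
val memoryOnly3 = StorageLevel(useDisk = false, useMemory = true,
                               deserialized = true, replication = 3)

parsed.persist(memoryOnly3)
// parsed.persist(StorageLevel.MEMORY_ONLY)       // default: deserialized objects in memory
// parsed.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized: more compact, slightly more CPU
// parsed.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that don't fit to disk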
Count the number of words received from a network server every second. socketTextStream returns a DStream of lines received from a TCP socket source. The lines DStream is transformed into a words DStream with the flatMap operation, which splits each line into words. The words DStream is then mapped to a DStream of (word, 1) pairs, which is finally reduced by key to get the word counts. wordCounts.print() prints the first ten of the counts generated every second.