In this talk from the 2015 Spark Summit East, the lead developer of Spark Streaming, @tathadas, talks about the state of Spark Streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data applications are being written. It is being rapidly adopted by companies across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detection, etc. These companies are adopting Spark Streaming mainly because:
– Its simple, declarative, batch-like API makes large-scale stream processing accessible to non-scientists.
– Its unified API and single processing engine (i.e., the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases – batch, interactive, and stream processing.
– Its strong, exactly-once semantics make it easier to express and debug complex business logic.
In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
1. Spark Streaming
The State of the Union and the Road Beyond
Tathagata “TD” Das
@tathadas
March 18, 2015
2. Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
4. Spark Streaming
Scalable, fault-tolerant stream processing system
Input sources: Kafka, Flume, Kinesis, Twitter, HDFS/S3
Output sinks: file systems, databases, dashboards
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
5. How does it work?
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
[Diagram: data streams → receivers → batches → results]
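As a minimal sketch of the setup this implies (the one-second batch interval and the socket source below are illustrative assumptions, not from the slides):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a StreamingContext that chops incoming data into 1-second batches.
val conf = new SparkConf().setAppName("StreamingSketch")
val ssc = new StreamingContext(conf, Seconds(1))

// Illustrative source: a receiver reading lines from a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)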
6. Streaming Word Count with Kafka

// create DStream with lines from Kafka
val kafka = KafkaUtils.createStream(ssc, kafkaParams, …)

// split lines into words
val words = kafka.map(_._2).flatMap(_.split(" "))

// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// print some counts on screen
wordCounts.print()

// start processing the stream
ssc.start()
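This snippet assumes ssc is the StreamingContext created earlier and that kafkaParams carries the Kafka connection settings; a real driver would typically also call ssc.awaitTermination() after ssc.start() so the application keeps running.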
9. Combine batch and streaming processing
Join data streams with static data sets

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
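Note that join works on pair RDDs, so this sketch presumes the stream's records and the static dataset share the same key type (an assumption the slide does not spell out); transform simply re-applies the supplied function to each batch's RDD as it arrives.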
10. Combine machine learning with streaming
Learn models offline, apply them online

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
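Since model is an ordinary object captured in the map closure, Spark serializes it and ships it to the executors with each task; event.feature presumes the stream's records expose a feature vector, which is an assumption about the event type rather than something shown on the slide.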
11. Combine SQL with streaming
Interactively query streaming data with SQL

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}

// Interactively query table
sqlContext.sql("select * from latestEvents")
12. A Brief History
Late 2011 – research idea at AMPLab, UC Berkeley
“We need to make Spark faster.” “Okay... umm, how??!?!”
14. A Brief History
Late 2011 – idea at AMPLab, UC Berkeley
Q2 2012 – prototype: rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms
Q3 2012 – Spark core improvements open sourced in Spark 0.6
Feb 2013 – alpha release: 7.7k lines, merged in 7 days, released with Spark 0.7
Jan 2014 – stable release: graduation with Spark 0.9
18. Python API
Core functionality in Spark 1.2, with sockets and files as sources
Kafka support in Spark 1.3
Other sources coming in the future

kafka = KafkaUtils.createStream(ssc, params, …)
lines = kafka.map(lambda x: x[1])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()
19. Streaming MLlib algorithms
Continuous learning and prediction on streaming data
StreamingLinearRegression in Spark 1.1
StreamingKMeans in Spark 1.2
StreamingLogisticRegression in Spark 1.3

val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(
  testDStream.map { lp => (lp.label, lp.features) }
).print()

https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
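For context: setRandomCenters(4, 0.0) initializes four-dimensional random centers with zero initial weight, and a decay factor of 1.0 weights all batches equally, while values below 1.0 make the model progressively forget older data.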
20. Kafka `Direct` Stream API
Earlier receiver-based approach for Kafka: a Kafka receiver built on the high-level consumer API
Requires replicated journals (write-ahead logs) to ensure zero data loss under driver failures
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
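A minimal sketch of enabling that protection in the receiver-based approach (the checkpoint path below is an illustrative assumption):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable the receiver write-ahead log (available since Spark 1.2).
val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(1))

// The log lives in the checkpoint directory, so checkpoint to a
// replicated file system such as HDFS.
ssc.checkpoint("hdfs:///checkpoints/app")  // illustrative path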
21. Kafka `Direct` Stream API
New direct approach for Kafka in Spark 1.3: instead of a receiver built on the high-level consumer, it uses Kafka's simple consumer API to read Kafka topics
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
22. Kafka `Direct` Stream API
New direct approach for Kafka in Spark 1.3 – treat Kafka like a file system
No receivers!
Directly query Kafka for the latest topic offsets, and read data the way files are read
Instead of ZooKeeper, Spark Streaming itself keeps track of the Kafka offsets
More efficient, fault-tolerant, exactly-once receiving of Kafka data
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
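A rough sketch of the resulting API, assuming an existing StreamingContext ssc; the broker address and topic name are illustrative:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// No receiver: each batch reads a computed offset range straight
// from the Kafka brokers.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // illustrative
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))  // topic name is an assumption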
27. Spark Packages
More contributions from the community in spark-packages:
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
29. Spark Summit 2014 Survey
40% of Spark users were using Spark Streaming in production or prototyping; another 39% were evaluating it
[Pie chart: Production 9%, Prototyping 31%, Evaluating 39%, Not using 21%]
32. Intel China builds big data solutions for large enterprises
Multiple streaming applications for top businesses
Real-time risk analysis for a top online payment company
Real-time deal and flow metric reporting for a top online shopping company
33. Complicated stream processing
SQL queries on streams
Join streams with large historical datasets
>1 TB/day passing through Spark Streaming
[Architecture: Kafka and RocketMQ feeding Spark Streaming on YARN, with HBase for storage]
34. One of the largest publishing and education companies; wants to accelerate its push into digital learning
Needed to combine student activities and domain events to continuously update the learning model of each student
Earlier implementation was in Storm, but has now moved to Spark Streaming
36. Leading advertising automation company with an exchange platform for in-feed ads
Processes clickstream data to optimize real-time bidding for ads
[Architecture: Spark Streaming on Mesos+Marathon, alongside Kinesis, RabbitMQ, SQS, MySQL, Redis]
37. Wants to learn trending movies and shows in real time
Currently in the middle of replacing one of their internal stream processing architectures with Spark Streaming
Tested the resiliency of Spark Streaming with Chaos Monkey
More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
38. Spark Streaming can handle all kinds of failures
Driver failures are handled with the Spark Standalone cluster's supervise mode
Worker, executor and receiver failures are handled automatically
More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
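A rough sketch of that recovery setup, assuming the app is submitted with spark-submit --deploy-mode cluster --supervise on a Standalone cluster; the checkpoint path and the job itself are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// On a clean start this builds a fresh context; after a driver restart,
// getOrCreate rebuilds the context (and DStream lineage) from the checkpoint.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs:///checkpoints/app")  // illustrative path
  ssc.socketTextStream("localhost", 9999).count().print()  // illustrative job
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)
ssc.start()
ssc.awaitTermination()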
40. Neuroscience @ Freeman Lab, Janelia Farm
Streaming machine learning algorithms on time series data of every neuron
Up to 2 TB/hour, and increasing with brain size
Up to 80 HPC nodes
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
41. Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
47. You can help!
Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year:
Write-ahead logs for zero data loss
New Kafka direct API
Let us know what you want to see in Spark Streaming – on the Spark user mailing list, or tweet it to me @tathadas
48. Industry adoption increasing rapidly
Community contributing very actively
More libraries, operational ease and performance on the roadmap
@tathadas
50. Typesafe survey of Spark users
2,136 developers, data scientists, and other tech professionals
http://java.dzone.com/articles/apache-spark-survey-typesafe-0
51. Typesafe survey of Spark users
65% of Spark users are interested in Spark Streaming
52. Typesafe survey of Spark users
2/3 of Spark users want to process event streams
54. • Big data solution provider for enterprises
• Multiple applications for different businesses
- Monitoring + optimizing online services of a Tier-1 bank
- Fraudulent transaction detection for a Tier-2 bank
• Kafka → Spark Streaming → Cassandra, MongoDB
• Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB
55. • Provides data analytics solutions for Communication Service Providers
- 4 of the 5 top mobile operators, 3 of the 4 top internet backbone providers
- Processes >50% of all US mobile traffic
• Multiple applications for different businesses
- Real-time anomaly detection in cell tower traffic
- Real-time call quality optimizations
• Kafka → Spark Streaming
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
56. • Runs claims processing applications for healthcare providers
• Predictive models can look for claims that are likely to be held up for approval
• Spark Streaming allows model scoring in seconds instead of hours
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims