Spark Streaming with C*
jacek.lewandowski@datastax.com
…applies where you need 
near-realtime data analysis
Spark vs Spark Streaming
• Spark: zillions of bytes, static dataset
• Spark Streaming: gigabytes per second, stream of data
What can you do with it?
• applications, sensors, web, mobile phones
• intrusion detection, malfunction detection, site analytics, network metrics analysis
• fraud detection, dynamic process optimisation
• recommendations, location-based ads
• log processing, supply chain planning, sentiment analysis, spying
Almost Whatever Source You Want
Almost Whatever Destination You Want
so, let’s see how it works
DStream - a continuous sequence of micro batches
DStream: μBatch (ordinary RDD), μBatch (ordinary RDD), μBatch (ordinary RDD), …
Processing of DStream = Processing of μBatches, RDDs
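The idea can be sketched without Spark at all. The snippet below is only an illustration in plain Scala collections, not Spark's implementation: `Message`, `toMicroBatches` and `processAll` are hypothetical names, and, as a simplification, an interval with no messages yields no batch.

```scala
object MicroBatching {
  // Hypothetical message type: payload plus the time (ms) at which it was received.
  final case class Message(receivedAtMs: Long, payload: String)

  /** Groups messages into consecutive micro batches of `batchMs` milliseconds.
    * Simplification: intervals without messages produce no batch here. */
  def toMicroBatches(messages: Seq[Message], batchMs: Long): Seq[Seq[Message]] =
    messages
      .groupBy(_.receivedAtMs / batchMs) // batch index = received time / interval
      .toSeq
      .sortBy(_._1)                      // put the batches back in time order
      .map(_._2)

  /** Processing the stream = applying the same function to every micro batch. */
  def processAll[A](batches: Seq[Seq[Message]])(f: Seq[Message] => A): Seq[A] =
    batches.map(f)
}
```

With a 1000 ms interval, messages received at 100, 900, 1200 and 2500 ms end up in three micro batches, and `processAll(batches)(_.size)` counts the messages of each batch separately.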
Receiver: interface between different stream sources and Spark
Block Manager (inside the Spark memory boundary): replication and building μBatches
[Figure: messages 1 to 9 pass from the Receiver to the Block Manager as blocks of input data; the blocks form a μBatch, and the μBatch is divided into partitions]
Ingestion from multiple sources
[Figure: three sources, each with its own "Receiving, μBatch building" stage, producing μBatches along a 0s, 1s, 2s timeline]
A well-worn example 
• ingesting text messages
• splitting them into separate words
• counting the occurrences of words within 5-second windows
• saving the word counts from the last 5 seconds to Cassandra every 5 seconds, and displaying the first few results on the console
how to do that ? 
well…
Yes, it is that easy 
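import org.apache.spark.SparkContext._           // pair-RDD operations such as sortByKey (Spark 1.x)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector.streaming._  // adds saveToCassandra to DStream

case class WordCount(time: Long, word: String, count: Int)

// `stream` is a DStream of (key, paragraph) pairs obtained from a receiver
val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] = words.countByValue()

// sort every μBatch by descending count and tag each row with the batch time
val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map {
    case (word, count) =>
      (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] = mappedWordCounts
    .sortByKey(ascending = false).values
  topWordCountsRDD
}

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()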
DStream stateless operators 
(quick recap) 
• map 
• flatMap 
• filter 
• repartition 
• union 
• count 
• countByValue 
• reduce 
• reduceByKey 
• joins 
• cogroup 
• transform 
• transformWith
DStream[Bean].count()
[Figure: every 1s μBatch is replaced by its element count, e.g. 4 and 3]
DStream[Orange].union(DStream[Apple])
[Figure: the 1s μBatches of both streams are merged by union]
Other stateless operations 
• join(DStream[(K, W)]) 
• leftOuterJoin(DStream[(K, W)]) 
• rightOuterJoin(DStream[(K, W)]) 
• cogroup(DStream[(K, W)]) 
are applied on pairs of corresponding μBatches
transform, transformWith 
• DStream[T].transform(RDD[T] => RDD[U]): DStream[U] 
• DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V] 
allow you to create new stateless operators
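A rough model of these two operators, written with plain Scala collections rather than Spark, may help. It is only a sketch: `Batch`, `MicroBatches` and the object name are hypothetical stand-ins for RDD and DStream, and the stream is finite.

```scala
object StatelessOps {
  type Batch[T] = Seq[T]               // stands in for an RDD
  type MicroBatches[T] = Seq[Batch[T]] // stands in for a (finite) DStream

  /** Applies `f` to every micro batch independently. */
  def transform[T, U](s: MicroBatches[T])(f: Batch[T] => Batch[U]): MicroBatches[U] =
    s.map(f)

  /** Pairs up corresponding micro batches of two streams and applies `f` to each pair. */
  def transformWith[T, U, V](a: MicroBatches[T], b: MicroBatches[U])(
      f: (Batch[T], Batch[U]) => Batch[V]): MicroBatches[V] =
    a.zip(b).map { case (x, y) => f(x, y) }
}
```

No batch ever sees data from another time slot, which is why these operators are stateless. Joins and cogroup follow the same pattern of working on pairs of corresponding μBatches.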
DStream[Blue].transformWith 
(DStream[Red], …): DStream[Violet] 
1-A 2-A 3-A 
1-B 2-B 3-B 
1-A x 1-B 2-A x 2-B 3-A x 3-B
Windowing
[Figure: timeline from 0s to 7s with the window and slide intervals marked]
By default: window = slide = μBatch duration
Windowing
[Figure: timeline from 0s to 7s with a window longer than the slide]
• The resulting DStream consists of 3-second μBatches
• Each resulting μBatch overlaps the preceding one by 1 second
Windowing
[Figure: input messages 1 to 8; consecutive windows contain messages 1 to 6 and 3 to 8]
• A μBatch appears in the output stream every 1s
• It contains messages collected during 3s
DStream window operators 
• window(Duration, Duration) 
• countByWindow(Duration, Duration) 
• reduceByWindow(Duration, Duration, (T, T) => T) 
• countByValueAndWindow(Duration, Duration) 
• groupByKeyAndWindow(Duration, Duration) 
• reduceByKeyAndWindow((V, V) => V, Duration, Duration)
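The effect of the two Duration parameters (window length, then slide) can be imitated on a plain list of micro batches. This is a sketch under stated assumptions, not Spark code: `windowed` is a hypothetical helper, both parameters are counted in micro batches instead of time, and only complete windows are produced.

```scala
object Windowing {
  /** Combines consecutive micro batches into windows.
    * `windowLen` and `slide` are expressed in numbers of micro batches.
    * Simplification: only complete windows are produced. */
  def windowed[T](batches: Seq[Seq[T]], windowLen: Int, slide: Int): List[List[T]] = {
    require(windowLen > 0 && slide > 0, "window and slide must be positive")
    batches
      .sliding(windowLen, slide)
      .filter(_.size == windowLen) // drop a trailing incomplete window
      .map(_.flatten.toList)
      .toList
  }
}
```

For 1s micro batches `List(1, 2)`, `List(3, 4)`, `List(5, 6)`, `List(7, 8)`, a window of 3 and a slide of 1 give the windows 1 to 6 and 3 to 8: a new μBatch for every slide, each holding 3s of messages.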
Let’s modify the example 
• ingesting text messages
• splitting them into separate words
• counting the occurrences of words within 10-second windows
• saving the word counts from the last 10 seconds to Cassandra every 2 seconds, and displaying the first few results on the console
Yes, it is still easy to do 
case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))

// count words over the last 10 seconds, recomputed every 2 seconds
val wordCounts: DStream[(String, Long)] = words.countByValueAndWindow(Seconds(10), Seconds(2))

// sort every windowed μBatch by descending count and tag each row with the batch time
val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map {
    case (word, count) =>
      (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] = mappedWordCounts
    .sortByKey(ascending = false).values
  topWordCountsRDD
}

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()
DStream stateful operator 
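• DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
New values in the current μBatch: A → 1, B → 2, A → 3, C → 4, A → 5, B → 6
State from the previous μBatch: A → 7, B → 8, C → 9
• R1 = f(Seq(1, 3, 5), Some(7))
• R2 = f(Seq(2, 6), Some(8))
• R3 = f(Seq(4), Some(9))
Resulting state: A → R1, B → R2, C → R3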
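The same step can be replayed on plain Scala collections. The sketch below is not Spark's implementation: `StatefulOps`, its `updateStateByKey` and `sumUpdate` are hypothetical helpers that process a single micro batch against an in-memory state map, with a running sum as the update function.

```scala
object StatefulOps {
  /** One step of an updateStateByKey-like operation over a single micro batch.
    * `f` receives the new values of a key and its previous state (if any);
    * returning None drops the key from the state. */
  def updateStateByKey[K, V, S](state: Map[K, S], batch: Seq[(K, V)])(
      f: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      f(newValues.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  /** Example update function: keep a running sum per key. */
  def sumUpdate(values: Seq[Long], state: Option[Long]): Option[Long] =
    Some(state.getOrElse(0L) + values.sum)
}
```

With the values from the slide and `sumUpdate` as `f`, R1 = 7 + (1 + 3 + 5) = 16, R2 = 8 + (2 + 6) = 16 and R3 = 9 + 4 = 13.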
Total word count example 
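import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations such as updateStateByKey (Spark 1.x)

case class WordCount(time: Long, word: String, count: Int)

// add the counts from the current μBatch to the running total kept as state
def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
  val sum = counts.sum
  Some(state.getOrElse(0L) + sum)
}

// updateStateByKey requires checkpointing to be enabled on the StreamingContext
val totalWords: DStream[(String, Long)] =
  stream.map { case (_, paragraph) => paragraph }
    .flatMap(_.split("""\s+"""))
    .countByValue()
    .updateStateByKey(update)

// sort every μBatch of totals by descending count and tag each row with the batch time
val topTotalWordCounts: DStream[WordCount] =
  totalWords.transform((rdd, time) =>
    rdd.map { case (word, count) =>
      (count, WordCount(time.milliseconds, word, count.toInt))
    }.sortByKey(ascending = false).values
  )

topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
topTotalWordCounts.print()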
Obtaining DStreams 
• ZeroMQ 
• Kinesis 
• HDFS compatible file system 
• Akka actor 
• Twitter 
• MQTT 
• Kafka 
• Socket 
• Flume 
• …
Particular DStreams are available in separate modules
GroupId            ArtifactId                         Latest Version
org.apache.spark   spark-streaming-kinesis-asl_2.10   1.1.0
org.apache.spark   spark-streaming-mqtt_2.10          1.1.0
org.apache.spark   spark-streaming-zeromq_2.10        1.1.0
org.apache.spark   spark-streaming-flume_2.10         1.1.0
org.apache.spark   spark-streaming-flume-sink_2.10    1.1.0
org.apache.spark   spark-streaming-kafka_2.10         1.1.0
org.apache.spark   spark-streaming-twitter_2.10       1.1.0
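As an illustration, one of these modules could be pulled into an sbt build as shown below. This is a build.sbt fragment, not part of the original deck. It assumes a Scala 2.10 project and picks the Kafka module arbitrarily; the core `spark-streaming_2.10` artifact is not in the table above and is added here as an assumption.

```scala
// build.sbt fragment (assumes a Scala 2.10 build; versions taken from the table above)
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10"       % "1.1.0", // core streaming module (assumed)
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.1.0"  // Kafka DStream support
)
```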
If something goes wrong…
Fault tolerance
• The sequence of transformations is known to Spark Streaming
• μBatches are replicated once they are received
• Lost data can be recomputed
But there are pitfalls 
• Spark replicates blocks, not single messages
• It is up to a particular receiver to decide whether to form the block from a single message or to collect more messages before pushing the block
• The data collected in the receiver before the block is pushed will be lost in case of failure of the receiver
• Typical tradeoff - efficiency vs fault tolerance
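The tradeoff can be made concrete with a toy model in plain Scala. It is not how a Spark receiver is implemented: `pushedAndLost` is a hypothetical helper, and it assumes the receiver pushes a block only once `blockSize` messages have been buffered.

```scala
object BlockBuffering {
  /** Splits the received messages into the blocks that were already pushed
    * (and could therefore be replicated) and the messages that were still
    * buffered in the receiver at the moment it failed. */
  def pushedAndLost[T](received: Seq[T], blockSize: Int): (List[List[T]], List[T]) = {
    require(blockSize > 0, "blockSize must be positive")
    val (full, partial) =
      received.toList.grouped(blockSize).toList.partition(_.size == blockSize)
    (full, partial.flatten)
  }
}
```

With 9 received messages and a block size of 4, two blocks are pushed and the ninth message is lost on failure. With a block size of 1 nothing is lost, at the cost of pushing a block for every single message.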
Built-in receivers breakdown
Pushing single messages   Can do both   Pushing whole blocks
Kafka                     Akka          RawNetworkReceiver
Twitter                   Custom        ZeroMQ
Socket
MQTT
Thank you!
Questions?
http://spark.apache.org/ 
https://github.com/datastax/spark-cassandra-connector 
http://cassandra.apache.org/ 
http://www.datastax.com/

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 

Dernier (20)

RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 

Spark Streaming with Cassandra: Real-time Data Analysis

  • 1. spark streaming with C* jacek.lewandowski@datastax.com
  • 2. …applies where you need near-realtime data analysis
  • 3. Spark vs Spark Streaming: Spark works on a static dataset (zillions of bytes); Spark Streaming works on a stream of data (gigabytes per second)
  • 4. What can you do with it? Sources: applications, sensors, web, mobile phones. Uses: intrusion detection, malfunction detection, site analytics, network metrics analysis, fraud detection, dynamic process optimisation, recommendations, location based ads, log processing, supply chain planning, sentiment analysis, spying
  • 6. Almost whatever source you want, almost whatever destination you want
  • 9. so, let’s see how it works
  • 10. DStream: a continuous sequence of micro batches. Each μBatch is an ordinary RDD, so processing of a DStream = processing of its μBatches (RDDs)
  • 11. Receiver: the interface between different stream sources and Spark (diagram: messages 1 to 9 arriving at the receiver)
  • 12. The receiver passes the messages across the Spark memory boundary to the Block Manager
  • 13. Block Manager: replication and building μBatches
  • 14-16. Inside the Spark memory boundary, the Block Manager holds blocks of input data (messages 1 to 9); a μBatch is made of blocks
  • 17-19. μBatch made of blocks (diagram: the μBatch shown as three partitions)
  • 20. Ingestion from multiple sources (diagram: three parallel "Receiving, μBatch building" stages)
  • 21. Ingestion from multiple sources (diagram: the three stages producing μBatches along a timeline marked 0s, 1s, 2s)
  • 22. A well-worn example • ingest text messages • split them into separate words • count the occurrences of each word within 5-second windows • every 5 seconds, save the word counts from the last 5 seconds to Cassandra and display the first few results on the console
  • 23. How to do that? Well…
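  • 24. Yes, it is that easy

        import org.apache.spark.SparkContext._                // sortByKey on pair RDDs (Spark 1.x)
        import org.apache.spark.rdd.RDD
        import org.apache.spark.streaming.dstream.DStream
        import com.datastax.spark.connector.streaming._       // saveToCassandra on DStreams

        case class WordCount(time: Long, word: String, count: Int)

        // `stream` is assumed to be a DStream of (key, paragraph) pairs created elsewhere
        val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
        val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
        val wordCounts: DStream[(String, Long)] = words.countByValue()

        // Sort each μBatch by count, highest first
        val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
          val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
            (count.toInt, WordCount(time.milliseconds, word, count.toInt))
          }
          val topWordCountsRDD: RDD[WordCount] =
            mappedWordCounts.sortByKey(ascending = false).values
          topWordCountsRDD
        }

        topWordCounts.saveToCassandra("meetup", "word_counts")
        topWordCounts.print()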
  • 25. DStream stateless operators (quick recap) • map • flatMap • filter • repartition • union • count • countByValue • reduce • reduceByKey • joins • cogroup • transform • transformWith
  • 29. Other stateless operations • join(DStream[(K, W)]) • leftOuterJoin(DStream[(K, W)]) • rightOuterJoin(DStream[(K, W)]) • cogroup(DStream[(K, W)]) are applied on pairs of corresponding μBatches
  • 30. transform, transformWith • DStream[T].transform(RDD[T] => RDD[U]): DStream[U] • DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V] allow you to create new stateless operators
  • 31-33. DStream[Blue].transformWith(DStream[Red], …): DStream[Violet] (diagram: μBatches 1-A, 2-A, 3-A of the first stream are paired with μBatches 1-B, 2-B, 3-B of the second, giving 1-A x 1-B, 2-A x 2-B, 3-A x 3-B)
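The pairing shown in the transformWith diagram can be sketched without Spark. This is only a model under stated assumptions: each stream is represented as a plain `Seq` of μBatches, each μBatch as a `Seq` of elements, and `TransformWithSketch` and its method are hypothetical names, not part of the Spark API.

```scala
object TransformWithSketch {
  // Model only: a "stream" is a Seq of μBatches, a μBatch is a Seq of elements.
  // The i-th μBatch of `left` is combined with the i-th μBatch of `right`.
  def transformWith[T, U, V](left: Seq[Seq[T]], right: Seq[Seq[U]])(
      f: (Seq[T], Seq[U]) => Seq[V]): Seq[Seq[V]] =
    left.zip(right).map { case (l, r) => f(l, r) }

  // Example combiner: cross product of the two corresponding μBatches.
  def cross(l: Seq[String], r: Seq[String]): Seq[String] =
    for (a <- l; b <- r) yield s"$a x $b"
}
```

Calling `TransformWithSketch.transformWith(blue, red)(TransformWithSketch.cross)` on two three-batch streams yields one combined μBatch per pair of corresponding inputs, which is the behaviour the diagram describes.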
  • 34-36. Windowing (timeline 0s to 7s): by default, window = slide = μBatch duration
  • 37-39. Windowing (timeline 0s to 7s, with the window longer than the slide): the resulting DStream consists of 3-second μBatches! Each resulting μBatch overlaps the preceding one by 1 second
  • 40-41. Windowing (diagram: messages 1 to 8; one window covers messages 1 to 6, the next covers 3 to 8): a μBatch appears in the output stream every 1s! It contains the messages collected during 3s
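A minimal sketch of the sliding window from the last diagram, written in plain Scala rather than against the Spark API. The assumptions are mine: μBatches last one second and hold two messages each, window and slide are expressed as numbers of μBatches, and only full windows are emitted. `WindowSketch` is a hypothetical name.

```scala
object WindowSketch {
  // Model only: `batches` is the input stream, one inner Seq per μBatch.
  // `window` and `slide` are counted in μBatches, not in seconds.
  def windowed[T](batches: Seq[Seq[T]], window: Int, slide: Int): List[Seq[T]] =
    batches
      .sliding(window, slide)        // consecutive groups of μBatches
      .filter(_.size == window)      // drop a trailing group that is not full
      .map(_.flatten)                // merge the group into one output μBatch
      .toList
}
```

With four input μBatches `(1, 2)`, `(3, 4)`, `(5, 6)`, `(7, 8)`, a window of 3 and a slide of 1, the output μBatches cover messages 1 to 6 and 3 to 8, matching the diagram.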
  • 42. DStream window operators • window(Duration, Duration) • countByWindow(Duration, Duration) • reduceByWindow(Duration, Duration, (T, T) => T) • countByValueAndWindow(Duration, Duration) • groupByKeyAndWindow(Duration, Duration) • reduceByKeyAndWindow((V, V) => V, Duration, Duration)
  • 43. Let’s modify the example • ingest text messages • split them into separate words • count the occurrences of each word within 10-second windows • every 2 seconds, save the word counts from the last 10 seconds to Cassandra and display the first few results on the console
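  • 44. Yes, it is still easy to do

        import org.apache.spark.SparkContext._                // sortByKey on pair RDDs (Spark 1.x)
        import org.apache.spark.rdd.RDD
        import org.apache.spark.streaming.Seconds
        import org.apache.spark.streaming.dstream.DStream
        import com.datastax.spark.connector.streaming._       // saveToCassandra on DStreams

        case class WordCount(time: Long, word: String, count: Int)

        // `stream` is assumed to be a DStream of (key, paragraph) pairs created elsewhere
        val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
        val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
        // 10-second window, evaluated every 2 seconds
        val wordCounts: DStream[(String, Long)] =
          words.countByValueAndWindow(Seconds(10), Seconds(2))

        val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
          val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
            (count.toInt, WordCount(time.milliseconds, word, count.toInt))
          }
          val topWordCountsRDD: RDD[WordCount] =
            mappedWordCounts.sortByKey(ascending = false).values
          topWordCountsRDD
        }

        topWordCounts.saveToCassandra("meetup", "word_counts")
        topWordCounts.print()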
  • 45. DStream stateful operator • DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]. Example: the incoming μBatch holds (A, 1), (B, 2), (A, 3), (C, 4), (A, 5), (B, 6) and the previous state is A = 7, B = 8, C = 9 • R1 = f(Seq(1, 3, 5), Some(7)) • R2 = f(Seq(2, 6), Some(8)) • R3 = f(Seq(4), Some(9)). The new state is A = R1, B = R2, C = R3
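The state update on the slide above can be traced in plain Scala. This is a sketch of one update step only, assuming the state is kept in an ordinary `Map`; `StateSketch` and its methods are hypothetical names and not the Spark implementation.

```scala
object StateSketch {
  // Model only: applies `f` once per key, passing the new values seen for that
  // key in the current μBatch and the previous state, if any.
  def updateStateByKey[K, V, S](batch: Seq[(K, V)], state: Map[K, S])(
      f: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ newValues.keySet
    // A key is dropped from the state when `f` returns None.
    keys.flatMap { k =>
      f(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  // Example update function: running sum per key.
  def sum(values: Seq[Int], previous: Option[Int]): Option[Int] =
    Some(previous.getOrElse(0) + values.sum)
}
```

With the slide's data and `sum` as the update function, R1 = 1 + 3 + 5 + 7 = 16, R2 = 2 + 6 + 8 = 16 and R3 = 4 + 9 = 13.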
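  • 46. Total word count example

        import org.apache.spark.SparkContext._                // sortByKey on pair RDDs (Spark 1.x)
        import org.apache.spark.streaming.StreamingContext._  // pair DStream operations (Spark 1.x)
        import org.apache.spark.streaming.dstream.DStream
        import com.datastax.spark.connector.streaming._       // saveToCassandra on DStreams

        case class WordCount(time: Long, word: String, count: Int)

        // Adds the counts from the current μBatch to the running total for a word
        def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
          val sum = counts.sum
          Some(state.getOrElse(0L) + sum)
        }

        // updateStateByKey requires checkpointing to be enabled on the StreamingContext
        val totalWords: DStream[(String, Long)] =
          stream.map { case (_, paragraph) => paragraph }
            .flatMap(_.split("""\s+"""))
            .countByValue()
            .updateStateByKey(update)

        val topTotalWordCounts: DStream[WordCount] = totalWords.transform { (rdd, time) =>
          rdd.map { case (word, count) =>
            (count, WordCount(time.milliseconds, word, count.toInt))
          }.sortByKey(ascending = false).values
        }

        topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
        topTotalWordCounts.print()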
  • 47. Obtaining DStreams • ZeroMQ • Kinesis • HDFS compatible file system • Akka actor • Twitter • MQTT • Kafka • Socket • Flume • …
  • 48. Particular DStreams are available in separate modules

        GroupId            ArtifactId                         Latest Version
        org.apache.spark   spark-streaming-kinesis-asl_2.10   1.1.0
        org.apache.spark   spark-streaming-mqtt_2.10          1.1.0
        org.apache.spark   spark-streaming-zeromq_2.10        1.1.0
        org.apache.spark   spark-streaming-flume_2.10         1.1.0
        org.apache.spark   spark-streaming-flume-sink_2.10    1.1.0
        org.apache.spark   spark-streaming-kafka_2.10         1.1.0
        org.apache.spark   spark-streaming-twitter_2.10       1.1.0
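As an illustration of pulling in one of these modules, a build.sbt fragment might look like the following. The choice of the Kafka module, the `provided` scope for the core streaming artifact and the exact Scala patch version are assumptions for the sake of the example; only the artifact names and the 1.1.0 version come from the table.

```scala
// build.sbt (fragment, a sketch rather than a complete build definition)
scalaVersion := "2.10.4" // assumed; any 2.10.x release matches the _2.10 artifacts

libraryDependencies ++= Seq(
  // Core streaming API; "provided" assumes the cluster supplies Spark at runtime
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  // One source-specific module from the table, resolved as spark-streaming-kafka_2.10
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
)
```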
  • 49. If something goes wrong…
  • 50. Fault tolerance • The sequence of transformations is known to Spark Streaming • μBatches are replicated once they are received • Lost data can be recomputed
  • 51. But there are pitfalls • Spark replicates blocks, not single messages • It is up to a particular receiver to decide whether to form the block from a single message or to collect more messages before pushing the block • The data collected in the receiver before the block is pushed will be lost in case of failure of the receiver • Typical tradeoff - efficiency vs fault tolerance
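The tradeoff between efficiency and fault tolerance can be made concrete with a toy model in plain Scala. Everything here is an assumption for illustration: the receiver buffers a fixed number of messages per block, a block is only safe once it has been pushed, and the receiver fails after a given number of messages. `ReceiverSketch` is a hypothetical name and does not reflect how any real receiver is implemented.

```scala
object ReceiverSketch {
  // Model only: returns (messages safely pushed as whole blocks,
  //                      messages still buffered in the receiver and therefore lost)
  // for a receiver that fails after receiving `failAfter` messages.
  def simulate(messages: List[Int], blockSize: Int, failAfter: Int): (List[Int], List[Int]) = {
    val received = messages.take(failAfter)
    val pushedCount = (received.size / blockSize) * blockSize // complete blocks only
    received.splitAt(pushedCount)
  }
}
```

With nine messages, a block size of 4 and a failure after the seventh message, messages 1 to 4 survive and 5 to 7 are lost. With a block size of 1 nothing is lost, at the cost of pushing one block per message.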
  • 52. Built-in receivers breakdown

        Pushing single messages   Can do both   Pushing whole blocks
        Kafka                     Akka          RawNetworkReceiver
        Twitter                   Custom        ZeroMQ
        Socket
        MQTT
  • 53. Thank you! Questions? Links: http://spark.apache.org/ https://github.com/datastax/spark-cassandra-connector http://cassandra.apache.org/ http://www.datastax.com/