Apache Cassandra & Apache Spark 
for Time Series Data 
Patrick McFadin 
Chief Evangelist for Apache Cassandra, DataStax 
@PatrickMcFadin 
©2013 DataStax Confidential. Do not distribute without consent. 
Cassandra for Applications 
APACHE 
CASSANDRA
Cassandra is… 
• Shared nothing 
• Masterless peer-to-peer 
• Based on Dynamo
Scaling 
• Add nodes to scale 
• Millions Ops/s 
[Chart: throughput (ops/sec) comparison — Cassandra vs. HBase, Redis and MySQL]
Uptime 
• Built to replicate 
• Resilient to failure 
• Always on 
Replication 
[Diagram: two data centers, each with RF=3. DC1 has four nodes — 10.0.0.1 (tokens 00-25), 10.0.0.2 (26-50), 10.0.0.3 (51-75), 10.0.0.4 (76-100) — and DC2 mirrors it with 10.10.0.1 through 10.10.0.4 owning the same ranges. A client insert is replicated asynchronously to the local replicas in DC1 and asynchronously over the WAN to DC2.]
Data Model 
• Familiar syntax 
• Collections 
• PRIMARY KEY for uniqueness 
CREATE TABLE videos ( 
videoid uuid, 
userid uuid, 
name varchar, 
description varchar, 
location text, 
location_type int, 
preview_thumbnails map<text,text>, 
tags set<varchar>, 
added_date timestamp, 
PRIMARY KEY (videoid) 
);
Data Model - User Defined Types 
• Complex data in one place 
• No multi-gets (multi-partitions) 
• Nesting! 
CREATE TYPE address ( 
street text, 
city text, 
zip_code int, 
country text, 
cross_streets set<text> 
);
Data Model - Updated 
• Now video_metadata is embedded in videos 
CREATE TYPE video_metadata ( 
height int, 
width int, 
video_bit_rate set<text>, 
encoding text 
); 
CREATE TABLE videos ( 
videoid uuid, 
userid uuid, 
name varchar, 
description varchar, 
location text, 
location_type int, 
preview_thumbnails map<text,text>, 
tags set<varchar>, 
metadata set <frozen<video_metadata>>, 
added_date timestamp, 
PRIMARY KEY (videoid) 
);
Data Model - Storing JSON 
{ 
"productId": 2, 
"name": "Kitchen Table", 
"price": 249.99, 
"description" : "Rectangular table with oak finish", 
"dimensions": { 
"units": "inches", 
"length": 50.0, 
"width": 66.0, 
"height": 32 
}, 
"categories": { 
{ 
"category" : "Home Furnishings" { 
"catalogPage": 45, 
"url": "/home/furnishings" 
}, 
{ 
"category" : "Kitchen Furnishings" { 
"catalogPage": 108, 
"url": "/kitchen/furnishings" 
} 
} 
} 
CREATE TYPE dimensions ( 
units text, 
length float, 
width float, 
height float 
); 
CREATE TYPE category ( 
catalogPage int, 
url text 
); 
CREATE TABLE product ( 
productId int, 
name text, 
price float, 
description text, 
dimensions frozen <dimensions>, 
categories map <text, frozen <category>>, 
PRIMARY KEY (productId) 
);
Why… 
Cassandra for Time Series? 
Spark as a great addition to Cassandra?
Example 1: Weather Station 
• Weather station collects data 
• Cassandra stores in sequence 
• Application reads in sequence
Use case 
• Get all data for one weather station 
• Get data for a single date and time 
• Get data for a range of dates and times 
• Store data per weather station 
• Store time series in order: first to last 
Needed Queries 
Data Model to support queries
Data Model 
• Weather Station Id and Time 
are unique 
• Store as many as needed 
CREATE TABLE temperature ( 
weather_station text, 
year int, 
month int, 
day int, 
hour int, 
temperature double, 
PRIMARY KEY ((weather_station),year,month,day,hour) 
); 
INSERT INTO temperature(weather_station,year,month,day,hour,temperature) 
VALUES ('10010:99999',2005,12,1,7,-5.6); 
INSERT INTO temperature(weather_station,year,month,day,hour,temperature) 
VALUES ('10010:99999',2005,12,1,8,-5.1); 
INSERT INTO temperature(weather_station,year,month,day,hour,temperature) 
VALUES ('10010:99999',2005,12,1,9,-4.9); 
INSERT INTO temperature(weather_station,year,month,day,hour,temperature) 
VALUES ('10010:99999',2005,12,1,10,-5.3);
Storage Model - Logical View 
SELECT weather_station,hour,temperature 
FROM temperature 
WHERE weather_station='10010:99999' 
AND year = 2005 AND month = 12 AND day = 1; 

weather_station | hour         | temperature 
10010:99999     | 2005:12:1:7  | -5.6 
10010:99999     | 2005:12:1:8  | -5.1 
10010:99999     | 2005:12:1:9  | -4.9 
10010:99999     | 2005:12:1:10 | -5.3
Storage Model - Disk Layout 
SELECT weather_station,hour,temperature 
FROM temperature 
WHERE weather_station='10010:99999' 
AND year = 2005 AND month = 12 AND day = 1; 

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12 
            | -5.6        | -5.1        | -4.9        | -5.3         | -4.9         | -5.4 
Merged, Sorted and Stored Sequentially
Primary key relationship 
PRIMARY KEY (weather_station,year,month,day,hour)
Primary key relationship 
PRIMARY KEY (weather_station,year,month,day,hour) 
Partition Key
Primary key relationship 
PRIMARY KEY (weather_station,year,month,day,hour) 
Partition Key Clustering Columns
Primary key relationship 
PRIMARY KEY (weather_station,year,month,day,hour) 
Partition Key Clustering Columns 
10010:99999
Primary key relationship 
PRIMARY KEY (weather_station,year,month,day,hour) 
Partition Key Clustering Columns 
2005:12:1:7 
-5.6 
10010:99999 
2005:12:1:8 2005:12:1:9 2005:12:1:10 
-5.1 -4.9 -5.3
Data Locality 
weather_station='10010:99999' ? 
1000 Node Cluster 
You are here!
Query patterns 
• Range queries 
• “Slice” operation on disk 
SELECT weather_station,hour,temperature 
FROM temperature 
WHERE weather_station='10010:99999' 
AND year = 2005 AND month = 12 AND day = 1 
AND hour >= 7 AND hour <= 10; 
Single seek on disk 

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12 
            | -5.6        | -5.1        | -4.9        | -5.3         | -4.9         | -5.4 
Partition key for locality
Query patterns 
• Range queries 
• “Slice” operation on disk 
Sorted by time (clustering columns) 
Programmers like this 
SELECT weather_station,hour,temperature 
FROM temperature 
WHERE weather_station='10010:99999' 
AND year = 2005 AND month = 12 AND day = 1 
AND hour >= 7 AND hour <= 10; 

weather_station | hour         | temperature 
10010:99999     | 2005:12:1:7  | -5.6 
10010:99999     | 2005:12:1:8  | -5.1 
10010:99999     | 2005:12:1:9  | -4.9 
10010:99999     | 2005:12:1:10 | -5.3
Apache Spark
Apache Spark 
• 10x faster on disk, 100x faster in memory than Hadoop MR 
• Works out of the box on EMR 
• Fault Tolerant Distributed Datasets 
• Batch, iterative and streaming analysis 
• In Memory Storage and Disk 
• Integrates with Most File and Storage Options 
Up to 100× faster 
(2-10× on disk) 
2-5× less code
Spark Components 
Spark Core, with libraries on top: 
• Spark SQL - structured data 
• Spark Streaming - real-time 
• MLlib - machine learning 
• GraphX - graph processing
org.apache.spark.rdd.RDD 
Resilient Distributed Dataset (RDD) 
•Created through transformations on data (map,filter..) or other RDDs 
•Immutable 
•Partitioned 
•Reusable
RDD Operations 
•Transformations - Similar to scala collections API 
•Produce new RDDs 
•filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract 
•Actions 
•Require materialization of the records to generate a value 
•collect: Array[T], count, fold, reduce..
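
To make the split above concrete: transformations only describe work, and nothing runs until an action is called. A minimal Scala sketch (not from the deck):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))
val numbers = sc.parallelize(1 to 10)     // source RDD
val evens   = numbers.filter(_ % 2 == 0)  // transformation: nothing executes yet
val doubled = evens.map(_ * 2)            // another lazy transformation
val total   = doubled.reduce(_ + _)       // action: materializes a value (60)
println(total)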
RDD Operations 
[Diagram: transformations produce new RDDs; an action materializes a result]
Collections and Files To RDD 
scala> val distData = sc.parallelize(Seq(1,2,3,4,5)) 
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e 
val distFile: RDD[String] = sc.textFile("directory/*.txt") 
val distFile = sc.textFile("hdfs://namenode:9000/path/file") 
val distFile = sc.sequenceFile("hdfs://namenode:9000/path/file")
Spark and Cassandra
Spark on Cassandra 
• Server-Side filters (where clauses) 
• Cross-table operations (JOIN, UNION, etc.) 
• Data locality-aware (speed) 
• Data transformation, aggregation, etc. 
• Natural Time Series Integration
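
The server-side filter bullet deserves a concrete sketch. Assuming the temperature table from the earlier slides lives in a keyspace named weather (the keyspace name is an assumption), and a Cassandra version that permits filtering on clustering columns, the connector pushes both the column selection and the where clause down to Cassandra:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]").setAppName("pushdown-demo"))

val readings = sc.cassandraTable("weather", "temperature")
  .select("weather_station", "hour", "temperature")          // only these columns leave Cassandra
  .where("year = ? AND month = ? AND day = ?", 2005, 12, 1)  // filter evaluated by Cassandra
println(readings.count())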
Spark Cassandra Connector 
• Loads data from Cassandra to Spark 
• Writes data from Spark to Cassandra 
• Implicit Type Conversions and Object Mapping 
• Implemented in Scala (offers a Java API) 
• Open Source 
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
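
The object-mapping bullet means a table can be read straight into a case class. A small sketch, assuming a tweets table with columns userid and body in a keyspace named demo, and a SparkContext configured as on the next slide:

import java.util.UUID
import com.datastax.spark.connector._

case class Tweet(userid: UUID, body: String)
val tweets = sc.cassandraTable[Tweet]("demo", "tweets")  // rows mapped to Tweet by column name
tweets.take(5).foreach(println)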
Spark Cassandra Connector 
[Diagram: the user application embeds the Spark-Cassandra Connector; each Spark Executor talks to the Cassandra cluster through the C* Java driver (Scala driver coming soon).] 
https://github.com/datastax/spark-cassandra-connector
Spark Cassandra Example 
val conf = new SparkConf(loadDefaults = true) 
.set("spark.cassandra.connection.host", "127.0.0.1") 
.setMaster("spark://127.0.0.1:7077") 
val sc = new SparkContext(conf) 
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets") 
val ssc = new StreamingContext(sc, Seconds(30)) 
val stream = KafkaUtils.createStream[String, String, StringDecoder, 
StringDecoder]( 
ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY) 
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount") 
ssc.start() 
ssc.awaitTermination() 
Initialization 
CassandraRDD 
Stream Initialization 
Transformations 
and Action
Weather Station Analysis 
• Weather station collects data 
• Cassandra stores in sequence 
• Spark rolls up data into new 
tables 
Windsor California 
July 1, 2014 
High: 73.4F 
Low : 51.4F
Roll-up table 
CREATE TABLE daily_aggregate_temperature ( 
wsid text, 
year int, 
month int, 
day int, 
high double, 
low double, 
PRIMARY KEY ((wsid), year, month, day) 
); 
• Weather Station Id (wsid) is unique 
• High and low temp for each day
Setup connection 
def main(args: Array[String]): Unit = { 
// the setMaster("local") lets us run & test the job right in our IDE 
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1").setMaster("local") 
// "local" here is the master, meaning we don't explicitly have a spark master set up 
val sc = new SparkContext("local", "weather", conf) 
val connector = CassandraConnector(conf) 
val cc = new CassandraSQLContext(sc) 
cc.setKeyspace("isd_weather_data")
Get data and aggregate 
// Case class to store row data 
case class daily_aggregate_temperature (wsid: String, year: Int, month: Int, day: Int, high:Double, low:Double) 
// Create SparkSQL statement 
val aggregationSql = "SELECT wsid, year, month, day, max(temperature) high, min(temperature) low " + 
"FROM raw_weather_data " + 
"WHERE month = 6 " + 
"GROUP BY wsid, year, month, day;" 
val srdd: SchemaRDD = cc.sql(aggregationSql); 
val resultSet = srdd.map(row => ( 
new daily_aggregate_temperature( 
row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3), row.getDouble(4), row.getDouble(5)))) 
.collect()
Store back into Cassandra 
connector.withSessionDo(session => { 
// Create a single prepared statement 
val prepared = session.prepare(insertStatement) 
val bound = prepared.bind 
// Iterate over result set and bind variables 
for (row <- resultSet) { 
bound.setString("wsid", row.wsid) 
bound.setInt("year", row.year) 
bound.setInt("month", row.month) 
bound.setInt("day", row.day) 
bound.setDouble("high", row.high) 
bound.setDouble("low", row.low) 
// Insert new row in database 
session.execute(bound) 
} 
})
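
The insertStatement bound above is not shown on the slide. A plausible definition, hypothetical but consistent with the daily_aggregate_temperature schema and with the by-name binding in the loop (each ? marker takes the name of its column):

val insertStatement =
  """INSERT INTO isd_weather_data.daily_aggregate_temperature
    |  (wsid, year, month, day, high, low)
    |VALUES (?, ?, ?, ?, ?, ?)""".stripMargin

With the connector, this hand-written loop could also be replaced by wrapping resultSet in an RDD and calling saveToCassandra, as sketched after the "What just happened?" slide.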
Result 
wsid | year | month | day | high | low 
--------------+------+-------+-----+------+------ 
725300:94846 | 2012 | 9 | 30 | 18.9 | 10.6 
725300:94846 | 2012 | 9 | 29 | 25.6 | 9.4 
725300:94846 | 2012 | 9 | 28 | 19.4 | 11.7 
725300:94846 | 2012 | 9 | 27 | 17.8 | 7.8 
725300:94846 | 2012 | 9 | 26 | 22.2 | 13.3 
725300:94846 | 2012 | 9 | 25 | 25 | 11.1 
725300:94846 | 2012 | 9 | 24 | 21.1 | 4.4 
725300:94846 | 2012 | 9 | 23 | 15.6 | 5 
725300:94846 | 2012 | 9 | 22 | 15 | 7.2 
725300:94846 | 2012 | 9 | 21 | 18.3 | 9.4 
725300:94846 | 2012 | 9 | 20 | 21.7 | 11.7 
725300:94846 | 2012 | 9 | 19 | 22.8 | 5.6 
725300:94846 | 2012 | 9 | 18 | 17.2 | 9.4 
725300:94846 | 2012 | 9 | 17 | 25 | 12.8 
725300:94846 | 2012 | 9 | 16 | 25 | 10.6 
725300:94846 | 2012 | 9 | 15 | 26.1 | 11.1 
725300:94846 | 2012 | 9 | 14 | 23.9 | 11.1 
725300:94846 | 2012 | 9 | 13 | 26.7 | 13.3 
725300:94846 | 2012 | 9 | 12 | 29.4 | 17.2 
725300:94846 | 2012 | 9 | 11 | 28.3 | 11.7 
725300:94846 | 2012 | 9 | 10 | 23.9 | 12.2 
725300:94846 | 2012 | 9 | 9 | 21.7 | 12.8 
725300:94846 | 2012 | 9 | 8 | 22.2 | 12.8 
725300:94846 | 2012 | 9 | 7 | 25.6 | 18.9 
725300:94846 | 2012 | 9 | 6 | 30 | 20.6 
725300:94846 | 2012 | 9 | 5 | 30 | 17.8 
725300:94846 | 2012 | 9 | 4 | 32.2 | 21.7 
725300:94846 | 2012 | 9 | 3 | 30.6 | 21.7 
725300:94846 | 2012 | 9 | 2 | 27.2 | 21.7 
725300:94846 | 2012 | 9 | 1 | 27.2 | 21.7 
SELECT wsid, year, month, day, high, low 
FROM daily_aggregate_temperature 
WHERE wsid = '725300:94846' 
AND year=2012 AND month=9 ;
What just happened? 
• Data is read from raw_weather_data table 
• Transformed 
• Inserted into the daily_aggregate_temperature table 
[Diagram: read data from the raw_weather_data table → transform → insert into the daily_aggregate_temperature table]
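
For comparison, the same read-transform-write pipeline can be written with the connector's RDD API alone, without Spark SQL. A sketch reusing the sc from the setup slide and assuming the raw_weather_data columns wsid, year, month, day and temperature:

import com.datastax.spark.connector._
import org.apache.spark.SparkContext._

case class DailyHighLow(wsid: String, year: Int, month: Int, day: Int, high: Double, low: Double)

sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("wsid", "year", "month", "day", "temperature")
  .map(r => ((r.getString("wsid"), r.getInt("year"), r.getInt("month"), r.getInt("day")),
    (r.getDouble("temperature"), r.getDouble("temperature"))))
  .reduceByKey { case ((h1, l1), (h2, l2)) => (math.max(h1, h2), math.min(l1, l2)) }
  .map { case ((wsid, y, m, d), (high, low)) => DailyHighLow(wsid, y, m, d, high, low) }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature",
    SomeColumns("wsid", "year", "month", "day", "high", "low"))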
Weather Station Stream Analysis 
• Weather station collects data 
• Data processed in stream 
• Data stored in Cassandra 
Windsor California 
Today 
Rainfall total: 1.2cm 
High: 73.4F 
Low : 51.4F
Spark Versus Spark Streaming 
Spark processes zillions of bytes in batch; Spark Streaming processes gigabytes per second as they arrive.
Spark Streaming 
[Diagram: Spark Streaming ingesting from sources such as Kinesis and S3]
DStream - Micro Batches 
• Continuous sequence of micro batches 
• More complex processing models are possible with less effort 
• Streaming computations as a series of deterministic batch 
computations on small time intervals 
DStream 
μBatch (ordinary RDD) μBatch (ordinary RDD) μBatch (ordinary RDD) 
Processing of DStream = Processing of μBatches, RDDs
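
Because each micro batch is an ordinary RDD, existing batch code can be reused on every interval. A self-contained sketch (host, port and interval are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkContext("local[2]", "dstream-demo"), Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { (rdd, time) =>
  // rdd is a plain RDD, so any batch-style operation applies
  println(s"micro batch at $time: ${rdd.count()} lines")
}
ssc.start()
ssc.awaitTermination()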
Spark Streaming Reduce Example 
val sc = new SparkContext(..) 
val ssc = new StreamingContext(sc, Seconds(5)) 
val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2) 
val transform = (cruft: String) => 
Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#")) 
/** Note that Cassandra is doing the sorting for you here. */ 
stream.flatMap(_.getText.toLowerCase.split("""\s+""")) 
.map(transform) 
.countByValueAndWindow(Seconds(5), Seconds(5)) 
.transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time))}) 
.saveToCassandra(keyspace, suspicious, SomeColumns("suspicious", "count", "timestamp")) 
Even Machine Learning!
Temperature High/Low Stream 
[Diagram: weather stations send readings to a receive API, which acts as an Apache Kafka producer; a Kafka consumer hands the readings to TemperatureActor instances.]
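
One way the producer box in this pipeline could be implemented is with the standard Kafka producer client. A hypothetical sketch (topic name, broker address and message format are assumptions, not from the deck):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// one CSV reading: wsid, year, month, day, hour, temperature
producer.send(new ProducerRecord[String, String](
  "raw_weather_data", "10010:99999", "10010:99999,2005,12,1,7,-5.6"))
producer.close()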
You can do this at home! 
https://github.com/killrweather/killrweather
Databricks & DataStax 
Apache Spark is packaged as part of DataStax 
Enterprise Analytics 4.5 
Databricks & DataStax Have Partnered for 
Apache Spark Engineering and Support 
http://www.datastax.com/
Resources 
•Spark Cassandra Connector 
https://github.com/datastax/spark-cassandra-connector 
•Apache Cassandra http://cassandra.apache.org 
•Apache Spark http://spark.apache.org 
•Apache Kafka http://kafka.apache.org 
•Akka http://akka.io 
FREE tickets to our Annual Cassandra Summit Europe taking place in London in early December (3rd 
and 4th). The 4th is a full conference day with free admission to all attendees and will feature 
presentations by companies like ING, Credit Suisse, Target, UBS, and The Noble Group, as well as other top 
Cassandra experts from around the world. 
There will be content for those entirely new to Cassandra all the way to the most seasoned Cassandra 
veteran, spanning development, architecture, and operations as well as how to integrate Cassandra with 
analytics and search technologies like Apache Spark and Apache Solr. 
December 3rd is a paid training day. If you are interested in getting a discount on paid training, please 
speak with Diego - dferreira@datastax.com
Munich Cassandra Users 
Join your local Cassandra meetup group: 
http://www.meetup.com/Munchen-Cassandra-Users/ 
© 2014 DataStax, All Rights Reserved. 
Thank you 
Follow me on twitter for more updates 
@PatrickMcFadin
