This document summarizes a presentation about integrating Apache Cassandra with Apache Spark. It introduces Christopher Batey, a technical evangelist for Apache Cassandra, and DataStax, which provides an enterprise distribution of Cassandra. It then gives overviews of Cassandra and Spark, describing their architectures and common use cases. The bulk of the document focuses on the Spark Cassandra Connector, with examples of loading Cassandra data into Spark, performing analytics and aggregations, and writing results back to Cassandra. It positions Spark as enabling slower but more flexible queries and analytics on Cassandra data.
2. @chbatey
Who am I? What is DataStax?
• Technical Evangelist for Apache Cassandra
• Cassandra-related questions: @chbatey
• DataStax
- Enterprise-ready version of Cassandra
- Spark integration
- Solr integration
- OpsCenter
- Majority of Cassandra drivers
7. @chbatey
Common use cases
•Ordered data such as time series (see the sketch after this list)
-Event stores
-Financial transactions
-Sensor data, e.g. IoT
•Non-functional requirements:
-Linear scalability
-High-throughput durable writes
-Multi-datacenter, including active-active
-Analytics without ETL
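A minimal sketch (not in the original deck) of the kind of time-series table these use cases imply. The keyspace test matches the later examples; the sensor_readings table and its columns are assumptions:

import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(conf).withSessionDo { session =>
  // Hypothetical schema: one partition per sensor, rows clustered by time,
  // so "latest readings for sensor X" is a single-partition, ordered read
  session.execute(
    """CREATE TABLE IF NOT EXISTS test.sensor_readings(
      |  sensor_id text,
      |  event_time timeuuid,
      |  reading double,
      |  PRIMARY KEY (sensor_id, event_time))
      |  WITH CLUSTERING ORDER BY (event_time DESC)""".stripMargin)
}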
9. @chbatey
Datacenter and rack aware
(Diagram: replication across Europe and USA datacenters)
• Distributed masterless database (Dynamo)
• Column family data model (Google BigTable)
• Multi data centre replication built in from the start
10. @chbatey
Cassandra
(Diagram: online operations alongside Spark analytics)
• Distributed masterless database (Dynamo)
• Column family data model (Google BigTable)
• Multi data centre replication built in from the start
• Analytics with Apache Spark
11. @chbatey
Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational performance, i.e. single-digit ms (see the sketch below)
- Single node for query
- Single disk seek per query
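To make "single node, single seek" concrete, a sketch (not from the deck): a query that supplies the partition key is routed straight to the replicas that own that partition. This assumes store_name is the partition key of the customer_events table used later in the deck:

import com.datastax.spark.connector.cql.CassandraConnector

CassandraConnector(conf).withSessionDo { session =>
  // Partition key supplied: the coordinator sends this to the nodes owning
  // the 'SportsApp' partition and reads one contiguous slice, not a scan
  session.execute(
    "SELECT * FROM test.customer_events WHERE store_name = 'SportsApp'")
}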
13. @chbatey
But but…
• Sometimes you don’t need answers in milliseconds
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Reports and Analytics
- Managers always want counts / maxes
14. @chbatey
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MapReduce
• Works out of the box on EMR
• Resilient Distributed Datasets (RDDs): fault tolerant by design
• Batch, iterative and streaming analysis
• In-memory and on-disk storage
• Integrates with most file and storage options
18. @chbatey
RDD Operations
• Transformations - Similar to Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
• Require materialization of the records to generate a value
• collect: Array[T], count, fold, reduce, … (sketch below)
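A minimal sketch (not from the deck) of the distinction: transformations are lazy and only record lineage, while an action forces the computation. The sample data is made up:

val nums = sc.parallelize(1 to 10)
// Transformation: nothing runs yet, Spark only records the lineage
val evens = nums.filter(_ % 2 == 0)
// Action: materializes the result, triggering a distributed job
val howMany = evens.count() // 5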
19. @chbatey
Word count
val file: RDD[String] = sc.textFile("hdfs://...")
val counts: RDD[(String, Int)] = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
23. @chbatey
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra tables as Spark RDDs + Spark DStreams (sketch below)
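A minimal sketch of the connector in both directions. The test.words table and its columns are assumptions; the calls themselves (cassandraTable, saveToCassandra, SomeColumns) are the connector's public API:

import com.datastax.spark.connector._

// Read: the table arrives as an RDD[CassandraRow], split by token range
val words = sc.cassandraTable("test", "words")
words.map(row => (row.getString("word"), row.getInt("count")))
  .collect()
  .foreach(println)

// Write: tuple fields map onto the named columns in order
sc.parallelize(Seq(("spark", 42), ("cassandra", 7)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))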
35. @chbatey
Now now…
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")
val rdd: SchemaRDD = cc.sql(
  "SELECT store_name, event_type, count(store_name) FROM customer_events " +
  "GROUP BY store_name, event_type")
rdd.collect().foreach(println)
[SportsApp,WATCH_STREAM,1]
[SportsApp,LOGOUT,1]
[SportsApp,LOGIN,1]
[ChrisBatey.com,WATCH_MOVIE,1]
[ChrisBatey.com,LOGOUT,1]
[ChrisBatey.com,BUY_MOVIE,1]
[SportsApp,WATCH_MOVIE,2]
37. @chbatey
Network word count
import com.datastax.driver.core.utils.UUIDs
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}

CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
}
val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
// Archive every raw line, keyed by a time-based UUID
lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")
val words = lines.flatMap(_.split("\\s+"))
val countOfOne = words.map((_, 1))
val reduced = countOfOne.reduceByKey(_ + _)
reduced.saveToCassandra("test", "network_word_count")
ssc.start()
ssc.awaitTermination()
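To try this locally, open a text source before starting the job, e.g. nc -lk 9999 in another terminal, then type words into it; the per-batch counts land in test.network_word_count.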
38. @chbatey
Summary
• Cassandra is an operational database
• Spark gives us the flexibility to do slower things
- Schema migrations
- Ad-hoc queries
- Report generation
• Spark streaming + Cassandra allow us to build online analytical platforms
39. @chbatey
Thanks for listening
• Follow me on twitter @chbatey
• Cassandra + fault-tolerance posts aplenty:
• http://christopher-batey.blogspot.co.uk/
• Github for all examples:
• https://github.com/chbatey/spark-sandbox
• Cassandra resources: http://planetcassandra.org/
• In London in April? http://www.eventbrite.com/e/cassandra-day-london-2015-april-22nd-2015-tickets-15053026006?aff=CommunityLanding