Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (and much more) in memory. Apache Cassandra is a highly available and massively scalable NoSQL datastore. By combining Spark's flexible API with Cassandra's performance, we get an interesting alternative to the Hadoop ecosystem for both real-time and batch processing. During this talk we will highlight the tight integration between Spark and Cassandra and demonstrate some usages with a live code demo.
2. @doanduyhai
Who Am I ?!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
3. @doanduyhai
Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquartered in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
5. @doanduyhai
What is Apache Spark ?!
Created at UC Berkeley (AMPLab)
Open-sourced in 2010, Apache project since 2013
General data processing framework
MapReduce is not the alpha & omega
One-framework-many-components approach
6. @doanduyhai
Spark characteristics!
Fast
• 10x-100x faster than Hadoop MapReduce
• In-memory storage
• Single JVM process per node, multi-threaded
Easy
• Rich Scala, Java and Python APIs (R is coming …)
• 2x-5x less code
• Interactive shell
12. @doanduyhai
What is Apache Cassandra?!
Created at Facebook
Apache Project since 2009
Distributed NoSQL database
Eventual consistency (A & P of the CAP theorem)
Distributed table abstraction
13. @doanduyhai
Cassandra data distribution reminder!
Random distribution: hash of the partition key → token = hash(#partition)
Hash range: ]-X, X]
X = huge number (2^64 / 2 = 2^63)
[Diagram: ring of 8 nodes n1 … n8 sharing the token range]
14. @doanduyhai
Cassandra token ranges!
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
Murmur3 hash function
[Diagram: ring of 8 nodes n1 … n8, each owning one token range A … H]
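As a rough illustration of how a token lands in one of the ranges above, here is a minimal Scala sketch (not connector code). The simplified ring ]0, X] and the range width X/8 come from the slide; the object and method names are assumptions for illustration only.

import scala.math.BigInt

object TokenRanges {
  val X: BigInt = BigInt(2).pow(64) / 2        // X = 2^64 / 2 = 2^63
  val width: BigInt = X / 8                    // each of the 8 ranges covers X/8 tokens

  // Ranges are left-open, right-closed: ](i-1)*X/8, i*X/8] maps to the i-th letter
  def rangeFor(token: BigInt): Char = {
    require(token > 0 && token <= X, "token outside the simplified ring ]0, X]")
    val index = ((token - 1) / width).toInt    // 0 .. 7
    ('A' + index).toChar
  }
}

For example, TokenRanges.rangeFor(TokenRanges.X / 2) returns 'D', matching the range ]3X/8, 4X/8] used again at the end of the talk.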
17. @doanduyhai
Cassandra Query Language (CQL)!
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);
UPDATE users SET age = 34 WHERE login = 'jdoe';
DELETE age FROM users WHERE login = 'jdoe';
SELECT age FROM users WHERE login = 'jdoe';
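From a Spark job, the same CQL statements can be run through the connector's managed session pool. A minimal sketch, assuming a SparkContext `sc` and a keyspace named `ks` (both assumptions, not from the slides):

import com.datastax.spark.connector.cql.CassandraConnector

// Execute CQL through the connector's session pool (keyspace "ks" is assumed)
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("INSERT INTO ks.users(login, name, age) VALUES('jdoe', 'John DOE', 33)")
  val row = session.execute("SELECT age FROM ks.users WHERE login = 'jdoe'").one()
  println(row.getInt("age"))
}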
18. @doanduyhai
Why Spark on Cassandra ?!
Reliable persistent store (HA)
Structured data (Cassandra CQL → DataFrame API)
Multi data-center !!!
For Spark
19. @doanduyhai
Why Spark on Cassandra ?!
Reliable persistent store (HA)
Structured data (Cassandra CQL → DataFrame API)
Multi data-center !!!
Cross-table operations (JOIN, UNION, etc.)
Real-time/batch processing
Complex analytics (e.g. machine learning)
For Spark
For Cassandra
20. @doanduyhai
Use Cases!
Load data from various sources
Analytics (join, aggregate, transform, …)
Sanitize, validate, normalize data
Schema migration, data conversion
25. @doanduyhai
Connector architecture – Core API!
Cassandra tables exposed as Spark RDDs
Read from and write to Cassandra
Mapping of C* tables and rows to Scala objects
• CassandraRow
• Scala case class (object mapper)
• Scala tuples
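A short hedged sketch of the three mappings listed above; the keyspace `ks`, the `users` table and the `User` case class are illustrative assumptions, not from the talk:

import com.datastax.spark.connector._

case class User(login: String, name: String, age: Int)

// 1. Generic CassandraRow
val rows = sc.cassandraTable("ks", "users")
println(rows.first().getString("login"))

// 2. Object mapper onto a Scala case class
val users = sc.cassandraTable[User]("ks", "users")

// 3. Scala tuples (selected columns must match the tuple)
val pairs = sc.cassandraTable[(String, Int)]("ks", "users").select("login", "age")

// Writing back: any of these RDDs can be saved with saveToCassandra
users.map(u => u.copy(age = u.age + 1)).saveToCassandra("ks", "users")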
26. @doanduyhai
Connector architecture – Spark SQL !
Mapping of Cassandra table to SchemaRDD
• CassandraSQLRow → Spark Row
• custom query plan
• push predicates to CQL for early filtering
SELECT * FROM user_emails WHERE login = 'jdoe';
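A hedged sketch of running the query above through the connector's Spark SQL integration, using the CassandraSQLContext API of the connector versions from this era; the keyspace `ks` is an assumption and details vary between connector versions:

import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("ks")

// The WHERE predicate is pushed down to CQL for early filtering on the Cassandra side
val emails = cc.sql("SELECT * FROM user_emails WHERE login = 'jdoe'")
emails.collect().foreach(println)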
27. @doanduyhai
Connector architecture – Spark Streaming !
Streaming data INTO Cassandra table
• trivial setup
• be careful about your Cassandra data model !!!
Streaming data OUT of Cassandra tables ?
• work in progress …
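A minimal hedged sketch of streaming data INTO a Cassandra table; the socket source, the keyspace `ks` and the `events` table are illustrative assumptions:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))

// Each micro-batch is written to Cassandra with the same saveToCassandra call as for RDDs
ssc.socketTextStream("localhost", 9999)
   .map(line => (java.util.UUID.randomUUID().toString, line))
   .saveToCassandra("ks", "events", SomeColumns("id", "payload"))

ssc.start()
ssc.awaitTermination()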
38. @doanduyhai
Or async batch writes!
Async batches fan out writes to Cassandra
Spark shuffle operations
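The async batch behaviour is driven by the connector's output settings; a hedged sketch of a few of them follows (the exact property set depends on the connector version, and the values shown are arbitrary):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.batch.size.rows", "64")    // rows grouped into one unlogged batch
  .set("spark.cassandra.output.concurrent.writes", "5")   // async batches in flight per task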
39. @doanduyhai
Write data locality!
• either stream data with Spark using repartitionByCassandraReplica() (see the sketch below)
• or flush data to Cassandra with async batches
• in any case, there will be data movement over the network (sorry, no magic)
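A hedged sketch of the first option; the keyspace `ks`, the `events` table and the `Event` case class are assumptions for illustration:

import com.datastax.spark.connector._

case class Event(id: String, payload: String)

val events = sc.parallelize(Seq(Event("a", "hello"), Event("b", "world")))

events
  .repartitionByCassandraReplica("ks", "events")  // move each row to a Spark partition on one of its replicas
  .saveToCassandra("ks", "events")                // the coordinator of each write is then a replica of the row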
40. @doanduyhai
Joins with data locality!
CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));
CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    // Repartition RDDs by "artists" PK, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with "artists" table, selecting only "name" and "country" columns
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
41. @doanduyhai
Joins pipeline with data locality!
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .map(…)
    .filter(…)
    .groupByKey()
    .mapValues(…)
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
    .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
…
42. @doanduyhai
Perfect data locality scenario!
• read locally from Cassandra
• use operations that do not require a shuffle in Spark (map, filter, …)
• repartitionByCassandraReplica()
• → to a table having the same partition key as the original table
• save back into this Cassandra table (see the sketch below)
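A hedged end-to-end sketch of this scenario; the keyspace `ks` and the tables `events` / `events_cleaned` (sharing the same partition key `id`) are illustrative assumptions:

import com.datastax.spark.connector._

sc.cassandraTable[(String, String)]("ks", "events")        // read locally from Cassandra
  .select("id", "payload")
  .filter { case (_, payload) => payload.nonEmpty }        // map/filter: no Spark shuffle
  .map { case (id, payload) => (id, payload.trim) }
  .repartitionByCassandraReplica("ks", "events_cleaned")   // same partition key → rows stay on their replicas
  .saveToCassandra("ks", "events_cleaned", SomeColumns("id", "payload"))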
44. @doanduyhai
What's in the future ?!
Datastax Enterprise 4.7
• Cassandra + Spark + Solr as your analytics platform
Filter out as much data as possible with Solr, directly inside Cassandra
Fetch the filtered data into Spark and perform aggregations
Save the final data back into Cassandra
46. @doanduyhai
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year").where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name")).where("solr_query = 'age:[20 TO 30]'")
What's in the future ?!
1. compute Spark partitions using Cassandra token ranges
2. on each partition, use Solr for local data filtering (no fan out !)
3. fetch data back into Spark for aggregations
4. repeat 1 – 3 as many times as necessary
47. @doanduyhai
What's in the future ?!
Example query for one Spark partition, covering token range D: ]3X/8, 4X/8]:

SELECT … FROM …
WHERE token(#partition) > 3X/8
AND token(#partition) <= 4X/8
AND solr_query = 'full text search expression';

[Diagram callouts: advantages of the same-JVM Cassandra + Solr integration — single-pass local full text search (no fan out); data retrieval]