- The document discusses Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-DC/cross-cluster operations are covered; data locality is highlighted as the key performance factor.
- Use cases include data cleaning, schema migration, and analytics such as joins and aggregations. The connector enables processing and analytics on Cassandra data with Spark.
2. @doanduyhai
Who Am I?!
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
2
3. @doanduyhai
Datastax!
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
3
7. @doanduyhai
Connector architecture – Core API!
Cassandra tables exposed as Spark RDDs
Read from and write to Cassandra
Mapping of Cassandra tables and rows to Scala objects
• CassandraRow
• Scala case class (object mapper)
• Scala tuples
7
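A minimal sketch of the three mapping options above; the music.performers table and the Performer case class are illustrative assumptions, not from the deck:

import com.datastax.spark.connector._

// 1) Generic CassandraRow access
val rows = sc.cassandraTable("music", "performers")
println(rows.first.getString("name"))

// 2) Scala case class via the object mapper (hypothetical column layout)
case class Performer(name: String, country: String)
val performers = sc.cassandraTable[Performer]("music", "performers")

// 3) Scala tuples, selecting only the needed columns
val pairs = sc.cassandraTable[(String, String)]("music", "performers")
  .select("name", "country")

// Writing back to Cassandra is symmetric
performers.saveToCassandra("music", "performers")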
9. @doanduyhai
Connector architecture – Spark SQL!
Mapping of Cassandra table to SchemaRDD
• CassandraSQLRow → SparkRow
• custom query plan
• push predicates to CQL for early filtering
SELECT * FROM user_emails WHERE login = 'jdoe';
9
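A sketch of the same query through the connector's Spark SQL integration, assuming the 1.x-era CassandraSQLContext API and an illustrative keyspace name; the predicate on login is pushed down to CQL:

import org.apache.spark.sql.cassandra.CassandraSQLContext

val csc = new CassandraSQLContext(sc)   // connector 1.x-era API
csc.setKeyspace("ks")                   // assumed keyspace holding user_emails

// the WHERE clause on "login" is pushed to CQL for early filtering
val emails = csc.sql("SELECT * FROM user_emails WHERE login = 'jdoe'")
emails.collect().foreach(println)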
11. @doanduyhai
Connector architecture – Spark Streaming!
Streaming data INTO a Cassandra table
• trivial setup
• be careful about your Cassandra data model when ingesting an infinite stream!
Streaming data OUT of Cassandra tables (CDC)?
• work in progress …
11
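A minimal sketch of streaming into Cassandra; the socket source, keyspace, table, and columns are illustrative assumptions:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))

ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(cols => (cols(0), cols(1)))
  // each micro-batch is written to the (assumed) events table
  .saveToCassandra("ks", "events", SomeColumns("id", "payload"))

ssc.start()
ssc.awaitTermination()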
21. @doanduyhai
Write to Cassandra without data locality!
Async batches fan out writes to Cassandra
Because of the shuffle, the original data locality is lost
21
23. @doanduyhai
Write data locality!
23
• either stream data in the Spark layer using repartitionByCassandraReplica()
• or flush data to Cassandra with async batches
• in any case, there will be data movement over the network (sorry, no magic)
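A sketch of the first option; the music.artists table (partition key "name") and the Artist case class are assumptions for illustration:

import com.datastax.spark.connector._

case class Artist(name: String, style: String)

val incoming = sc.parallelize(Seq(Artist("U2", "rock"), Artist("Adele", "pop")))

incoming
  // move each row to a Spark partition hosted on one of its Cassandra replicas
  .repartitionByCassandraReplica("music", "artists")
  // writes then stay mostly node-local
  .saveToCassandra("music", "artists")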
24. @doanduyhai
Joins with data locality!
24
CREATE TABLE artists(name text, style text, … PRIMARY KEY(name));
CREATE TABLE albums(title text, artist text, year int,… PRIMARY KEY(title));
val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing
    .select("artist", "year")
    .as((_: String, _: Int))
    // Repartition the RDD by the "artists" partition key, which is "name"
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    // Join with the "artists" table, selecting only the "name" and "country" columns
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
32. @doanduyhai
Perfect data locality scenario!
32
• read locally from Cassandra
• use operations that do not require a shuffle in Spark (map, filter, …)
• repartitionByCassandraReplica()
• → to a table having the same partition key as the original table
• save back into this Cassandra table
Sanitize, validate, normalize, transform data
USE CASE
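A sketch of such a shuffle-free cleaning pipeline; the ks.users table and the cleaning rules are illustrative assumptions:

import com.datastax.spark.connector._

case class User(login: String, email: String)   // assumed table: users(login PK, email)

sc.cassandraTable[User]("ks", "users")           // read locally from Cassandra
  .filter(_.email != null)                       // shuffle-free narrow operations only
  .map(u => u.copy(email = u.email.trim.toLowerCase))
  .saveToCassandra("ks", "users")                // save back into the same table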
35. @doanduyhai
Failure handling!
What if 1 node is down?
What if 1 node is overloaded?
☞ the Spark master will re-assign the job to another worker
35
[Diagram: Spark Master (SparkM) and Spark Workers (SparkW) co-located with Cassandra (C*) nodes]
40. @doanduyhai
Failure handling!
If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎
Tune parameters:
① spark.locality.wait
② spark.locality.wait.process
③ spark.locality.wait.node
40
[Diagram: Spark Master (SparkM) and Spark Workers (SparkW) co-located with Cassandra (C*) nodes]
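A sketch of tuning the three spark.locality.* parameters above; the values (in milliseconds) are illustrative, not recommendations from the deck:

import org.apache.spark.SparkConf

val conf = new SparkConf(true)
  .setAppName("locality_tuning")
  // how long a task waits for a data-local slot before being scheduled on a less local one
  .set("spark.locality.wait", "3000")
  .set("spark.locality.wait.process", "2000")
  .set("spark.locality.wait.node", "3000")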
41. @doanduyhai
Cross-DC operations!

val confDC1 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "DC_1_hostnames")
  .set("spark.cassandra.connection.local_dc", "DC_1")

val confDC2 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "DC_2_hostnames")
  .set("spark.cassandra.connection.local_dc", "DC_2")

val sc = new SparkContext(confDC1)

// read from DC_1, write to DC_2 by passing an explicit connector to saveToCassandra
sc.cassandraTable[Performer](KEYSPACE, PERFORMERS)
  .map[Performer](???)
  .saveToCassandra(KEYSPACE, PERFORMERS)(
    CassandraConnector(confDC2), implicitly[RowWriterFactory[Performer]])
41
42. @doanduyhai
Cross-cluster operations!

val confCluster1 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "cluster_1_hostnames")

val confCluster2 = new SparkConf(true)
  .setAppName("data_migration")
  .setMaster("master_ip")
  .set("spark.cassandra.connection.host", "cluster_2_hostnames")

val sc = new SparkContext(confCluster1)

// read from cluster 1, write to cluster 2 by passing an explicit connector to saveToCassandra
sc.cassandraTable[Performer](KEYSPACE, PERFORMERS)
  .map[Performer](???)
  .saveToCassandra(KEYSPACE, PERFORMERS)(
    CassandraConnector(confCluster2), implicitly[RowWriterFactory[Performer]])
42
45. @doanduyhai
Use Cases!
Load data from various sources
Analytics (join, aggregate, transform, …)
Sanitize, validate, normalize, transform data
Schema migration, data conversion
45
46. @doanduyhai
Data cleaning use-cases!
46
Bug in your application?
Dirty input data?
☞ Spark job to clean it up! (perfect data locality)
Sanitize, validate, normalize, transform data
50. @doanduyhai
Analytics use-cases!
50
Given existing tables of performers and albums, I want:
① top 10 most common music styles (pop, rock, RnB, …)?
② performer productivity (album count) by origin country and by decade?
☞ Spark job to compute analytics !
Analytics (join, aggregate, transform, …)
51. @doanduyhai
Analytics pipeline!
51
① Read from production transactional tables
② Perform aggregation with Spark
③ Save back data into dedicated tables for fast visualization
④ Repeat step ①
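A sketch of one iteration of this pipeline, answering question ① from the previous slide; the music.artists source table and the top_styles target table are illustrative assumptions:

import com.datastax.spark.connector._

// ① read from the production table (assumed schema: artists(name, style, ...))
val topStyles = sc.cassandraTable[(String, String)]("music", "artists")
  .select("name", "style")
  // ② aggregate with Spark: count artists per style, keep the 10 most common
  .map { case (_, style) => (style, 1) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)

// ③ save into a dedicated table for fast visualization (assumed: top_styles(style, count))
sc.parallelize(topStyles)
  .saveToCassandra("music", "top_styles", SomeColumns("style", "count"))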
55. @doanduyhai
Our vision!
55
We had a dream …
to provide a Big Data Platform …
built for the performance & high availability demands
of IoT, web & mobile applications
56. @doanduyhai
Until now in Datastax Enterprise!
Cassandra + Solr in same JVM
Unlock full text search power for Cassandra
CQL syntax extension
56
SELECT * FROM users WHERE solr_query = 'age:[33 TO *] AND gender:male';
SELECT * FROM users WHERE solr_query = 'lastname:*schwei?er';
61. @doanduyhai
The idea!
① Filter as much as possible with Cassandra and Solr
② Fetch only a small data set into memory
③ Aggregate with Spark
☞ near-real-time interactive analytics queries are possible with restrictive criteria
61
66. @doanduyhai
Stand-alone search cluster caveat!
The ops won’t be your friend
• 3 clusters to manage: Spark, Cassandra & Solr/whatever
• lots of moving parts
Impacts on Spark jobs
• increased response time due to network latency between clusters
• the 99.9th percentile latency can be very high
66
68. @doanduyhai
Datastax Enterprise 4.7!
① compute Spark partitions using Cassandra token ranges
② on each partition, use Solr for local data filtering (no distributed query!)
③ fetch data back into Spark for aggregations

val join: CassandraJoinRDD[(String,Int), (String,String)] =
  sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
    // Select only useful columns for join and processing,
    // filtering locally with Solr via the solr_query predicate
    .select("artist", "year")
    .where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
    .as((_: String, _: Int))
    .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
    .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
    .on(SomeColumns("name"))
    .where("solr_query = 'age:[20 TO 30]'")
68
69. @doanduyhai
Datastax Enterprise 4.7!
69

SELECT … FROM …
WHERE token(#partition) > 3X/8
AND token(#partition) <= 4X/8
AND solr_query = 'full text search expression';

Advantages of the same-JVM Cassandra + Solr integration:
① Token range: ]3X/8, 4X/8]
② Single-pass local full text search (no fan out)
③ Data retrieval