5–8. Applications by Response Time and Workload
Axes: Operational (OLTP) ↔ Analytical (OLAP)
Real-time transactions
Real-time analytics
Streaming analytics
Batch analytics
• Web and IoT apps
• Financial transactions
• Recommendations
• Fraud prevention
• Predictive models
• Fraud detection
9. Roles of Cassandra and Spark
Real-time transactions
Real-time analytics
Streaming analytics
Batch analytics
Axes: Operational (OLTP) ↔ Analytical (OLAP)
• Web and IoT apps
• Financial transactions
• Recommendations
• Fraud prevention
• Predictive models
• Fraud detection
Numerous Endpoints
Geographically Distributed
Continuously Available
Instantaneously Responsive
Immediately Decisive
Predictably Scalable
10. 1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
11. Cassandra – Operational Database
• Millions of concurrent users
• Millisecond response time
• Linear scalability
• Always on
13. Spark – Analytics Platform
• Real-time, streaming and batch analytics
• Up to 100x faster than Hadoop
• Scalability, fault-resilience
• Versatile and rich API
25. 1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
26. Getting Started with Cassandra and Spark Applications
• Data Model and Cassandra Query Language
• Core Spark and Spark-Cassandra Connector
27. Keyspace and Replication
CREATE KEYSPACE iot
WITH replication = {'class': 'NetworkTopologyStrategy',
'DC-Kyiv-Operations' : 3,
'DC-Houston-Analytics': 2};
USE iot;
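The replication map above keeps three replicas of every partition in the operations DC and two in the analytics DC. A hedged cqlsh sketch (hypothetical session) of why that pairs well with DC-local consistency:

```cql
-- With 3 replicas in DC-Kyiv-Operations, LOCAL_QUORUM waits for
-- 2 of the 3 replicas in the local DC only, so operational
-- traffic never blocks on the analytics DC.
CONSISTENCY LOCAL_QUORUM;
```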
28. Table with Single-Row Partitions
users
username  age  address
Alice     28   Santa Clara, CA
Alex      37   Austin, TX

CREATE TABLE users (
  username TEXT,
  age INT,
  address TEXT,
  PRIMARY KEY (username)
);

SELECT * FROM users WHERE username = ?;
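The `?` in the SELECT is a bind marker for a prepared statement. A minimal cqlsh sketch of loading the two rows from the slide and reading one back:

```cql
INSERT INTO users (username, age, address)
VALUES ('Alice', 28, 'Santa Clara, CA');
INSERT INTO users (username, age, address)
VALUES ('Alex', 37, 'Austin, TX');

SELECT * FROM users WHERE username = 'Alice';
```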
29. Table with Single-Row Partitions
sensors
id  type        settings                    owner
1   phone       {gps ⇒ on, pedometer ⇒ on}  Alice
2   wristband   {heart rate ⇒ on, …}        Alice
3   thermostat  {temp ⇒ 75, …}              Alice
4   security    {…}                         Alex
5   phone       {…}                         Alex

CREATE TABLE sensors (
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  owner TEXT,
  PRIMARY KEY (id)
);

SELECT * FROM sensors WHERE id = ?;
30. Table with Multi-Row Partitions
sensors_by_user (rows within each partition clustered by id ASC)
username  id  type        settings               age  address
Alice     1   phone       {gps ⇒ on, …}          28   Santa Clara, CA
Alice     2   wristband   {heart rate ⇒ on, …}   28   Santa Clara, CA
Alice     3   thermostat  {temp ⇒ 75, …}         28   Santa Clara, CA
Alex      4   security    …                      37   Austin, TX
Alex      5   phone       …                      37   Austin, TX
31. Table with Multi-Row Partitions
CREATE TABLE sensors_by_user (
username TEXT, age INT STATIC, address TEXT STATIC,
id INT, type TEXT, settings MAP<TEXT,TEXT>,
PRIMARY KEY(username, id)
) WITH CLUSTERING ORDER BY (id ASC);
SELECT * FROM sensors_by_user WHERE username = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id > ?
ORDER BY id DESC;
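age and address are declared STATIC, so each is stored once per username partition rather than once per row. A hedged cqlsh sketch of the behavior:

```cql
INSERT INTO sensors_by_user (username, age, address, id, type)
VALUES ('Alice', 28, 'Santa Clara, CA', 1, 'phone');
INSERT INTO sensors_by_user (username, id, type)
VALUES ('Alice', 2, 'wristband');

-- Both rows report age = 28 and address = 'Santa Clara, CA':
-- static columns are shared by every row of the partition.
SELECT username, id, type, age, address
FROM sensors_by_user WHERE username = 'Alice';
```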
32. Retrieving Data from C*
• SparkContext, RDD, Connector
val rdd = sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
36. • Standard Spark join = shuffle + no data locality
Cassandra-Optimized Joins
val s = sc.cassandraTable("iot","sensors")
.keyBy(row => row.getString("owner"))
val u = sc.cassandraTable("iot","users")
.keyBy(row => row.getString("username"))
s.join(u)
37. Cassandra-Optimized Joins
[Diagram: shuffle mechanics. Each map task reads one input partition holding records with keys A–D and buckets them by key in memory; shuffle writes spill to disk. Reduce tasks then pull their buckets over the network (shuffle read) and run the aggregation for output Partitions A–D.]
38. • Connector join = no shuffle + no data locality
Cassandra-Optimized Joins
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
39. Cassandra-Optimized Joins
[Diagram: joinWithCassandraTable. Each Spark partition of sensors rows (id, type, owner AS username) is joined by direct lookups into the users table: Alice's rows (1, 2, 3) match (Alice, 28, …) and Alex's rows (4, 5) match (Alex, 37, …). No shuffle, but the lookups may hit remote nodes.]
40. • Connector join + CAP = shuffle + data locality
Cassandra-Aware Partitioning
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.repartitionByCassandraReplica("iot","users")
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
41. Cassandra-Aware Partitioning
[Diagram: after repartitionByCassandraReplica, Alice's sensors rows (1, 2, 3) land on a node that replicates Alice's users partition (Alice, 28, …) and Alex's rows (4, 5) on a replica of (Alex, 37, …), so each joinWithCassandraTable lookup is node-local.]
43. • Shuffling eliminated at no extra cost
Shuffle-Free Grouping
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.as((u:String,i:Int,t:String)=>(u,(i,t)))
.spanByKey  // instead of .groupByKey: rows of a Cassandra partition arrive contiguously, so no shuffle is needed
44. Saving Data to C*
rdd.saveToCassandra("iot","users",
SomeColumns("username", "age"))
Connector write options (defaults in parentheses):
output.consistency.level (LOCAL_QUORUM)
output.batch.grouping.key (Partition)
output.batch.size.bytes (1024)
output.batch.grouping.buffer.size (1000)
output.concurrent.writes (5)
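These knobs are normally set on the SparkConf with a spark.cassandra. prefix; a hedged Scala sketch (the connection host is a placeholder, and the values mirror the defaults above):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: property names assume the spark-cassandra-connector
// convention of a "spark.cassandra." prefix on each option.
val conf = new SparkConf()
  .setAppName("iot-writer")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder
  .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
  .set("spark.cassandra.output.batch.grouping.key", "partition")
  .set("spark.cassandra.output.batch.size.bytes", "1024")
  .set("spark.cassandra.output.batch.grouping.buffer.size", "1000")
  .set("spark.cassandra.output.concurrent.writes", "5")
val sc = new SparkContext(conf)
```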
45. 1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
• Apple-scale
• Transactions per second
• 1000-node clusters
• Multiple data centers
• Personalization
• Product catalogs
• Fraud detection
• Internet of Things
• Messaging
• In-memory
This is a quick review of what you already know about Cassandra:
• Peer-to-peer architecture
• Failure tolerance/availability
• Cassandra token ring
• Data structures (table)
• Data distribution (partition key)
• Data replication (replication factor)
• Data consistency (consistency level)
Master-worker architecture: the Master (aka cluster manager) manages Workers and their resources; Workers instantiate Executors and give them resources (cores and memory); the Driver schedules computation directly with Executors.
Failure tolerance:
• Worker/Executor failure: no problem; computation is picked up by the remaining Workers/Executors.
• Master failure: a new Master is elected (a DSE feature); running applications are not affected; new applications are affected only temporarily until the new Master is started.
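The master-worker wiring described above is visible in how an application starts: the driver's SparkConf names the Master and the per-Executor resources. A hedged Scala sketch (host name and resource sizes are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the driver registers with the Master; the Master asks
// Workers to launch Executors with these resources; the driver
// then schedules tasks on those Executors directly.
val conf = new SparkConf()
  .setAppName("demo")
  .setMaster("spark://master-host:7077") // placeholder Master URL
  .set("spark.executor.cores", "2")      // cores per Executor
  .set("spark.executor.memory", "2g")    // memory per Executor
val sc = new SparkContext(conf)
```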