Big Data-Driven Applications
with Cassandra and Spark
Artem Chebotko, Ph.D.
Solution Architect
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
Modern Application Requirements
• Numerous Endpoints
• Geographically Distributed
• Continuously Available
• Instantaneously Responsive
• Immediately Decisive
• Predictably Scalable
Applications by Response Time and Workload
Workload axis: Operational (OLTP) to Analytical (OLAP)
Real-time transactions
• Web and IoT apps
• Financial transactions
Real-time analytics
• Recommendations
Streaming analytics
• Fraud prevention
Batch analytics
• Predictive models
• Fraud detection
Roles of Cassandra and Spark
[Diagram: the four workload types (real-time transactions, real-time analytics, streaming analytics, batch analytics) on the Operational (OLTP) to Analytical (OLAP) axis, set against the application requirements]
• Numerous Endpoints
• Geographically Distributed
• Continuously Available
• Instantaneously Responsive
• Immediately Decisive
• Predictably Scalable
2 Cassandra and Spark Highlights
Cassandra – Operational Database
• Millions of concurrent users
• Millisecond response time
• Linear scalability
• Always on
Spark – Analytics Platform
• Real-time, streaming and batch analytics
• Up to 100x faster than Hadoop MapReduce for in-memory workloads
• Scalability, fault-resilience
• Versatile and rich API
Spark stack:
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Cluster managers: Standalone, YARN, Mesos
Spark-Cassandra Connector
Open-Source Package for Spark
• Routine Spark-Cassandra interactions
– Read from and write into Cassandra
• Profound optimizations
– Predicate pushdown
– Data locality
– Cassandra-optimized joins
– Cassandra-aware partitioning
– Shuffle-free grouping
3 Architecture Overview
C*: Distributed, Shared Nothing, Peer-to-Peer
[Diagram: C* nodes arranged in a token ring covering the range -2^63 to +2^63-1; C* clients connect through drivers and send transactions to the nodes]
C*: Partitioning and Replication
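[Diagram: write path for a table with RF=3 at CL=QUORUM. The client sends a write request to the coordinator; the partitioner maps the partition key to a partition; the write is sent to replicas 1, 2 and 3; the coordinator returns the acknowledgment once a quorum (2 of 3) of replicas has responded.]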
C*: Partitioning and Replication
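[Diagram: read path for a table with RF=3 at CL=ONE. The client sends a read request to the coordinator; the partitioner maps the partition key to a partition held by replicas 1, 2 and 3; the coordinator returns the result as soon as one replica has responded.]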
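The partitioning and replication flow on the two slides above can be sketched in plain Scala. This is a toy model under stated assumptions, not Cassandra's implementation: the three-node ring and its tokens are hypothetical, `token` uses `String.hashCode` as a stand-in for the real partitioner (Murmur3 by default), and `replicas` simply walks the ring clockwise, ignoring racks, data centers and virtual nodes.

```scala
object RingSketch {
  // Hypothetical ring: each node owns the tokens up to and including its own token.
  val ring: Vector[(Long, String)] = Vector(
    (-3074457345618258603L, "node1"),
    (3074457345618258602L, "node2"),
    (Long.MaxValue, "node3")
  )

  // Stand-in partitioner (assumption): spreads String.hashCode over the 64-bit token range.
  def token(partitionKey: String): Long = partitionKey.hashCode.toLong << 32

  // Start at the first node whose token is >= the key's token,
  // then take up to `rf` nodes clockwise as replicas.
  def replicas(partitionKey: String, rf: Int): Vector[String] = {
    val t = token(partitionKey)
    val start = ring.indexWhere { case (nodeToken, _) => nodeToken >= t }
    Vector.tabulate(math.min(rf, ring.size))(i => ring((start + i) % ring.size)._2)
  }

  // Replica responses the coordinator waits for at CL=QUORUM.
  def quorum(rf: Int): Int = rf / 2 + 1
}
```

With RF=3, `RingSketch.quorum(3)` is 2, so a QUORUM write is acknowledged after two of the three replicas respond, while a CL=ONE read needs only one.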
Spark: Master-Worker, Failover Masters
[Diagram: Spark clients, each running a driver with a SparkContext, connect to the Master; the Master manages the Workers, and each Worker hosts Executors]
Spark: Computation Scheduling
[Diagram: the driver's SparkContext turns each job into a DAG of stages (Job 0: Stages 0 to 2; Job 1: Stages 3 to 5); each stage consists of tasks, which run on the Executors; each Executor also holds a cache]
Spark-Cassandra Connector
[Diagram: three cluster nodes, each co-locating a C* node with a Spark Worker, its Executors and the Spark-Cassandra Connector; one node also hosts the Spark Master]
Cluster node layout:
• Spark node: Master JVM, Worker JVM, Executor JVMs, Connector.jar
• C* node: C* JVM
Multi-DC Deployment and Workload Separation
[Diagram: two data centers linked by data replication]
• DC Operations: C* nodes serving real-time transactions from C* client drivers
• DC Analytics: C* nodes co-located with the Spark Master, Workers and Executors, serving interactive and batch analytics from Spark clients (driver with SparkContext)
4 Languages and APIs
Getting Started with Cassandra and Spark Applications
• Data Model and Cassandra Query Language
• Core Spark and Spark-Cassandra Connector
Keyspace and Replication
CREATE KEYSPACE iot
WITH replication = {'class': 'NetworkTopologyStrategy',
'DC-Kyiv-Operations' : 3,
'DC-Houston-Analytics': 2};
USE iot;
Table with Single-Row Partitions
users
username | age | address
Alice    | 28  | Santa Clara, CA
Alex     | 37  | Austin, TX

CREATE TABLE users (
  username TEXT,
  age INT,
  address TEXT,
  PRIMARY KEY(username)
);
SELECT * FROM users
WHERE username = ?;
Table with Single-Row Partitions
sensors
id | type       | settings                   | owner
1  | phone      | {gps ⇒ on, pedometer ⇒ on} | Alice
2  | wristband  | {heart rate ⇒ on, …}       | Alice
3  | thermostat | {temp ⇒ 75, …}             | Alice
4  | security   | {…}                        | Alex
5  | phone      | {…}                        | Alex

CREATE TABLE sensors (
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  owner TEXT,
  PRIMARY KEY(id)
);
SELECT * FROM sensors
WHERE id = ?;
Table with Multi-Row Partitions
sensors_by_user (rows within each partition are ordered by id ASC)
username | id | type       | settings             | age | address
Alice    | 1  | phone      | {gps ⇒ on, …}        | 28  | Santa Clara, CA
Alice    | 2  | wristband  | {heart rate ⇒ on, …} | 28  | Santa Clara, CA
Alice    | 3  | thermostat | {temp ⇒ 75, …}       | 28  | Santa Clara, CA
Alex     | 4  | security   | …                    | 37  | Austin, TX
Alex     | 5  | phone      | …                    | 37  | Austin, TX

CREATE TABLE sensors_by_user (
  username TEXT, age INT STATIC, address TEXT STATIC,
  id INT, type TEXT, settings MAP<TEXT,TEXT>,
  PRIMARY KEY(username, id)
) WITH CLUSTERING ORDER BY (id ASC);
SELECT * FROM sensors_by_user WHERE username = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id > ?
  ORDER BY id DESC;
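One way to picture a multi-row partition is as a map from the partition key to a record that holds the static columns once plus a sorted map of rows keyed by the clustering column. The sketch below is an in-memory analogy only, with hypothetical names (`PartitionSketch`, `UserPartition`, `idsGreaterThanDesc`); it mirrors the sample data above and the third query, and says nothing about Cassandra's storage format.

```scala
import scala.collection.immutable.TreeMap

object PartitionSketch {
  final case class Sensor(id: Int, sensorType: String)

  // Static columns (age, address) are kept once per partition;
  // rows are sorted by the clustering column id.
  final case class UserPartition(age: Int, address: String, rows: TreeMap[Int, Sensor])

  val sensorsByUser: Map[String, UserPartition] = Map(
    "Alice" -> UserPartition(28, "Santa Clara, CA", TreeMap(
      1 -> Sensor(1, "phone"), 2 -> Sensor(2, "wristband"), 3 -> Sensor(3, "thermostat"))),
    "Alex" -> UserPartition(37, "Austin, TX", TreeMap(
      4 -> Sensor(4, "security"), 5 -> Sensor(5, "phone")))
  )

  // Analogue of: WHERE username = ? AND id > ? ORDER BY id DESC
  def idsGreaterThanDesc(username: String, minId: Int): List[Int] =
    sensorsByUser.get(username).toList
      .flatMap(_.rows.iteratorFrom(minId + 1).map(_._1).toList.reverse)
}
```

The range condition on `id` is cheap here for the same reason it is allowed in CQL: once the partition is located, its rows are already sorted by the clustering column.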
Retrieving Data from C*
• SparkContext, RDD, Connector
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext
val rdd = sc.cassandraTable("iot","sensors_by_user")
  .select("username","id","type")
Predicate Pushdown
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.filter(row => row.getString("username") == "Alice")
• Suboptimal code
Predicate Pushdown
sc.cassandraTable("iot","sensors_by_user")
  .select("username","id","type")
  .where("username = 'Alice'")
• Predicate pushed down to C*
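The difference between the two versions is where the predicate is evaluated. The toy model below, which uses an in-memory `Map` as a stand-in for the Cassandra table and hypothetical function names, counts how many rows have to be fetched in each case; it illustrates the idea and does not use the connector.

```scala
object PushdownSketch {
  final case class Row(username: String, id: Int, sensorType: String)

  // Toy stand-in for the sensors_by_user table, keyed by partition key.
  val table: Map[String, Vector[Row]] = Map(
    "Alice" -> Vector(Row("Alice", 1, "phone"), Row("Alice", 2, "wristband"),
                      Row("Alice", 3, "thermostat")),
    "Alex"  -> Vector(Row("Alex", 4, "security"), Row("Alex", 5, "phone"))
  )

  // filter-style: every row is fetched, then non-matching rows are discarded.
  // Returns the matching rows and the number of rows fetched.
  def filterOnClient(username: String): (Vector[Row], Int) = {
    val fetched = table.values.flatten.toVector
    (fetched.filter(_.username == username), fetched.size)
  }

  // where-style: the predicate selects the partition, so only its rows are fetched.
  def filterInDatabase(username: String): (Vector[Row], Int) = {
    val fetched = table.getOrElse(username, Vector.empty)
    (fetched, fetched.size)
  }
}
```

Both functions return the same rows for "Alice", but the first fetches all five rows and the second only three.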
Data Locality
[Diagram: a node running both Cassandra and Spark]
Connector read settings (defaults):
• input.split.size_in_mb (64)
• input.consistency.level (LOCAL_ONE)
• input.fetch.size_in_rows (1000)
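The three read settings could be passed to a Spark job as configuration properties. The sketch below only builds the argument list in plain Scala. It assumes the full property names carry a `spark.cassandra.` prefix, as in connector releases of that period; newer releases may spell some of these names differently, so check them against the version in use. `InputSettingsSketch` and `asSubmitArgs` are hypothetical names.

```scala
object InputSettingsSketch {
  // Values are the defaults shown on the slide; the spark.cassandra. prefix is an assumption.
  val inputDefaults: Seq[(String, String)] = Seq(
    "spark.cassandra.input.split.size_in_mb"   -> "64",        // approx. data per Spark partition
    "spark.cassandra.input.consistency.level"  -> "LOCAL_ONE", // consistency level for reads
    "spark.cassandra.input.fetch.size_in_rows" -> "1000"       // rows fetched per round trip
  )

  // Render the settings as "--conf key=value" arguments for spark-submit.
  def asSubmitArgs(settings: Seq[(String, String)]): Seq[String] =
    settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```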
Cassandra-Optimized Joins
• Standard Spark join = shuffle + shuffle
val s = sc.cassandraTable("iot","sensors")
.keyBy(row => row.getString("owner"))
val u = sc.cassandraTable("iot","users")
.keyBy(row => row.getString("username"))
s.join(u)
Cassandra-Optimized Joins
• Shuffle
[Diagram: each map task splits its input partition (Partition 1, 2, 3) into buckets A, B, C and D in memory and writes them to disk (shuffle write); each reduce task reads its bucket from every map output (shuffle read) and aggregates it in memory into one output partition (Partition A, B, C, D)]
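The shuffle in the diagram can be imitated with ordinary collections. This is a simplified model with hypothetical names (`shuffleWrite`, `shuffleRead`): it hashes keys into buckets the way a hash partitioner would, but leaves out disk spills, serialization and network transfer, which are what make a real shuffle expensive.

```scala
object ShuffleSketch {
  // Map side: split one input partition into one bucket per reduce partition.
  def shuffleWrite[K, V](partition: Seq[(K, V)], reducers: Int): Map[Int, Seq[(K, V)]] =
    partition.groupBy { case (k, _) => Math.floorMod(k.hashCode, reducers) }

  // Reduce side: reducer r pulls bucket r from every map output and groups values by key.
  def shuffleRead[K, V](mapOutputs: Seq[Map[Int, Seq[(K, V)]]], reducer: Int): Map[K, Seq[V]] =
    mapOutputs
      .flatMap(_.getOrElse(reducer, Seq.empty))
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2) }
}
```

Every record crosses from a map output to a reduce input, and a standard join does this for both of its inputs.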
Cassandra-Optimized Joins
• Connector join = no shuffle + no data locality
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
Cassandra-Optimized Joins
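[Diagram: sensors rows, with owner aliased to username, are matched against users partitions without being moved first]
id | type | owner (as username)
1  | …    | Alice
4  | …    | Alex
3  | …    | Alice
2  | …    | Alice
5  | …    | Alex

users partitions:
username | age | address
Alex     | 37  | …
Alice    | 28  | …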
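Conceptually, the connector join replaces a shuffle of both sides with one keyed lookup per left-side row. The sketch below models that with a `Map` standing in for the users table; `joinWithUsers` is a hypothetical name, and the inner-join behaviour (unmatched rows are dropped) is an assumption about the method shown on the previous slide.

```scala
object LookupJoinSketch {
  final case class Sensor(id: Int, sensorType: String, username: String)
  final case class User(username: String, age: Int)

  // Stand-in for the users table: direct access by partition key.
  val users: Map[String, User] =
    Map("Alice" -> User("Alice", 28), "Alex" -> User("Alex", 37))

  // One keyed lookup per sensor row; rows without a matching user are dropped.
  def joinWithUsers(sensors: Seq[Sensor]): Seq[(Sensor, User)] =
    sensors.flatMap(s => users.get(s.username).map(u => (s, u)))
}
```

No row changes place before the lookup, which is why there is no shuffle; the lookup may still go to a remote node, which is why there is no data locality.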
Cassandra-Aware Partitioning
• Connector join + CAP = shuffle + data locality
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.repartitionByCassandraReplica("iot","users")
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
Cassandra-Aware Partitioning
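[Diagram: after repartitioning, sensors rows sit on the node that holds the matching users partition]
Node holding users partition Alice (age 28, address …):
id | type | owner (as username)
1  | …    | Alice
2  | …    | Alice
3  | …    | Alice
Node holding users partition Alex (age 37, address …):
id | type | owner (as username)
4  | …    | Alex
5  | …    | Alex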
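Cassandra-aware partitioning can be thought of as grouping the left-side rows by the node that stores their join key, so that the subsequent lookups stay local. In the sketch below the key-to-node placement (`ownerNode`) is hypothetical and fixed; in a real cluster it follows from the token ring and the replication settings.

```scala
object ReplicaPartitioningSketch {
  // Hypothetical placement of users partitions on cluster nodes.
  val ownerNode: Map[String, String] = Map("Alice" -> "node1", "Alex" -> "node2")

  // Regroup rows so that each one lands with the node that stores its join key.
  def repartitionByReplica[A](rows: Seq[A])(key: A => String): Map[String, Seq[A]] =
    rows.groupBy(row => ownerNode.getOrElse(key(row), "unknown"))
}
```

The regrouping itself is a shuffle of the left side, which matches the "shuffle + data locality" trade-off stated on the slide.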
Shuffle-Free Grouping
• Suboptimal code
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.as((u:String,i:Int,t:String)=>(u,(i,t)))
.groupByKey
Shuffle-Free Grouping
• Shuffling eliminated at no extra cost
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.as((u:String,i:Int,t:String)=>(u,(i,t)))
.spanByKey
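The reason spanByKey can avoid a shuffle is that rows read from one Cassandra partition arrive together, so grouping only has to merge neighbours. The sketch below is a plain-Scala reimplementation of that idea on a local sequence, written for illustration; it is not the connector's code.

```scala
object SpanByKeySketch {
  // Group consecutive pairs that share a key.
  // Assumes equal keys are adjacent; non-adjacent runs of the same key stay separate.
  def spanByKey[K, V](rows: Seq[(K, V)]): Vector[(K, Vector[V])] =
    rows.foldLeft(Vector.empty[(K, Vector[V])]) { case (acc, (k, v)) =>
      acc.lastOption match {
        case Some((lastKey, values)) if lastKey == k => acc.init :+ (k -> (values :+ v))
        case _                                       => acc :+ (k -> Vector(v))
      }
    }
}
```

Unlike groupByKey, this never compares a row with anything but its predecessor, so it needs no repartitioning; the price is that it is only equivalent to groupByKey when the input is already clustered by key.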
Saving Data to C*
rdd.saveToCassandra("iot","users",
  SomeColumns("username", "age"))
Connector write settings (defaults):
• output.consistency.level (LOCAL_QUORUM)
• output.batch.grouping.key (Partition)
• output.batch.size.bytes (1024)
• output.batch.grouping.buffer.size (1000)
• output.concurrent.writes (5)
5 Live Demo
Artem Chebotko
achebotko@datastax.com
www.linkedin.com/in/artemchebotko
 
ql.io at NodePDX
ql.io at NodePDXql.io at NodePDX
ql.io at NodePDX
 
Presentation
PresentationPresentation
Presentation
 
Marek Suplata Projects
Marek Suplata ProjectsMarek Suplata Projects
Marek Suplata Projects
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Introduction no sql solutions with couchbase and .net core
Introduction no sql solutions with couchbase and .net coreIntroduction no sql solutions with couchbase and .net core
Introduction no sql solutions with couchbase and .net core
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Fast NoSQL from HDDs?
Fast NoSQL from HDDs? Fast NoSQL from HDDs?
Fast NoSQL from HDDs?
 
dfl
dfldfl
dfl
 
Application Monitoring using Open Source: VictoriaMetrics - ClickHouse
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseApplication Monitoring using Open Source: VictoriaMetrics - ClickHouse
Application Monitoring using Open Source: VictoriaMetrics - ClickHouse
 
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
 
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by ScyllaScylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
 
Using SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsUsing SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced Analytics
 
Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.Distributed Queries in IDS: New features.
Distributed Queries in IDS: New features.
 

Big Data-Driven Applications with Cassandra and Spark

  • 1. Big Data-Driven Applications with Cassandra and Spark. Artem Chebotko, Ph.D., Solution Architect
  • 2. (1) Modern Big Data and Cloud Applications; (2) Cassandra and Spark Highlights; (3) Architecture Overview; (4) Languages and APIs; (5) Live Demo
  • 3. Modern Application Requirements • Numerous Endpoints • Geographically Distributed • Continuously Available • Instantaneously Responsive • Immediately Decisive • Predictably Scalable
  • 4-8. Applications by Response Time and Workload (one diagram, built up over five slides; labels: Operational (OLTP), Analytical (OLAP)) • Real-time transactions: Web and IoT apps, Financial transactions • Real-time analytics: Recommendations • Streaming analytics: Fraud prevention • Batch analytics: Predictive models, Fraud detection
  • 9. Roles of Cassandra and Spark (the diagram from slides 4-8, shown with the requirements from slide 3: Numerous Endpoints, Geographically Distributed, Continuously Available, Instantaneously Responsive, Immediately Decisive, Predictably Scalable)
  • 11. Cassandra – Operational Database • Millions of concurrent users • Millisecond response time • Linear scalability • Always on
  • 13. Spark – Analytics Platform • Real-time, streaming and batch analytics • Up to 100x faster than Hadoop MapReduce (in memory) • Scalability, fault-resilience • Versatile and rich API
  • 14. Spark – Analytics Platform (diagram) • Libraries: SQL, Streaming, MLlib, GraphX • Cluster managers: Standalone, YARN, Mesos
  • 15. Spark-Cassandra Connector: Open-Source Package for Spark • Routine Spark-Cassandra interactions: read from and write to Cassandra • Profound optimizations: predicate pushdown, data locality, Cassandra-optimized joins, Cassandra-aware partitioning, shuffle-free grouping
  • 17. C*: Distributed, Shared Nothing, Peer-to-Peer (diagram: C* clients, each with a driver, send transactions to a ring of C* nodes; token range from -2^63 to +2^63 - 1)
  • 18. C*: Partitioning and Replication (diagram, write path: a write request reaches the coordinator; the partitioner maps the partition key to a partition of the table, held by replicas 1, 2 and 3; acknowledgment with CL=QUORUM, RF=3)
  • 19. C*: Partitioning and Replication (diagram, read path: a read request reaches the coordinator; the partitioner maps the partition key to a partition, held by replicas 1, 2 and 3; result with CL=ONE, RF=3)
  • 20. Spark: Master-Worker, Failover Masters (diagram: two Spark clients, each with a Driver and SparkContext; a Master; three Workers, each with two Executors)
  • 21. Spark: Computation Scheduling (diagram: Driver with SparkContext and DAG; Job 0 with Stages 0-2 and Job 1 with Stages 3-5, three tasks per stage; three Executors, each running tasks and holding a cache)
  • 22. Spark-Cassandra Connector (diagram: three C* nodes; a Master; three Workers, each with two Executors and the Spark-Cassandra Connector)
  • 23. Spark-Cassandra Connector (diagram from slide 22, plus cluster node detail: a Spark node with Master JVM, Worker JVM, two Executor JVMs and Connector.jar; a C* node with the C* JVM)
  • 24. Multi-DC Deployment and Workload Separation (diagram: DC Operations, where C* clients with drivers run real-time transactions against C* nodes; DC Analytics, where Spark clients with Driver and SparkContext run interactive and batch analytics on a Master, Workers and Executors alongside C* nodes; C* data replication between the two DCs)
  • 26. Getting Started with Cassandra and Spark Applications • Data Model and Cassandra Query Language • Core Spark and Spark-Cassandra Connector
  • 27. Keyspace and Replication
      CREATE KEYSPACE iot
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'DC-Kyiv-Operations': 3,
                            'DC-Houston-Analytics': 2};
      USE iot;
  • 28. Table with Single-Row Partitions (table users)
      username | age | address
      Alice    | 28  | Santa Clara, CA
      Alex     | 37  | Austin, TX
      CREATE TABLE users (
        username TEXT,
        age INT,
        address TEXT,
        PRIMARY KEY (username)
      );
      SELECT * FROM users WHERE username = ?;
  • 29. Table with Single-Row Partitions (table sensors)
      id | type       | settings                    | owner
      1  | phone      | {gps ⇒ on, pedometer ⇒ on}  | Alice
      2  | wristband  | {heart rate ⇒ on, …}        | Alice
      3  | thermostat | {temp ⇒ 75, …}              | Alice
      4  | security   | {…}                         | Alex
      5  | phone      | {…}                         | Alex
      CREATE TABLE sensors (
        id INT,
        type TEXT,
        settings MAP<TEXT,TEXT>,
        owner TEXT,
        PRIMARY KEY (id)
      );
      SELECT * FROM sensors WHERE id = ?;
  • 30. Table with Multi-Row Partitions (table sensors_by_user; rows in each partition ordered by id ASC)
      username | id | type       | settings             | age | address
      Alice    | 1  | phone      | {gps ⇒ on, …}        | 28  | Santa Clara, CA
      Alice    | 2  | wristband  | {heart rate ⇒ on, …} | 28  | Santa Clara, CA
      Alice    | 3  | thermostat | {temp ⇒ 75, …}       | 28  | Santa Clara, CA
      Alex     | 4  | security   | …                    | 37  | Austin, TX
      Alex     | 5  | phone      | …                    | 37  | Austin, TX
  • 31. Table with Multi-Row Partitions
      CREATE TABLE sensors_by_user (
        username TEXT,
        age INT STATIC,
        address TEXT STATIC,
        id INT,
        type TEXT,
        settings MAP<TEXT,TEXT>,
        PRIMARY KEY (username, id)
      ) WITH CLUSTERING ORDER BY (id ASC);
      SELECT * FROM sensors_by_user WHERE username = ?;
      SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;
      SELECT * FROM sensors_by_user WHERE username = ? AND id > ? ORDER BY id DESC;
  • 32. Retrieving Data from C* • SparkContext, RDD, Connector
      import com.datastax.spark.connector._  // adds cassandraTable to SparkContext
      val rdd = sc.cassandraTable("iot", "sensors_by_user")
        .select("username", "id", "type")
  • 34. Predicate Pushdown • Predicate pushed down to C*
      sc.cassandraTable("iot", "sensors_by_user")
        .select("username", "id", "type")
        // .where replaces .filter(row => row.getString("username") == "Alice"),
        // which would filter in Spark after the rows have been read
        .where("username = 'Alice'")
  • 35. Data Locality (diagram: Cassandra and Spark on the same node) • Connector input settings, value in parentheses: input.split.size_in_mb (64), input.consistency.level (LOCAL_ONE), input.fetch.size_in_rows (1000)
  • 36. Cassandra-Optimized Joins • Standard Spark join = shuffle + shuffle
      val s = sc.cassandraTable("iot", "sensors")
        .keyBy(row => row.getString("owner"))
      val u = sc.cassandraTable("iot", "users")
        .keyBy(row => row.getString("username"))
      s.join(u)
  • 37. Cassandra-Optimized Joins • Shuffle (diagram: map tasks over partitions 1-3 sort their output into buckets A-D and write them from memory to disk (shuffle write); reduce tasks for partitions A-D read the buckets from disk into memory (shuffle read) and aggregate)
  • 38. Cassandra-Optimized Joins • Connector join = no shuffle + no data locality
      sc.cassandraTable("iot", "sensors")
        .select("id", "type", "owner".as("username"))
        .joinWithCassandraTable("iot", "users")
        .on(SomeColumns("username"))
  • 39. Cassandra-Optimized Joins (diagram: sensors rows in their original order, ids 1, 4, 3, 2, 5 with owners Alice, Alex, Alice, Alice, Alex, joined on owner (as username) against the users partitions for Alex (37) and Alice (28))
  • 40. Cassandra-Aware Partitioning • Connector join + Cassandra-aware partitioning (CAP) = shuffle + data locality
      sc.cassandraTable("iot", "sensors")
        .select("id", "type", "owner".as("username"))
        .repartitionByCassandraReplica("iot", "users")
        .joinWithCassandraTable("iot", "users")
        .on(SomeColumns("username"))
  • 41. Cassandra-Aware Partitioning (diagram: after repartitioning, sensors 1-3 (owner Alice) sit with the users partition for Alice (28), and sensors 4-5 (owner Alex) with the users partition for Alex (37))
  • 42. Shuffle-Free Grouping • Suboptimal code
      sc.cassandraTable("iot", "sensors_by_user")
        .select("username", "id", "type")
        .as((u: String, i: Int, t: String) => (u, (i, t)))
        .groupByKey
  • 43. Shuffle-Free Grouping • Shuffling eliminated at no extra cost
      sc.cassandraTable("iot", "sensors_by_user")
        .select("username", "id", "type")
        .as((u: String, i: Int, t: String) => (u, (i, t)))
        .spanByKey  // replaces groupByKey
  • 44. Saving Data to C*
      rdd.saveToCassandra("iot", "users", SomeColumns("username", "age"))
      Connector output settings, value in parentheses: output.consistency.level (LOCAL_QUORUM), output.batch.grouping.key (Partition), output.batch.size.bytes (1024), output.batch.grouping.buffer.size (1000), output.concurrent.writes (5)
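
The consistency levels on slides 18, 19 and 44 (QUORUM, ONE, LOCAL_QUORUM) and the replication settings on slide 27 can be related with a little arithmetic. The sketch below is plain Scala with no Cassandra dependency; the names `ConsistencySketch`, `quorum`, `localQuorum` and `overlaps` are made up for illustration, and the overlap check uses the general rule of thumb that reads see the latest acknowledged write when read replicas + write replicas > RF, which the slides themselves do not state.

```scala
object ConsistencySketch {
  // Replicas per data center, as declared for the iot keyspace on slide 27.
  val replication: Map[String, Int] =
    Map("DC-Kyiv-Operations" -> 3, "DC-Houston-Analytics" -> 2)

  // A quorum is a strict majority of the replicas considered.
  def quorum(replicas: Int): Int = replicas / 2 + 1

  // QUORUM counts replicas in all data centers; LOCAL_QUORUM only the local one.
  def clusterQuorum: Int = quorum(replication.values.sum)
  def localQuorum(dc: String): Int = quorum(replication(dc))

  // Rule of thumb (an assumption here, not from the slides):
  // read and write replica sets are guaranteed to intersect when R + W > RF.
  def overlaps(readReplicas: Int, writeReplicas: Int, rf: Int): Boolean =
    readReplicas + writeReplicas > rf

  def main(args: Array[String]): Unit = {
    println(s"QUORUM across both DCs: $clusterQuorum replicas")
    replication.keys.foreach { dc =>
      println(s"LOCAL_QUORUM in $dc: ${localQuorum(dc)} replicas")
    }
    // Slides 18-19: write at QUORUM, read at ONE, both with RF=3.
    println(s"QUORUM write + ONE read overlap at RF=3: ${overlaps(1, quorum(3), 3)}")
  }
}
```

Under these assumptions, a QUORUM write followed by a ONE read at RF=3 does not guarantee overlap, whereas QUORUM for both reads and writes does.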

Editor's notes

  1. Apple-scale: transactions per second, 1000-node clusters, multiple data centers.
  2. Personalization, product catalogs, fraud detection, Internet of Things, messaging.
  3. In-memory.
  4. This is a quick review of what you already know about Cassandra: peer-to-peer architecture; failure tolerance/availability; Cassandra token ring; data structures (table); data distribution (partition key); data replication (replication factor); data consistency (consistency level).
  5. Master-worker architecture: the Master (aka cluster manager) manages Workers and their resources; Workers instantiate Executors and give them resources (cores and memory); the Driver schedules computation directly with Executors. Failure tolerance: if a Worker or Executor fails, no problem, the computation is picked up by the remaining Workers/Executors; if the Master fails, a new Master is elected (DSE feature), running applications are not affected, and new applications are affected only temporarily, until the new Master is started.
  6. spark.cassandra.input.split.size_in_mb (64), spark.cassandra.input.consistency.level (LOCAL_ONE), spark.cassandra.input.fetch.size_in_rows (1000).
  7. Shuffling costs: disk IO, network traffic, partitioning, external sorting, serialization, deserialization, compression.
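
The shuffle-free grouping on slides 42-43 avoids the costs listed in note 7 because rows of one Cassandra partition arrive next to each other, so grouping only needs to cut the stream where the key changes. The following is a local, cluster-free sketch of that idea in plain Scala. `spanByKeyLocal` is a hypothetical helper written for this illustration, not the connector's own implementation of `spanByKey`.

```scala
object SpanByKeySketch {
  // Groups *consecutive* pairs that share a key, in one pass and in input order.
  // It matches a full group-by only when equal keys are adjacent, which holds
  // for rows read in order from a table partitioned by that key.
  def spanByKeyLocal[K, V](rows: Iterator[(K, V)]): Iterator[(K, Seq[V])] =
    new Iterator[(K, Seq[V])] {
      private val buffered = rows.buffered

      def hasNext: Boolean = buffered.hasNext

      def next(): (K, Seq[V]) = {
        val key = buffered.head._1
        val values = Seq.newBuilder[V]
        // Consume rows until the key changes; nothing is hashed or moved.
        while (buffered.hasNext && buffered.head._1 == key)
          values += buffered.next()._2
        (key, values.result())
      }
    }

  def main(args: Array[String]): Unit = {
    // Rows in the order shown for sensors_by_user on slide 30.
    val rows = Iterator(
      ("Alice", (1, "phone")),
      ("Alice", (2, "wristband")),
      ("Alice", (3, "thermostat")),
      ("Alex", (4, "security")),
      ("Alex", (5, "phone")))

    spanByKeyLocal(rows).foreach { case (user, sensors) =>
      println(s"$user -> ${sensors.mkString(", ")}")
    }
  }
}
```

If the same key reappears later in the input, this sketch emits a second group for it rather than merging the two, which is why the approach depends on the input order.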