Mixing Batch and Real-time: Cassandra with Shark

Richard Low | @richardalow

#CASSANDRAEU

CASSANDRASUMMITEU
About me

* Analytics tech lead at SwiftKey
* Cassandra freelancer
* Previous: lead Cassandra and Analytics dev at Acunu

Outline

* Batch analytics on real-time databases
* Current solutions
* Spark and Shark
* My solution
* Performance results
* Summary & future work

Batch analytics on real-time databases

Batch and real-time analytics

* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA

Example

* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!

Example data model
CREATE TABLE user_accounts (
    userid uuid PRIMARY KEY,
    username text,
    email text,
    password text,
    last_visited timestamp,
    country text
);

Example data model
SELECT * FROM user_accounts LIMIT 2;

 userid   | country | email               | last_visited        | password | username
----------+---------+---------------------+---------------------+----------+----------
 a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88

Marketing walks in

Ad-hoc query

“Please can you find all users from Brazil who haven’t
logged in since July and have an email @yahoo.com.
I need the answer by Monday.”

Ad-hoc query observations

* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning
(*) Mostly: some of the people who haven’t visited for a while may suddenly come back
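
Because the table is keyed only by userid, a request like this has to be answered with a full scan plus a filter. A minimal Python sketch of that filter follows. The row dictionaries, the `matches` and `full_scan` helpers, and the sample user "ana" are illustrative assumptions, not part of the talk; the cutoff date is borrowed from the query shown later in the deck.

```python
from datetime import datetime

# Cutoff borrowed from the later query in the talk: last visit before 1 August 2013.
CUTOFF = datetime(2013, 8, 1)


def matches(row, country="BR", domain="@yahoo.com", cutoff=CUTOFF):
    """Return True if the row satisfies all three predicates of the request."""
    return (
        row["country"] == country
        and row["last_visited"] < cutoff
        and row["email"].endswith(domain)
    )


def full_scan(rows):
    """There is no index on country, so every row has to be examined."""
    return [row["username"] for row in rows if matches(row)]


# Hypothetical sample rows shaped like the user_accounts table.
rows = [
    {"username": "ana", "country": "BR", "email": "ana@yahoo.com",
     "last_visited": datetime(2013, 6, 2, 8, 0, 0)},
    {"username": "rlow", "country": "UK", "email": "richard@wentnet.com",
     "last_visited": datetime(2013, 10, 7, 9, 7, 36)},
    {"username": "jean88", "country": "FR", "email": "jean@yahoo.com",
     "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
]

print(full_scan(rows))
```

The cost is proportional to the whole table rather than to the 500k matching users, which is why the query is awkward to run against a live real-time cluster.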

Why?

* Underrepresented use case in plethora of tools
* Seen days of dev time wasted
* Want to see what can be done

Current solutions

Options

* Run Hive query on top of Cassandra
  * Will compete with Cassandra for
    * I/O
    * Memory
    * CPU
    * Network
  * Will cause extra GC pressure on Cassandra
  * Could flush filesystem cache

Options

* Write ETL script and load into another DB
  * All custom code
  * Single threaded
  * Unreliable
  * Will still flush cache on Cassandra nodes

Options

* Clone the cluster
  * Worst possible network load
  * Manual import each time
  * No incremental update
  * Need duplicate hardware

Options

* Add ‘batch analytics’ DC and run Hive there
  * Initial copy slow and affects real-time performance
  * Need duplicate hardware
  * Will drop writes when really busy

Spark and Shark

Spark

* Developed by AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
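
The lineage idea can be illustrated with a toy example. This is only a sketch of the concept in plain Python, not Spark's API: the `Dataset` class and its methods are hypothetical names invented here. Each derived dataset records how it was produced, so a lost result can be recomputed from the source instead of being restored from a stored copy.

```python
class Dataset:
    """Toy dataset that remembers how it was derived instead of storing results."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, only set on the root dataset
        self.parent = parent  # dataset this one was derived from
        self.fn = fn          # per-element transformation applied to the parent

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return Dataset(parent=self, fn=fn)

    def compute(self):
        # Walk the lineage back to the source and replay each step.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]


base = Dataset(source=[1, 2, 3])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)

first = derived.compute()
# Simulate losing the computed result: it can be rebuilt from lineage alone.
del first
rebuilt = derived.compute()
print(rebuilt)
```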

Spark is used by

Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Shark

* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables
CREATE TABLE user_accounts_cached AS
    SELECT * FROM user_accounts WHERE
    country = 'BR';

Shark on Cassandra

* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra

Shark on Cassandra direct

* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
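
One common way to rate limit reads is a token bucket. The sketch below shows the idea in Python under stated assumptions: the talk does not say how its rate limiting is implemented, and the `RateLimiter` class, the `read_limited` helper and the 10 MB/s figure are all hypothetical. It also does not attempt to bypass the filesystem cache.

```python
import io
import time


class RateLimiter:
    """Token bucket: allows at most `rate` bytes per second on average."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.clock = clock
        self.sleep = sleep
        self.tokens = float(rate)  # start with one second of budget
        self.last = clock()

    def acquire(self, nbytes):
        """Block until `nbytes` may be read; return the seconds waited."""
        now = self.clock()
        # Refill for the elapsed time, capped at one second of budget.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return 0.0
        # Not enough budget: sleep until the deficit has been refilled.
        waited = (nbytes - self.tokens) / self.rate
        self.sleep(waited)
        self.tokens = 0.0
        self.last = self.clock()
        return waited


def read_limited(stream, limiter, chunk_size=64 * 1024):
    """Yield chunks from `stream`, charging each chunk to the limiter."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        limiter.acquire(len(chunk))
        yield chunk


# Stand-in for an SSTable file; 10 MB/s is an arbitrary illustrative limit.
data = io.BytesIO(b"x" * 200_000)
limiter = RateLimiter(rate=10_000_000)
total = sum(len(chunk) for chunk in read_limited(data, limiter))
print(total)
```

Lowering `rate` trades scan time for less I/O contention with the Cassandra process on the same node.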

Cassandra on Spark: through CQL interface
[Diagram: the Spark worker JVM reads through the Cassandra JVM. Data path: SSTables, FS cache, Cassandra JVM (deserialize, merge, serialize), Spark worker JVM (deserialize, process). The remote client sees latency spikes.]

Cassandra on Spark: SSTables direct
[Diagram: the Spark worker JVM (deserialize, process) reads the SSTables directly. The Cassandra JVM (deserialize, merge, serialize) keeps serving the remote client through the FS cache. The remote client sees constant latency.]

Disadvantages

* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables

Performance results

Testing

* 4 node Cassandra cluster on m1.large
  * 2 cores, 7.5 GB RAM, 2 ephemeral disks
* 1 Spark master
* Spark running on Cassandra nodes
  * Limited to 1 core, 1 GB RAM
* Compare CqlStorageHandler with SSTableStorageHandler

Setup

* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
* Each node started with 9GB of data
* No optimization or tuning
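
As a sanity check, the setup numbers can be tied together with a little arithmetic. The sketch below assumes data is spread evenly across the four nodes and that "9GB" means decimal gigabytes; neither is stated on the slides.

```python
records = 100_000_000      # preloaded records (from the slide)
replication_factor = 3     # RF 3 (from the slide)
nodes = 4                  # cluster size (from the Testing slide)
data_per_node_bytes = 9e9  # 9 GB per node, assuming decimal gigabytes

# Each record is stored RF times, spread (assumed evenly) over the nodes.
replicas_per_node = records * replication_factor / nodes
bytes_per_record = data_per_node_bytes / replicas_per_node

print(f"{replicas_per_node:,.0f} record replicas per node")
print(f"{bytes_per_record:.0f} bytes per record on disk")
```

The result, about 120 bytes per record, agrees with the earlier estimate of 60MB for 500k users.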

Tools

* Codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet

Result 1

* No Cassandra load
* Run caching query:
CREATE TABLE user_accounts_cached AS
    SELECT * FROM user_accounts WHERE
    country = 'BR';

* Takes 33 mins through CQL
* Takes 13 mins through SSTables
* 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores
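
The headline figures follow from the two run times. A short check, assuming the caching query scans all 100M preloaded records (the slides imply this but do not state it):

```python
records = 100_000_000  # rows scanned, assumed to be the full preloaded dataset
cql_minutes = 33       # run time through CQL (from the slide)
sstable_minutes = 13   # run time through SSTables (from the slide)

speedup = cql_minutes / sstable_minutes
sstable_rate = records / (sstable_minutes * 60)  # records per second
cql_rate = records / (cql_minutes * 60)          # records per second

print(f"speedup: {speedup:.1f}x")
print(f"SSTables: {sstable_rate / 1000:.0f}k records/s")
print(f"CQL: {cql_rate / 1000:.0f}k records/s")
```

This gives roughly 2.5x and about 128k records/s for the SSTable path, in line with the rounded 130k records/s quoted above.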

Using cached results

* Now have results cached, can run super fast queries
* No I/O or extra memory
* Bounded number of cores

SELECT count(*) FROM user_accounts_cached
    WHERE unix_timestamp(last_visited) <
          unix_timestamp('2013-08-01 00:00:00') AND
          email LIKE '%@c9%';

* Took 18 seconds

Result 2

* Add read load
* Read-modify-write of accounts info
* 200 ops/s
* Measure latency
* Slow down SSTable loader to same rate as CQL

[Chart: latency, with reference levels labeled "mean base" and "95%ile base"]

Analysis

* Average latency 17% lower
  * Probably due to less CPU used by query
* Max 95th percentile latency 33% lower and much more predictable
  * Possibly due to less GC pressure
* Still have a latency increase over base
  * Probably due to I/O use

Result 3

* Keep read workload
* Measure same latency
* Add insert workload
* Insert into separate table
* 2500 ops/s

[Charts: latency with the CQL loader and with the SSTable loader]

Analysis

* Lots of latency with either loader, but with this insert workload there is anyway

Performance wrap up

* 2.5x faster with less CPU
  => uses fewer resources to do the same thing
* Lower, more predictable latencies when at same speed
  => controlled resource usage lowers latency impact
* Could limit further to make impact unnoticeable

Summary

* Discussed analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results

Future

* Needs a name
* GitHub
* Speak to me if you want to use it
* Speak to me if you want to contribute

Thank you!
Richard Low | @richardalow

 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Dernier (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

C* Summit EU 2013: Mixing Batch and Real-Time: Cassandra with Shark

  • 1. Mixing Batch and Real-time: Cassandra with Shark
      Richard Low | @richardalow
      #CASSANDRAEU CASSANDRASUMMITEU
  • 2. About me
      * Analytics tech lead at SwiftKey
      * Cassandra freelancer
      * Previous: lead Cassandra and Analytics dev at Acunu
  • 3. Outline
      * Batch analytics on real-time databases
      * Current solutions
      * Spark and Shark
      * My solution
      * Performance results
      * Summary & future work
  • 4. Batch analytics on real-time databases
  • 5. Batch and real-time analytics
      * Wherever there is data there are unforeseeable queries
      * Real-time databases are optimized for real-time queries
      * Large queries may not be possible
      * Or will impact your real-time SLA
  • 6. Example
      * User accounts database
      * Read-heavy
      * Must be low latency
      * Other tables on same database
      * Some are write heavy
      * A good fit for Cassandra!
  • 7. Example data model
      CREATE TABLE user_accounts (
        userid uuid PRIMARY KEY,
        username text,
        email text,
        password text,
        last_visited timestamp,
        country text
      );
  • 8. Example data model
      SELECT * FROM user_accounts LIMIT 2;

       userid   | country | email               | last_visited        | password | username
      ----------+---------+---------------------+---------------------+----------+----------
       a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
       b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88
  • 10. Ad-hoc query
      “Please can you find all users from Brazil who haven’t logged in since July and have an email @yahoo.com. I need the answer by Monday.”
  • 11. Ad-hoc query observations
      * We have 500k users from Brazil
      * 60MB of raw data
      * No way to extract by country from data model
      * It’s on unchanging data (mostly: some of the people who haven’t visited for a while may suddenly come back)
      * Can take hours, not days
      * No expectation this query will need rerunning
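
The request on slide 10 boils down to a three-part filter that has to touch every row, because the table is keyed by userid only. A minimal Python sketch of that predicate follows. It assumes "since July" means no visit on or after 2013-08-01 (the cutoff used by the cached query on slide 38) and that Brazil is stored as 'BR' (as in the caching query); the function name `matches_request` and the sample rows are made up for illustration.

```python
from datetime import datetime

# Assumed cutoff: "haven't logged in since July" read as last visit before 1 Aug 2013.
CUTOFF = datetime(2013, 8, 1)

def matches_request(row, country="BR", domain="@yahoo.com", cutoff=CUTOFF):
    """Return True if a user_accounts row satisfies the marketing request."""
    return (
        row["country"] == country
        and row["last_visited"] < cutoff
        and row["email"].endswith(domain)
    )

# Hypothetical sample rows shaped like the user_accounts table.
rows = [
    {"username": "ana", "country": "BR", "email": "ana@yahoo.com",
     "last_visited": datetime(2013, 6, 2, 10, 0, 0)},
    {"username": "rlow", "country": "UK", "email": "richard@wentnet.com",
     "last_visited": datetime(2013, 10, 7, 9, 7, 36)},
    {"username": "jean88", "country": "FR", "email": "jean@yahoo.com",
     "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
    {"username": "caio", "country": "BR", "email": "caio@yahoo.com",
     "last_visited": datetime(2013, 9, 1, 8, 0, 0)},
]

# A full scan: every row is inspected because nothing indexes country or email.
matching = [r["username"] for r in rows if matches_request(r)]
print(matching)
```

Only the first sample row passes all three conditions; the point is that the work is proportional to the whole table, not to the size of the answer.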
  • 12. Why?
      * Underrepresented use case in plethora of tools
      * Seen days of dev time wasted
      * Want to see what can be done
  • 14. Options
      * Run Hive query on top of Cassandra
  • 15. Options
      * Run Hive query on top of Cassandra
      * Will compete with Cassandra for
          * I/O
          * Memory
          * CPU
          * Network
      * Will cause extra GC pressure on Cassandra
      * Could flush filesystem cache
  • 16. Options
      * Write ETL script and load into another DB
  • 17. Options
      * Write ETL script and load into another DB
      * All custom code
      * Single threaded
      * Unreliable
      * Will still flush cache on Cassandra nodes
  • 18. Options
      * Clone the cluster
  • 19. Options
      * Clone the cluster
      * Worst possible network load
      * Manual import each time
      * No incremental update
      * Need duplicate hardware
  • 20. Options
      * Add ‘batch analytics’ DC and run Hive there
  • 21. Options
      * Add ‘batch analytics’ DC and run Hive there
      * Initial copy slow and affects real-time performance
      * Need duplicate hardware
      * Will drop writes when really busy
  • 23. Spark
      * Developed by AMPLab
      * Distributed computation, like Hadoop
      * Designed for iterative algorithms
      * Much faster for queries with working sets that fit in RAM
      * Reliability from storing lineage rather than intermediate results
      * Runs on Mesos or YARN
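
The lineage bullet on slide 23 can be illustrated with a toy model: instead of persisting intermediate results, each dataset remembers its parent and the function that derived it, so a lost result is rebuilt by replaying that chain. This is a simplified Python sketch of the idea, not Spark's API; the class `LineageDataset` and its methods are hypothetical.

```python
class LineageDataset:
    """Toy dataset that stores how it was derived rather than a durable copy."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base data (only set on the root dataset)
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function applied to the parent's rows
        self._cache = None           # in-memory result; may be lost at any time

    def map(self, fn):
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                # Recompute from lineage: replay the transform on the parent.
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory result, e.g. through a worker failure."""
        self._cache = None


base = LineageDataset(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_cache()           # the cached result disappears
recovered = evens_squared.collect()  # rebuilt by replaying the recorded map step
```

Nothing was written out between the filter and the map, yet the result survives the simulated loss because the recipe for it was kept.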
  • 24. Spark is used by
      Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
  • 25. Shark
      * Hive on Spark
      * Completely compatible with Hive
      * Same QL, UDFs and storage handlers
      * Can cache tables
  • 26. Shark
      * Hive on Spark
      * Completely compatible with Hive
      * Same QL, UDFs and storage handlers
      * Can cache tables
      CREATE TABLE user_accounts_cached AS
        SELECT * FROM user_accounts WHERE country = 'BR';
  • 28. Shark on Cassandra
      * CqlStorageHandler
      * Can use existing hive-cassandra storage handler
      * Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
      * But suffers from same problems as Hive+Hadoop on Cassandra
  • 29. Shark on Cassandra direct
      * SSTableStorageHandler
      * Run Spark workers on the Cassandra nodes
      * Read directly from SSTables in separate JVM
      * Limit CPU and memory through Spark/Mesos/YARN
      * Limit I/O by rate limiting raw disk access
      * Skip filesystem cache
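
One common way to rate limit reads is a token bucket placed in front of the file reads. The Python sketch below is an assumption about how such throttling could look, not the talk's SSTableStorageHandler code; `TokenBucket`, `throttled_read`, `FakeClock` and the 1 MB/s figure are hypothetical. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import io
import time

class TokenBucket:
    """Allow at most `rate` bytes per second, with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, n):
        """Block until n bytes' worth of tokens are available, then spend them."""
        if n > self.capacity:
            raise ValueError("request larger than bucket capacity")
        self._refill()
        while self.tokens < n:
            self.sleep((n - self.tokens) / self.rate)
            self._refill()
        self.tokens -= n


def throttled_read(stream, bucket, chunk_size):
    """Yield chunks from a binary stream, paying tokens before each read."""
    while True:
        bucket.acquire(chunk_size)
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


class FakeClock:
    """Deterministic clock: sleeping just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
bucket = TokenBucket(rate=1_000_000, capacity=250_000,
                     clock=fake.time, sleep=fake.sleep)
data = io.BytesIO(b"x" * 1_000_000)   # stand-in for an SSTable file
total = sum(len(c) for c in throttled_read(data, bucket, chunk_size=250_000))
elapsed = fake.now
```

With the default `time.monotonic` and `time.sleep` the same code throttles a real file handle; lowering `rate` trades batch throughput for less interference with the real-time workload.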
  • 30. Cassandra on Spark: through CQL interface
      [Diagram. Labels: SSTables; FS Cache; Cassandra JVM (Deserialize, Merge, Serialize); Spark worker JVM (Deserialize, Process); Remote client. Annotation: Latency spikes!]
  • 31. Cassandra on Spark: SSTables direct
      [Diagram. Labels: SSTables; Spark worker JVM (Deserialize, Process); FS Cache; Cassandra JVM (Deserialize, Merge, Serialize); Remote client. Annotation: Constant latency]
  • 32. Disadvantages
      * Equivalent to CL.ONE
      * Always runs task local with the data
      * Doesn’t read data in memtables
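
The memtable limitation follows from Cassandra's write path: a write lands in an in-memory memtable first and only appears in an SSTable file after a flush. A toy Python model of that behaviour, with last-write-wins merging by timestamp, is sketched below. The class `ToyNode` and its methods are hypothetical and greatly simplified; they are not the handler from the talk.

```python
class ToyNode:
    """Greatly simplified model of one node's write path."""

    def __init__(self):
        self.memtable = {}   # key -> (timestamp, value), in memory only
        self.sstables = []   # list of immutable dicts, as if written to disk

    def write(self, key, value, timestamp):
        self.memtable[key] = (timestamp, value)

    def flush(self):
        """Turn the current memtable into a new immutable SSTable."""
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    @staticmethod
    def _merge(sources):
        merged = {}
        for table in sources:
            for key, (ts, value) in table.items():
                # Last write wins: keep the version with the newest timestamp.
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        return {k: v for k, (ts, v) in merged.items()}

    def read_via_cassandra(self):
        """Normal read path: memtable and SSTables are merged."""
        return self._merge(self.sstables + [self.memtable])

    def read_sstables_direct(self):
        """Direct read of the files on disk: unflushed writes are invisible."""
        return self._merge(self.sstables)


node = ToyNode()
node.write("rlow", "UK", timestamp=1)
node.flush()
node.write("rlow", "BR", timestamp=2)   # newer value, still in the memtable

via_cassandra = node.read_via_cassandra()
direct = node.read_sstables_direct()
node.flush()
direct_after_flush = node.read_sstables_direct()
```

The direct read returns the older value until the next flush, which is acceptable for the mostly unchanging data in the example but matters for tables with fresh writes.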
  • 34. Testing
      * 4 node Cassandra cluster on m1.large
      * 2 cores, 7.5 GB RAM, 2 ephemeral disks
      * 1 Spark master
      * Spark running on Cassandra nodes
      * Limited to 1 core, 1 GB RAM
      * Compare CqlStorageHandler with SSTableStorageHandler
  • 35. Setup
      * Cassandra 1.2.10
      * 3 GB heap
      * 256 tokens per node
      * RF 3
      * Preloaded 100M randomly generated records
      * Each node started with 9GB of data
      * No optimization or tuning
  • 36. Tools
      * codahale Metrics
      * Ganglia
      * Load generator using DataStax Java driver
      * Google spreadsheet
  • 37. Result 1
      * No Cassandra load
      * Run caching query:
      CREATE TABLE user_accounts_cached AS
        SELECT * FROM user_accounts WHERE country = 'BR';
      * Takes 33 mins through CQL
      * Takes 13 mins through SSTables
      * 130k records/s
      * => SSTables is 2.5x faster
      * Even better since CQL has access to both cores
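
The throughput and speed-up figures on slide 37 can be checked against the setup on slide 35. A quick Python calculation, assuming the 130k records/s figure is the 100M preloaded records divided by the 13-minute SSTable run (the slides do not state how it was derived):

```python
RECORDS = 100_000_000      # preloaded records (slide 35)
CQL_MINUTES = 33           # caching query through CQL
SSTABLE_MINUTES = 13       # caching query through SSTables

def records_per_second(records, minutes):
    """Average scan rate for a run that took the given number of minutes."""
    return records / (minutes * 60)

sstable_rate = records_per_second(RECORDS, SSTABLE_MINUTES)
cql_rate = records_per_second(RECORDS, CQL_MINUTES)
speedup = CQL_MINUTES / SSTABLE_MINUTES

print(f"SSTables: {sstable_rate:,.0f} records/s")
print(f"CQL:      {cql_rate:,.0f} records/s")
print(f"Speed-up: {speedup:.2f}x")
```

Under that assumption the SSTable path scans about 128k records/s, consistent with the quoted 130k, against about 51k records/s through CQL, a ratio of roughly 2.5x.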
  • 38. Using cached results
      * Now have results cached, can run super fast queries
      * No I/O or extra memory
      * Bounded number of cores
      SELECT count(*) FROM user_accounts_cached
        WHERE unix_timestamp(last_visited) < unix_timestamp('2013-08-01 00:00:00')
        AND email LIKE '%@c9%';
      * Took 18 seconds
  • 39. Result 2
      * Add read load
      * Read-modify-write of accounts info
      * 200 ops/s
      * Measure latency
      * Slow down SSTable loader to same rate as CQL
  • 41. Analysis
      * Average latency 17% lower
      * Probably due to less CPU used by query
      * Max 95th %ile latency 33% lower and much more predictable
      * Possibly due to less GC pressure
      * Still have a latency increase over base
      * Probably due to I/O use
  • 42. Result 3
      * Keep read workload
      * Measure same latency
      * Add insert workload
      * Insert into separate table
      * 2500 ops/s
  • 44. Analysis
      * Latency is high, but it is high even without the batch query running
  • 45. Performance wrap up
      * 2.5x faster with less CPU
        => uses fewer resources to do the same thing
      * Lower, more predictable latencies when at same speed
        => controlled resource usage lowers latency impact
      * Could limit further to make impact unnoticeable
  • 47. Summary
      * Discussed analytics use case not well served by current tools
      * Spark, Shark
      * SSTableStorageHandler
      * Performance results
  • 48. Future
      * Needs a name
      * Github
      * Speak to me if you want to use it
      * Speak to me if you want to contribute
  • 49. Thank you!
      Richard Low | @richardalow