Speaker: Richard Low, Analytics Tech Lead at SwiftKey
Video: http://www.youtube.com/watch?v=QTb4HTwVMq0&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=2
Everything Cassandra does is designed for a real-time workload of high-volume inserts and frequent small queries. Cassandra has Hadoop and Hive integration, but performing long-running ad-hoc queries with these tools either impacts real-time performance or requires duplicate clusters. This talk will explain how I'm integrating Cassandra with Shark, a drop-in Hive replacement developed by Berkeley's AMPLab. It's designed to give fine-grained control over all resource usage so you can safely run arbitrary ad-hoc queries on your existing cluster with controlled and predictable impact.
5. Batch and real-time analytics
* Wherever there is data there are unforeseeable queries
* Real-time databases are optimized for real-time queries
* Large queries may not be possible
* Or will impact your real-time SLA
#CASSANDRAEU
@richardalow
6. Example
* User accounts database
* Read-heavy
* Must be low latency
* Other tables on same database
* Some are write heavy
* A good fit for Cassandra!
7. Example data model
CREATE TABLE user_accounts (
userid uuid PRIMARY KEY,
username text,
email text,
password text,
last_visited timestamp,
country text
);
8. Example data model
SELECT * FROM user_accounts LIMIT 2;
 userid   | country | email               | last_visited        | password | username
----------+---------+---------------------+---------------------+----------+----------
 a03dcf03 | UK      | richard@wentnet.com | 2013-10-07 09:07:36 | td7rjxwp | rlow
 b3f1871e | FR      | jean@yahoo.com      | 2013-08-17 13:07:36 | moh7eksn | jean88
10. Ad-hoc query
“Please can you find all users from Brazil who haven’t
logged in since July and have an email @yahoo.com.
I need the answer by Monday.”
11. Ad-hoc query observations
* We have 500k users from Brazil
* 60MB of raw data
* No way to extract by country from data model
* It’s on unchanging data*
* Can take hours, not days
* No expectation this query will need rerunning
(* Mostly unchanging: some of the people who haven’t visited for a while may suddenly come back.)
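To make the shape of the request concrete, here is a minimal sketch of the filter it implies, written over a handful of made-up in-memory rows. The rows, the `matches` and `scan` helpers, and the cutoff of 1 August 2013 as the meaning of "since July" are all assumptions for illustration. The point is only that the table is keyed by userid, so answering by country means touching every row.

```python
from datetime import datetime

# Hypothetical rows shaped like user_accounts (all values are made up).
rows = [
    {"username": "ana", "email": "ana@yahoo.com", "country": "BR",
     "last_visited": datetime(2013, 6, 2, 10, 0, 0)},
    {"username": "rui", "email": "rui@example.com", "country": "BR",
     "last_visited": datetime(2013, 5, 1, 8, 30, 0)},
    {"username": "bia", "email": "bia@yahoo.com", "country": "BR",
     "last_visited": datetime(2013, 9, 15, 12, 0, 0)},
    {"username": "jean88", "email": "jean@yahoo.com", "country": "FR",
     "last_visited": datetime(2013, 8, 17, 13, 7, 36)},
]

# Assumption: "haven't logged in since July" means last visit before 1 August.
CUTOFF = datetime(2013, 8, 1)


def matches(row):
    """True if the row satisfies all three conditions of the request."""
    return (row["country"] == "BR"
            and row["last_visited"] < CUTOFF
            and row["email"].endswith("@yahoo.com"))


def scan(all_rows):
    """Full scan: no index on country, so every row is examined."""
    return [r["username"] for r in all_rows if matches(r)]


print(scan(rows))
```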
12. Why?
* Use case is underrepresented despite the plethora of tools
* Seen days of dev time wasted
* Want to see what can be done
15. Options
* Run Hive query on top of Cassandra
* Will compete with Cassandra for
* I/O
* Memory
* CPU
* Network
* Will cause extra GC pressure on Cassandra
* Could flush filesystem cache
17. Options
* Write ETL script and load into another DB
* All custom code
* Single threaded
* Unreliable
* Will still flush cache on Cassandra nodes
19. Options
* Clone the cluster
* Worst possible network load
* Manual import each time
* No incremental update
* Need duplicate hardware
21. Options
* Add ‘batch analytics’ DC and run Hive there
* Initial copy slow and affects real-time performance
* Need duplicate hardware
* Will drop writes when really busy
23. Spark
* Developed by AMPLab
* Distributed computation, like Hadoop
* Designed for iterative algorithms
* Much faster for queries with working sets that fit in RAM
* Reliability from storing lineage rather than intermediate results
* Runs on Mesos or YARN
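The lineage point can be illustrated with a toy model. This is a sketch of the idea only, not Spark's API: the `Dataset` class and its `lose` method are hypothetical names. Each dataset remembers its parent and the transformation that produced it, so a lost result can be recomputed instead of being restored from a stored intermediate copy.

```python
class Dataset:
    """Toy lineage-tracked dataset (illustrative only, not Spark's RDD API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source    # raw input, set only on the root dataset
        self.parent = parent    # lineage: the dataset this one derives from
        self.fn = fn            # lineage: how to derive it from the parent
        self._cache = None      # materialized result, which may be lost

    def map(self, fn):
        return Dataset(parent=self, fn=lambda items: [fn(x) for x in items])

    def filter(self, pred):
        return Dataset(parent=self,
                       fn=lambda items: [x for x in items if pred(x)])

    def collect(self):
        """Return the data, recomputing from lineage if nothing is cached."""
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = self.fn(self.parent.collect())
        return self._cache

    def lose(self):
        """Simulate losing the materialized result, e.g. a worker failure."""
        self._cache = None


base = Dataset(source=range(10))
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = even_squares.collect()
even_squares.lose()               # the result is gone, the lineage is not
second = even_squares.collect()   # rebuilt by replaying filter and map
print(first == second, second)
```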
24. Spark is used by
Source: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
26. Shark
* Hive on Spark
* Completely compatible with Hive
* Same QL, UDFs and storage handlers
* Can cache tables
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE
country = 'BR';
28. Shark on Cassandra
* CqlStorageHandler
* Can use existing hive-cassandra storage handler
* Can work well - see Evan Chan’s talk (Ooyala) from #cassandra13
* But suffers from same problems as Hive+Hadoop on Cassandra
29. Shark on Cassandra direct
* SSTableStorageHandler
* Run Spark workers on the Cassandra nodes
* Read directly from SSTables in separate JVM
* Limit CPU and memory through Spark/Mesos/YARN
* Limit I/O by rate limiting raw disk access
* Skip filesystem cache
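As an illustration of the I/O rate-limiting idea, the sketch below wraps a file-like object and sleeps so that the average read rate stays under a target. It is not the talk's implementation: `ThrottledReader`, `FakeClock` and the 100 bytes/s figure are hypothetical, and bypassing the filesystem cache is not shown at all.

```python
import io
import time


class ThrottledReader:
    """Reads from a file-like object at no more than max_bytes_per_sec."""

    def __init__(self, raw, max_bytes_per_sec,
                 clock=time.monotonic, sleep=time.sleep):
        self.raw = raw
        self.rate = float(max_bytes_per_sec)
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.sent = 0

    def read(self, size):
        data = self.raw.read(size)
        self.sent += len(data)
        # Earliest moment at which `sent` bytes are allowed at the target rate.
        due = self.start + self.sent / self.rate
        delay = due - self.clock()
        if delay > 0:
            self.sleep(delay)
        return data


class FakeClock:
    """Simulated clock so the demo finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def monotonic(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
reader = ThrottledReader(io.BytesIO(b"x" * 1000), max_bytes_per_sec=100,
                         clock=fake.monotonic, sleep=fake.sleep)
total = 0
while True:
    chunk = reader.read(250)
    if not chunk:
        break
    total += len(chunk)
print(total, "bytes in", fake.now, "simulated seconds")
```

With the real `time.monotonic` and `time.sleep` defaults the same wrapper throttles an actual file. The clock is injected here only so the demo does not have to wait.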
30. Cassandra on Spark: through CQL interface
[Diagram: SSTables -> FS cache -> Cassandra JVM (deserialize, merge, serialize) -> Spark worker JVM (deserialize, process). The remote client, served by the same Cassandra JVM, sees latency spikes.]
31. Cassandra on Spark: SSTables direct
[Diagram: the Spark worker JVM reads the SSTables directly (deserialize, process). The remote client is still served through the FS cache and the Cassandra JVM (deserialize, merge, serialize) and sees constant latency.]
32. Disadvantages
* Equivalent to CL.ONE
* Always runs task local with the data
* Doesn’t read data in memtables
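The memtable limitation follows from Cassandra's write path: writes land in an in-memory memtable first and reach an SSTable on disk only when flushed. The toy model below sketches that consequence. `ToyNode` and its methods are hypothetical, and picking the newest SSTable first stands in for Cassandra's real timestamp-based reconciliation.

```python
class ToyNode:
    """Very simplified single-node write path: memtable, then SSTables."""

    def __init__(self):
        self.memtable = {}    # recent writes, in memory only
        self.sstables = []    # flushed immutable tables, oldest first

    def write(self, key, value):
        self.memtable[key] = value

    def flush(self):
        """Persist the memtable as a new SSTable and start a fresh one."""
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def _read_flushed(self, key):
        # Simplification: newest SSTable wins (real Cassandra compares
        # per-cell write timestamps instead).
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def read_via_cassandra(self, key):
        """Normal read path: memtable first, then SSTables."""
        if key in self.memtable:
            return self.memtable[key]
        return self._read_flushed(key)

    def read_sstables_direct(self, key):
        """Direct reader: only sees what has been flushed to disk."""
        return self._read_flushed(key)


node = ToyNode()
node.write("u1", "old@example.com")
node.flush()
node.write("u1", "new@example.com")     # still only in the memtable

print(node.read_via_cassandra("u1"))    # sees the unflushed write
print(node.read_sstables_direct("u1"))  # still sees the flushed value
```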
35. Setup
* Cassandra 1.2.10
* 3 GB heap
* 256 tokens per node
* RF 3
* Preloaded 100M randomly generated records
* Each node started with 9GB of data
* No optimization or tuning
36. Tools
* codahale Metrics
* Ganglia
* Load generator using DataStax Java driver
* Google spreadsheet
37. Result 1
* No Cassandra load
* Run caching query:
CREATE TABLE user_accounts_cached AS
SELECT * FROM user_accounts WHERE
country = 'BR';
* Takes 33 mins through CQL
* Takes 13 mins through SSTables
* 130k records/s
* => SSTables is 2.5x faster
* Even better since CQL has access to both cores
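The headline figures can be checked with a few lines of arithmetic. This sketch assumes the caching query scanned all 100M preloaded records from the Setup slide. The CQL throughput is derived here and does not appear on the slides.

```python
# Figures from the Setup and Result 1 slides.
records = 100_000_000     # assumption: the whole preloaded set was scanned
cql_minutes = 33
sstable_minutes = 13

speedup = cql_minutes / sstable_minutes
sstable_rate = records / (sstable_minutes * 60)   # records per second
cql_rate = records / (cql_minutes * 60)           # derived, not on the slide

print(f"speedup: {speedup:.2f}x")
print(f"SSTables: {sstable_rate:,.0f} records/s")
print(f"CQL: {cql_rate:,.0f} records/s")
```

Under that assumption the SSTable path comes out at roughly 128k records/s, consistent with the quoted 130k, and the ratio rounds to the quoted 2.5x.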
38. Using cached results
* Now that results are cached, can run super-fast queries
* No I/O or extra memory
* Bounded number of cores
SELECT count(*) FROM user_accounts_cached
WHERE unix_timestamp(last_visited)<
unix_timestamp('2013-08-01 00:00:00') AND
email LIKE '%@c9%';
* Took 18 seconds
39. Result 2
* Add read load
* Read-modify-write of accounts info
* 200 ops/s
* Measure latency
* Slow down SSTable loader to same rate as CQL
41. Analysis
* Average latency 17% lower
* Probably due to less CPU used by query
* Max 95th %ile latency 33% lower and much more predictable
* Possibly due to less GC pressure
* Still have a latency increase over base
* Probably due to I/O use
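For readers unfamiliar with the two metrics, here is a minimal sketch of how an average latency and a "max 95th percentile" might be computed from raw samples. It assumes the maximum is taken over per-window 95th percentiles using the nearest-rank method. The slides do not say how the figures were actually derived, and the sample values are made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(windows):
    """Return (mean over all samples, worst per-window 95th percentile)."""
    all_samples = [x for window in windows for x in window]
    mean = sum(all_samples) / len(all_samples)
    worst_p95 = max(percentile(window, 95) for window in windows)
    return mean, worst_p95


# Made-up latencies in milliseconds, one list per measurement window.
window_a = list(range(1, 21))     # 1..20 ms
window_b = [10] * 19 + [400]      # steady, with one large outlier
mean_ms, worst_p95_ms = summarize([window_a, window_b])
print(mean_ms, worst_p95_ms)
```

In this made-up data the single 400 ms outlier raises the mean but leaves that window's 95th percentile at 10 ms, which is why the two metrics are reported separately.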
42. Result 3
* Keep read workload
* Measure same latency
* Add insert workload
* Insert into separate table
* 2500 ops/s
44. Analysis
* Latency is high, but it is high under this load anyway
45. Performance wrap up
* 2.5x faster with less CPU
=> uses fewer resources to do the same thing
* Lower, more predictable latencies when at same speed
=> controlled resource usage lowers latency impact
* Could limit further to make impact unnoticeable
47. Summary
* Discussed analytics use case not well served by current tools
* Spark, Shark
* SSTableStorageHandler
* Performance results
48. Future
* Needs a name
* Github
* Speak to me if you want to use it
* Speak to me if you want to contribute