Breakthrough OLAP performance with Cassandra and Spark
Cassandra and Spark
Who am I?
User and contributor to Spark since 0.9, Cassandra since 0.6
Co-creator and maintainer of Spark Job Server
Tuplejump is a big data technology leader providing solutions for
rapid insights from data:
- the first Spark-Cassandra integration
- an open source Lucene indexer for Cassandra
- open source HDFS for Cassandra
Didn't I attend the same talk last year?
Similar title, but mostly new material
Will reveal new open source projects! :)
Need analytical database / queries on structured big data
Something SQL-like, very flexible and fast
Pre-aggregation too limiting
Fast data / constant updates
Ideally, want my queries to run over fresh data too
Example: Video analytics
Typical collection and analysis of consumer events
3 billion new events every day
Video publishers want updated stats, the sooner the better
Pre-aggregation only enables simple dashboard UIs
What if one wants to offer more advanced analysis, or a
generic data query API?
E.g., top countries filtered by device type, OS, browser
Scalable - rules out PostgreSQL, etc.
Easy to update and ingest new data
Not traditional OLAP cubes - that's not what I'm talking about
Very fast for analytical queries - OLAP not OLTP
Extremely flexible queries
Preferably open source
Widely used, lots of support (Spark, Impala, etc.)
Problem: Parquet is read-optimized, not easy to use for writes
Cannot support idempotent writes
Optimized for writing very large chunks, not small updates
Not suitable for time series, IoT, etc.
Often needs multiple passes of jobs for compaction of small
files, deduplication, etc.
People really want a database-like abstraction, not a file format!
Turns out this has been solved before!
Even Facebook uses Vertica
Easy writes plus fast queries, with constant transfers
Automatic query optimization by storing intermediate query results
C-Store - Stonebraker et al. paper (Brown Univ.)
What's wrong with MPP Databases?
Usually don't scale horizontally that well (or cost is prohibitive)
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
Perfect for ingestion of real time / machine data
Best of breed storage technology, huge community
BUT: Simple queries only
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort
SQL, machine learning, streaming, graph, R, and many more plugins,
all on ONE platform - feed your SQL results to a logistic regression,
for example
Huge number of connectors to every single storage technology
Spark provides the missing fast, deep
analytics piece of Cassandra!
Separate Storage and Query Layers
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency
Appeared with Spark 1.0
In-memory columnar store
Parquet, Json, Cassandra connector, Avro, many more
SQL as well as DataFrames (Pandas-style) API
Indexing integrated into data sources (e.g. C* secondary indexes)
Write custom functions in Scala... take that, Hive UDFs!
Integrates well with MLBase, Scala/Java/Python
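To make the UDF point concrete, here is a hedged sketch of registering a plain Scala function with Spark SQL (Spark 1.3+ API); the `events` table and `url` column are invented placeholders:

```scala
// In spark-shell, sqlContext is predefined.
// Hypothetical example - the "events" table and "url" column are placeholders.
sqlContext.udf.register("domainOf", (url: String) => url.split('/')(2))

// The Scala closure is now callable directly from SQL text:
sqlContext.sql("SELECT domainOf(url), COUNT(*) FROM events GROUP BY domainOf(url)")
```

No separate UDF class, no Hive-style packaging: any Scala closure becomes a SQL function in one line.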
Connecting Spark to Cassandra
Spark Cassandra Connector
Get started in one line with spark-shell!
Caching a SQL Table from Cassandra
DataFrames support in Cassandra Connector 1.4.0 (and 1.3.0):
Spark does no caching by default - you will always be reading from Cassandra
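A minimal sketch of reading a Cassandra table as a DataFrame and caching it, assuming connector 1.4.x in spark-shell; the keyspace and table names ("test", "gdelt") are placeholders:

```scala
// spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "gdelt"))
  .load()

df.registerTempTable("gdelt")
// Marks the table for Spark SQL's compressed in-memory columnar cache;
// the cache is materialized on the first scan.
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT COUNT(*) FROM gdelt").show()
```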
Spark Cached Tables can be Really Fast
GDELT dataset, 4 million rows, 60 columns, localhost
Almost a 1000x speedup!
On an 8-node EC2 c3.XL cluster with 117 million rows, common queries
run in 1-2 seconds against the cached dataset.
Tuning Connector Partitioning
Guideline: One split per partition, one partition per CPU core
Much more parallelism won't speed up the job much, but will
starve other C* requests
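A hedged configuration sketch for the guideline above; the setting name follows the connector 1.4 documentation, so verify it against your connector version:

```scala
// Hypothetical tuning - check these property names for your connector version.
val conf = new org.apache.spark.SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Smaller split size => more Spark partitions per C* table scan
  .set("spark.cassandra.input.split.size_in_mb", "64")
```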
Problems with Cached Tables
Still have to read the data from Cassandra first, which is slow
Amount of RAM: your entire data + extra for conversion to the cached table format
Cached tables only live in Spark executors - by default
tied to single context - not HA
once any executor dies, must re-read data from C*
Caching takes time: converting from RDD[Row] to a compressed columnar format
Cannot easily combine new RDD[Row] with cached tables
(and keep speed)
If you don't have enough RAM, Spark can cache your tables
partly to disk. This is still way, way faster than scanning an entire
C* table. However, cached tables are still tied to a single Spark context.
Also: rdd.cache() is NOT the same as SQLContext's cacheTable!
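To make that last contrast concrete, a sketch of the two calls (keyspace/table names are placeholders, and `cassandraTable` assumes `import com.datastax.spark.connector._`):

```scala
import com.datastax.spark.connector._

// rdd.cache() stores partitions as generic JVM objects - no columnar layout
val rdd = sc.cassandraTable("test", "gdelt")
rdd.cache()

// cacheTable() builds Spark SQL's compressed, columnar in-memory store
sqlContext.cacheTable("gdelt")
```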
What about C* Secondary Indexing?
Spark-Cassandra Connector and Calliope can both reduce I/O by
using Cassandra secondary indices. Does this work with caching?
No, not really, because only the filtered rows would be cached.
Subsequent queries against this limited cached table would not
give you expected results.
Intro to Tachyon
Tachyon: an in-memory cache for HDFS and other binary data
Keeps data off-heap, so multiple Spark applications/executors
can share data
Solves HA problem for data
Wait, wait, wait!
What am I caching exactly? Tachyon is designed for caching files
or binary blobs.
A serialized form of CassandraRow/CassandraRDD?
Raw output from Cassandra driver?
What you really want is this:
Cassandra SSTable -> Tachyon (as row cache) -> CQL -> Spark
Bad programmers worry about the code. Good
programmers worry about data structures.
- Linus Torvalds
Are we really thinking holistically about data modelling, caching,
and how it affects the entire system architecture?
Efficient Columnar Storage in Cassandra
Wait, I thought Cassandra was columnar?
How Cassandra stores your CQL Tables
Suppose you had this CQL table:
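The original table definition isn't preserved in this transcript; a hypothetical CQL schema (all names invented) that would produce the storage layout shown below is:

```sql
-- Hypothetical schema matching the layout below
CREATE TABLE employees (
  dept  text,
  empid text,
  first text,
  last  text,
  age   int,
  PRIMARY KEY (dept, empid)
);
```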
PartitionKey | 01:first | 01:last | 01:age | 02:first | 02:last  | 02:age
Sales        | Bob      | Jones   | 34     | Susan    | O'Connor | 40
Engineering  | Dilbert  | P?      | ?      | Dogbert  | Dog      | 1
Each row is stored contiguously. All columns in row 2 come after row 1.
To analyze only age, C* still has to read every field.
Cassandra is really a row-based, OLTP-oriented datastore.
Unless you know how to use it otherwise :)
The traditional row-based data storage
approach is dead
- Michael Stonebraker
Columnar Storage (Cassandra)
Review: each physical row in Cassandra (e.g. a "partition key")
stores its columns together on disk.
Rowkey | 0  | 1
Name   | 0  | 1
Age    | 46 | 66
Columnar Format solves I/O
Dictionary compression - HUGE savings for low-cardinality columns
RLE, other techniques
Only columns needed for query are loaded from disk
Batch multiple rows in one cell for efficiency (avoid clustering key overhead)
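The two encodings named above can be illustrated with a minimal, self-contained sketch (toy implementations, not the actual on-disk format):

```scala
// Toy illustrations of dictionary compression and run-length encoding (RLE).
object ColumnarEncodings {
  // Dictionary compression: store each distinct string once;
  // the column itself becomes a vector of small integer codes.
  def dictEncode(col: Seq[String]): (Vector[String], Vector[Int]) = {
    val dict  = col.distinct.toVector
    val index = dict.zipWithIndex.toMap
    (dict, col.map(index).toVector)
  }

  // RLE: collapse runs of repeated values into (value, runLength) pairs.
  def rle[A](col: Seq[A]): Vector[(A, Int)] =
    col.foldLeft(Vector.empty[(A, Int)]) {
      case (init :+ ((v, n)), x) if v == x => init :+ ((v, n + 1))
      case (acc, x)                        => acc :+ ((x, 1))
    }

  def main(args: Array[String]): Unit = {
    println(dictEncode(Seq("US", "US", "DE", "US")))
    println(rle(Seq("a", "a", "a", "b")))
  }
}
```

A low-cardinality column like country code shrinks to one int per row plus a tiny dictionary, which is where the "HUGE savings" come from.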
Columnar Format solves Caching
Use the same format on disk, in cache, in memory scan
Caching works a lot better when the cached object has the same format as the data on disk
No data format dissonance means bringing in new bits of data
and combining with existing cached data is seamless
So, why isn't everybody doing this?
No columnar storage format designed to work with NoSQL stores
Efficient conversion to/from columnar format a hard problem
Most infrastructure is still row oriented
Spark SQL/DataFrames based on RDD[Row]
Spark Catalyst is a row-oriented query parser
All hard work leads to profit, but mere talk leads only to poverty.
- Proverbs 14:23
Columnar Storage Performance Study
GDELT: Global Database of Events, Language, and Tone
1979 to now, 60 columns, 250 million+ rows, 250GB+
Let's compare Cassandra I/O only, no caching or Spark
1. Narrow table - CQL table with one row per partition key
2. Wide table - wide rows with 10,000 logical rows per partition
3. Columnar layout - 1000 rows per columnar chunk, wide rows,
with dictionary compression
First 4 million rows, localhost, SSD, C* 2.0.9, LZ4 compression.
Compaction performed before read benchmarks.
Query and ingest times
Scenario       Ingest    Read all   Read one column
Narrow table   505 sec   504 sec    -
Wide table     365 sec   351 sec    -
Columnar       93 sec    8.6 sec    0.23 sec
On reads, using a columnar format is up to 2190x faster, while
ingestion is 20-40x faster.
Of course, real-life perf gains will depend heavily on the query,
table width, etc.
Disk space usage
Scenario       Disk used
Narrow table   2.7 GB
Wide table     1.6 GB
Columnar       0.34 GB
The disk space usage helps explain some of the numbers.
The filo project is a binary data vector library
designed for extreme read performance with minimal deserialization cost
Designed for NoSQL, not a file format
random or linear access
on or off heap
missing value support
Scala only, but cross-platform support possible
What is the ceiling?
This Scala loop can read integers from a binary Filo blob at a rate
of 2 billion integers per second - single threaded:
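The original loop isn't preserved in this transcript; here is a minimal sketch in the same spirit (the buffer layout and names are assumptions, not Filo's actual format), reading fixed-width ints straight out of an off-heap buffer with no object allocation:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// A Filo-style scan sketch: fixed-width ints in a direct (off-heap) buffer.
object FiloStyleScan {
  val n = 1000000
  val buf = ByteBuffer.allocateDirect(n * 4).order(ByteOrder.LITTLE_ENDIAN)
  (0 until n).foreach(i => buf.putInt(i))

  // The hot loop: absolute reads, no objects created, no deserialization.
  def sum(): Long = {
    var total = 0L
    var i = 0
    while (i < n) {
      total += buf.getInt(i * 4)
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = println(sum())
}
```

The key property is that the scan is just pointer arithmetic over a contiguous buffer, which is what makes billions of ints per second plausible on one thread.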
Vectorization of Spark Queries
Process many elements from the same column at once, keep data
in L1/L2 cache.
Coming in Spark 1.4 through 1.6
Hot Column Caching in Tachyon
Has a "table" feature, originally designed for Shark
Keep hot columnar chunks in shared off-heap memory for fast reads
What's in the name?
Rich sweet layers of distributed, versioned database goodness
Apache Cassandra. Scale out with no SPOF. Cross-datacenter
replication. Proven storage and database technology.
Incrementally add a column or a few rows as a new version. Easily
control what versions to query. Roll back changes inexpensively.
Stream out new versions as continuous queries :)
Parquet-style storage layout
Retrieve select columns and minimize I/O for OLAP queries
Add a new column without having to copy the whole table
Vectorization and lazy/zero serialization for extreme performance
Built completely on the Typesafe Platform:
Scala 2.10 and SBT
Spark (including custom data source)
Akka Actors for rational scale-out concurrency
Futures for I/O
Phantom Cassandra client for reactive, type-safe C* I/O
Spark SQL Queries!
Read from and write to Spark DataFrames
Append/merge to FiloDB table from Spark Streaming
FiloDB vs Parquet
Comparable read performance - with lots of space to improve
Assuming co-located Spark and Cassandra
On localhost, both subsecond for simple queries (GDELT dataset)
FiloDB has more room to grow - due to hot column caching
and much less deserialization overhead
Lower memory requirement due to much smaller block sizes
Much better fit for IoT / Machine / Time-series applications
Limited support for types
array / set / map support not there, but will be added later
Where FiloDB Fits In
Use regular C* denormalized tables for OLTP and single-key lookups
Use FiloDB for the remaining ad-hoc or more complex analytical queries
Simplify your analytics infrastructure!
No need to export to Hadoop/Parquet/data warehouse.
Use Spark and C* for both OLAP and OLTP!
Perform ad-hoc OLAP analysis of your time-series, IoT data
Simplify your Lambda Architecture...
With Spark, Cassandra, and FiloDB
Ma, where did all the components go?
You mean I don't have to deal with Hadoop?
Use Cassandra as a front end to store IoT data first
Exactly-Once Ingestion from Kafka
New rows appended via Kafka
Writes are idempotent - no need to dedup!
Converted to columnar chunks on ingest and stored in C*
Only necessary columnar chunks are read into Spark for queries
You can help!
Send me your use cases for OLAP on Cassandra and Spark
Especially IoT and Geospatial
Email if you want to contribute
Thanks to the entire OSS community, but in particular:
Lee Mighdoll, Nest/Google
Rohit Rai and Satya B., Tuplejump
My colleagues at Socrata
If you want to go fast, go alone. If you want to go
far, go together.
-- African proverb
Automatic Columnar Conversion using a Custom Indexer
Write to Cassandra as you normally do
Custom indexer takes changes, merges and compacts them into
columnar chunks behind the scenes
Implementing Lambda is Hard
Use real-time pipeline backed by a KV store for new updates
Lots of moving parts
Key-value store, real time sys, batch, etc.
Need to run similar code in two places
Still need to deal with ingesting data to Parquet/HDFS
Need to reconcile queries against two different places