How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE.
data.edmonton.ca finances.worldbank.org data.cityofchicago.org
data.seattle.gov data.oregon.gov data.wa.gov
www.metrochicagodata.org data.cityofboston.gov
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov
data.nola.gov data.illinois.gov data.colorado.gov
data.austintexas.gov data.undp.org www.opendatanyc.com
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it
data.montgomerycountymd.gov data.cityofnewyork.us
data.acgov.org data.baltimorecity.gov data.energystar.gov
data.somervillema.gov data.maryland.gov data.taxpayer.net
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
5. BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Rollup tables - 30 million rows per day
6. BIG DATA AT SOCRATA
Hundreds of datasets, each one up to 30 million rows
Customer demand for billion row datasets
7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A
YEAR'S WORTH OF DATA?
Flexible - complex queries included
Sometimes you can't denormalize your data enough
Fast - interactive speeds
8. RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Complex and expensive to improve performance (e.g. rollup tables)
9. OLAP CUBES?
Materialize a summary for every possible combination
Too complicated and brittle
Takes forever to compute
Explodes storage and memory
13. APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort, etc.
SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
THE hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it
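The functional transforms above have the same shape on plain Scala collections as on Spark RDDs. A minimal single-machine sketch (the `Ping` record and the latency threshold are illustrative assumptions, not from the talk):

```scala
// Sketch: the filter -> groupBy -> aggregate chain from the slide,
// using plain Scala collections. On a cluster, the same chain would
// run over a Spark RDD with the identical method names.
object TransformSketch {
  case class Ping(customer: String, latencyMs: Int)  // hypothetical record

  def avgSlowLatency(pings: Seq[Ping]): Map[String, Int] =
    pings
      .filter(_.latencyMs > 50)                     // keep slow pings only
      .groupBy(_.customer)                          // Map[String, Seq[Ping]]
      .map { case (c, ps) => c -> ps.map(_.latencyMs).sum / ps.size }
}
```

The same result would then feed directly into MLlib (e.g. a logistic regression) without leaving the platform, which is the "ONE platform" point above.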
19. SEPARATE STORAGE AND QUERY LAYERS
Combine best-of-breed storage and query platforms
Take full advantage of the evolution of each
Storage handles replication for availability
Query layer can replicate data for scaling read concurrency - independently!
No existing generic query engine for Spark when we started (Shark was in its infancy, had no indexes, etc.), so we built our own
For every row, need to extract out the needed columns
Ability to select arbitrary columns means using Seq[Any] - no type safety
Boxing makes integer aggregation very expensive and memory-inefficient
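The boxing problem above can be sketched in a few lines: a generic row engine sees every value as `Any` and must cast and unbox per row, while a columnar engine keeps a primitive `Array[Int]` and sums it directly (both functions here are illustrative, not the engine's actual code):

```scala
object BoxingSketch {
  // Row-based: every value is a boxed java.lang.Object behind Any,
  // so summing one integer column costs a cast + unbox per row.
  def sumRowBased(rows: Seq[Seq[Any]], colIdx: Int): Int =
    rows.map(_(colIdx).asInstanceOf[Int]).sum

  // Column-based: one primitive array, no per-value allocation,
  // a tight loop the JIT can optimize.
  def sumColumnar(col: Array[Int]): Int = {
    var total = 0
    var i = 0
    while (i < col.length) { total += col(i); i += 1 }
    total
  }
}
```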
28. TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:

Name     Age
Barak    46
Hillary  66

Each row is stored contiguously. All columns in row 2 come after row 1.
30. COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.

SchemaCF
Row key  Type
Name     StringDict
Age      Int

DataCF
Row key  0   1
Name     0   1
Age      46  66
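The layout above can be modeled with plain maps to show why it pays off: each logical column becomes one physical Cassandra row in DataCF, so a query over Age reads a single row and never touches Name. (This is a toy model of the schema, not the engine's storage code.)

```scala
object LayoutSketch {
  // SchemaCF: column name -> serialized type
  val schemaCF: Map[String, String] =
    Map("Name" -> "StringDict", "Age" -> "Int")

  // DataCF: row key = column name; columns = row numbers 0, 1, ...
  // (Name holds dictionary codes, per the StringDict type above.)
  val dataCF: Map[String, Map[Int, Any]] = Map(
    "Name" -> Map(0 -> 0, 1 -> 1),
    "Age"  -> Map(0 -> 46, 1 -> 66)
  )

  // An Age-only query deserializes exactly one physical row.
  def readColumn(name: String): Seq[Any] =
    dataCF(name).toSeq.sortBy(_._1).map(_._2)
}
```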
31. ADVANTAGES OF COLUMNAR STORAGE
Compression
Dictionary compression - HUGE savings for low-cardinality string columns
RLE (run-length encoding)
Reduce I/O
Only columns needed for the query are loaded from disk
Can keep strong types in memory, avoiding boxing
Batch multiple rows in one cell for efficiency
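Dictionary compression, the biggest win above, is simple to sketch: store each distinct string once and replace the column with small integer codes into that dictionary (illustrative helper, not the engine's codec):

```scala
object DictSketch {
  // Encode a low-cardinality string column as (dictionary, int codes).
  def encode(col: Seq[String]): (Vector[String], Array[Int]) = {
    val dict  = col.distinct.toVector           // each unique value stored once
    val index = dict.zipWithIndex.toMap         // value -> code
    (dict, col.map(index).toArray)              // column body is now small ints
  }

  // Decoding is a plain lookup, cheap enough to defer until needed.
  def decode(dict: Vector[String], codes: Array[Int]): Seq[String] =
    codes.toSeq.map(dict)
}
```

For a million-row column with a handful of distinct values, the column body shrinks to an int (or smaller) per row, which is where the "HUGE savings" come from.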
32. ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating a column of data
Take advantage of CPU/GPU vector instructions for ints / doubles
Avoid row-ifying until the last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
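"Derive computed columns" and "avoid row-ifying" go together: a derived column is one tight loop over primitive arrays, and rows are only assembled for the final answer. A sketch (the price/quantity columns are made-up inputs):

```scala
object ComputedColSketch {
  // Derived column: computed array-at-a-time from primitive inputs,
  // no row objects allocated - cache-friendly and vectorizable.
  def revenue(price: Array[Double], qty: Array[Int]): Array[Double] =
    Array.tabulate(price.length)(i => price(i) * qty(i))

  // Row-ify only at the very end, for the single result returned.
  def bestRow(price: Array[Double], qty: Array[Int]): (Int, Double) = {
    val rev = revenue(price, qty)
    val (best, idx) = rev.zipWithIndex.maxBy(_._1)
    (idx, best)
  }
}
```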
33. COLUMNAR QUERY ENGINE VS ROW-BASED IN
SCALA
Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group bys on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations
36. DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
HA master, that is.
cassandra-driver-spark
37. SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet now; Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala... take that, Hive UDFs!
Integrates well with MLBase, Scala/Java/Python
39. GETTING TO A BILLION ROWS / SEC
Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core.
50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
~5-10 additional cores budgeted for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology
The above is a custom solution, NOT Spark SQL.
Look for integration with Spark SQL for a proper solution
40. LESSONS
Extremely fast distributed querying for these use cases:
Data doesn't change much (and only in bulk)
Analytical queries over a subset of columns
Focused on numerical aggregations
Small numbers of group bys, limited network interchange of data
Spark is a bit rough around the edges, but evolving fast
Concurrent queries are a frontier with Spark. Use additional Spark contexts.
42. SOME COLUMNAR ALTERNATIVES
MonetDB and Infobright - true columnar stores (storage + querying)
cstore_fdw for Postgres - columnar storage only
VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes)
Google BigQuery - columnar cloud database, Dremel-based
Amazon Redshift