How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE.
data.edmonton.ca finances.worldbank.org data.cityofchicago.org
data.seattle.gov data.oregon.gov data.wa.gov
www.metrochicagodata.org data.cityofboston.gov
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov
data.nola.gov data.illinois.gov data.colorado.gov
data.austintexas.gov data.undp.org www.opendatanyc.com
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it
data.montgomerycountymd.gov data.cityofnewyork.us
data.acgov.org data.baltimorecity.gov data.energystar.gov
data.somervillema.gov data.maryland.gov data.taxpayer.net
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
5. BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Rollup tables - 30 million rows per day
6. BIG DATA AT SOCRATA
Hundreds of datasets, each one up to 30 million rows
Customer demand for billion row datasets
7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A
YEAR'S WORTH OF DATA?
Flexible - complex queries included
Sometimes you can't denormalize your data enough
Fast - interactive speeds
8. RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Complex and expensive to improve performance (e.g. rollup tables)
9. OLAP CUBES?
Materialize a summary for every possible combination
Too complicated and brittle
Takes forever to compute
Explodes storage and memory
13. APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort, etc.
SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
THE hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it
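The functional transforms above have the same shape on plain Scala collections as on Spark RDDs. A minimal single-machine sketch (the `Ping` record and the latency threshold are illustrative assumptions, not from the talk):

```scala
// Sketch: the filter -> groupBy -> aggregate chain from the slide,
// using plain Scala collections. On a cluster, the same chain would
// run over a Spark RDD with the identical method names.
object TransformSketch {
  case class Ping(customer: String, latencyMs: Int)  // hypothetical record

  def avgSlowLatency(pings: Seq[Ping]): Map[String, Int] =
    pings
      .filter(_.latencyMs > 50)                     // keep slow pings only
      .groupBy(_.customer)                          // Map[String, Seq[Ping]]
      .map { case (c, ps) => c -> ps.map(_.latencyMs).sum / ps.size }
}
```

The same result would then feed directly into MLlib (e.g. a logistic regression) without leaving the platform, which is the "ONE platform" point above.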
19. SEPARATE STORAGE AND QUERY LAYERS
Combine best-of-breed storage and query platforms
Take full advantage of the evolution of each
Storage handles replication for availability
Query layer can replicate data for scaling read concurrency - independently!
No existing generic query engine for Spark when we started (Shark was in its infancy, had no indexes, etc.), so we built our own
For every row, need to extract out the needed columns
Ability to select arbitrary columns means using Seq[Any] - no type safety
Boxing makes integer aggregation very expensive and memory-inefficient
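The boxing problem above can be sketched in a few lines: a generic row engine sees every value as `Any` and must cast and unbox per row, while a columnar engine keeps a primitive `Array[Int]` and sums it directly (both functions here are illustrative, not the engine's actual code):

```scala
object BoxingSketch {
  // Row-based: every value is a boxed java.lang.Object behind Any,
  // so summing one integer column costs a cast + unbox per row.
  def sumRowBased(rows: Seq[Seq[Any]], colIdx: Int): Int =
    rows.map(_(colIdx).asInstanceOf[Int]).sum

  // Column-based: one primitive array, no per-value allocation,
  // a tight loop the JIT can optimize.
  def sumColumnar(col: Array[Int]): Int = {
    var total = 0
    var i = 0
    while (i < col.length) { total += col(i); i += 1 }
    total
  }
}
```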
28. TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:

Name     Age
Barak    46
Hillary  66

Each row is stored contiguously. All columns in row 2 come after row 1.
30. COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.

SchemaCF
Row key  Type
Name     StringDict
Age      Int

DataCF
Row key  0   1
Name     0   1
Age      46  66
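The layout above can be modeled with plain maps to show why it pays off: each logical column becomes one physical Cassandra row in DataCF, so a query over Age reads a single row and never touches Name. (This is a toy model of the schema, not the engine's storage code.)

```scala
object LayoutSketch {
  // SchemaCF: column name -> serialized type
  val schemaCF: Map[String, String] =
    Map("Name" -> "StringDict", "Age" -> "Int")

  // DataCF: row key = column name; columns = row numbers 0, 1, ...
  // (Name holds dictionary codes, per the StringDict type above.)
  val dataCF: Map[String, Map[Int, Any]] = Map(
    "Name" -> Map(0 -> 0, 1 -> 1),
    "Age"  -> Map(0 -> 46, 1 -> 66)
  )

  // An Age-only query deserializes exactly one physical row.
  def readColumn(name: String): Seq[Any] =
    dataCF(name).toSeq.sortBy(_._1).map(_._2)
}
```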
31. ADVANTAGES OF COLUMNAR STORAGE
Compression
Dictionary compression - HUGE savings for low-cardinality string columns
RLE (run-length encoding)
Reduce I/O
Only columns needed for the query are loaded from disk
Can keep strong types in memory, avoiding boxing
Batch multiple rows in one cell for efficiency
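Dictionary compression, the biggest win above, is simple to sketch: store each distinct string once and replace the column with small integer codes into that dictionary (illustrative helper, not the engine's codec):

```scala
object DictSketch {
  // Encode a low-cardinality string column as (dictionary, int codes).
  def encode(col: Seq[String]): (Vector[String], Array[Int]) = {
    val dict  = col.distinct.toVector           // each unique value stored once
    val index = dict.zipWithIndex.toMap         // value -> code
    (dict, col.map(index).toArray)              // column body is now small ints
  }

  // Decoding is a plain lookup, cheap enough to defer until needed.
  def decode(dict: Vector[String], codes: Array[Int]): Seq[String] =
    codes.toSeq.map(dict)
}
```

For a million-row column with a handful of distinct values, the column body shrinks to an int (or smaller) per row, which is where the "HUGE savings" come from.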
32. ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating a column of data
Take advantage of CPU/GPU vector instructions for ints / doubles
Avoid row-ifying until the last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
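"Derive computed columns" and "avoid row-ifying" go together: a derived column is one tight loop over primitive arrays, and rows are only assembled for the final answer. A sketch (the price/quantity columns are made-up inputs):

```scala
object ComputedColSketch {
  // Derived column: computed array-at-a-time from primitive inputs,
  // no row objects allocated - cache-friendly and vectorizable.
  def revenue(price: Array[Double], qty: Array[Int]): Array[Double] =
    Array.tabulate(price.length)(i => price(i) * qty(i))

  // Row-ify only at the very end, for the single result returned.
  def bestRow(price: Array[Double], qty: Array[Int]): (Int, Double) = {
    val rev = revenue(price, qty)
    val (best, idx) = rev.zipWithIndex.maxBy(_._1)
    (idx, best)
  }
}
```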
33. COLUMNAR QUERY ENGINE VS ROW-BASED IN
SCALA
Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group bys on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations
36. DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
HA master, that is.
cassandra-driver-spark
37. SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet now; Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala... take that, Hive UDFs!
Integrates well with MLBase, Scala/Java/Python
39. GETTING TO A BILLION ROWS / SEC
Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core.
50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
~5-10 additional cores budgeted for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology
The above is a custom solution, NOT Spark SQL.
Look for integration with Spark SQL for a proper solution
40. LESSONS
Extremely fast distributed querying for these use cases:
Data doesn't change much (and only in bulk)
Analytical queries over a subset of columns
Focused on numerical aggregations
Small numbers of group bys, limited network interchange of data
Spark is a bit rough around the edges, but evolving fast
Concurrent queries are a frontier with Spark. Use additional Spark contexts.
42. SOME COLUMNAR ALTERNATIVES
MonetDB and Infobright - true columnar stores (storage + querying)
cstore_fdw for Postgres - columnar storage only
VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes)
Google BigQuery - columnar cloud database, Dremel-based
Amazon Redshift