The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.
SnappyData is open sourced here: https://github.com/SnappyDataInc/snappydata
We also have a deep technical paper here: http://www.snappydata.io/snappy-industrial
We can be easily contacted on Slack, Gitter and more: http://www.snappydata.io/about#contactus
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
Intro to SnappyData Webinar
1. SnappyData
Unified Stream, interactive analytics in
a single in-memory cluster with Spark
www.snappydata.io
Jags Ramnarayan
CTO, Co-founder SnappyData
Feb 2016
2. SnappyData - an EMC/Pivotal spin out
● New Spark-based open source project started by Pivotal
GemFire founders+engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an
OLTP+OLAP database
www.snappydata.io
8. Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
9. Bottlenecks in the write path
Stream micro batches in
parallel from Kafka to each
Spark executor
Emit Key-Value pairs.
So, we can group By
Publisher, Geo
Execute GROUP BY … Expensive
Spark Shuffle …
Shuffle again in DB cluster …
data format changes …
serialization costs
10. Bottlenecks in the Write Path
• Aggregations –
GroupBy,
MapReduce
• Joins with other
streams,
Reference data
• Replication
(HA) in Fast
data store
Shuffle Costs
• Data models
across Spark
and FastData
store are
different
• Data flows
through too
many
processes
• In-JVM copying
Copying, Serialization Excessive copying in
Java based Scale out stores
11. Goal – Localize processing
• Can we Localize processing with state, avoid shuffle.
• Can Kafka, Spark partitions and the data store share the same
partitioning policy? ( partition by Advertiser,Geo )
-- Show cassandra, memSQL ingestion performance (maybe later section?)
• Embedded State : Spark’s native MapWithState, Apache Samza, Flink?
Good, scalable KV stores but IS THIS ENOUGH?
12. Impedance mismatch for Interactive Queries
On Ingested Ad Impressions we want to run queries like:
- Find total uniques for a certain AD grouped on geography,
month
- Impression trends for advertisers, or, detect outliers in
the bidding price
Unfortunately, in most KV oriented stores this is very inefficient
- Not suited for scans, aggregations, distributed joins on large volumes
- Memory utilization in KV store is poor
14. In-memory Columns but still slow?
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB
15. Can we use Statistical methods to shrink data?
• It is not always possible to store the data in full
Many applications (telecoms, ISPs, search engines) can’t keep
everything
• It is inconvenient to work with data in full
Just because we can, doesn’t mean we should
• It is faster to work with a compact summary
Better to explore data on a laptop than a cluster
Ref: Graham Cormode - Sampling for Big Data
Can we use statistical techniques to understand data, synthesize
something very small but still answer Analytical queries?
16. SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real-time
design center
- Low latency, HA,
concurrent
Vision: Drastically reduce the cost and
complexity in modern big data
Rapidly Maturing Matured over
13 years
17. SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real time operational Analytics – TBs in memory
RDB
Rows
Txn
Columnar
API
Stream processing
ODBC,
JDBC, REST
Spark -
Scala, Java,
Python, R
HDFS
AQP
First commercial project on Approximate
Query Processing(AQP)
MPP DB
Index
18. Tenets, Guiding Principles
● Memory is the new bottleneck for speed
● Memory densities will follow Moore’s law
● 100% Spark compatible - powerful, concise. Simplify runtime
● Aim for Google Search like speed for analytic queries
● Dramatically reduce costs associated with Analytics
Distributed systems are expensive – Esp. in production
19. Use Case Patterns
• Stream ingestion database for spark
• Process streams, transform, real-time scoring, store, query
• In-memory database for apps
• Highly concurrent apps, SQL cache, OLTP + OLAP
• Analytic caching pattern
• Caching for Analytics over any “Big data” store (esp MPP)
• Federate query between samples and backend
20. Why Spark
– Confluence of Streaming, interactive, batch
Unifies batch, streaming, interactive comp.
Easy to build sophisticated applications
Support iterative, graph-parallel algorithms
Powerful APIs in Scala, Python, Java
Spark
Spark
Streaming SQL
BlinkDB
GraphX MLlib
Streamin
g
Batch,
Interactive
Batch, Interactive
Interactiv
e
Data-parallel,
Iterative
Sophisticated algos.
Source: Spark summit presentation from Ion Stoica
22. Snappy Spark Cluster Deployment topologies
• Snappy store and Spark
Executor share the JVM
memory
• Reference based access –
zero copy
• SnappyStore is isolated but
use the same COLUMN
FORMAT AS SPARK for high
throughput
Unified Cluster
Split Cluster
23. Simple API – Spark Compatible
● Access Table as DataFrame
Catalog is automatically recovered
● Store RDD[T]/DataFrame can be
stored in SnappyData tables
● Access from Remote SQL clients
● Addtional API for updates,
inserts, deletes
//Save a dataFrame using the Snappy or spark context …
context.createExternalTable(”T1", "ROW", myDataFrame.schema,
props );
//save using DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(pro
ps).saveAsTable(”T1");
val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of DataFrame APIs … >
24. Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
<column definition>
) USING ‘JDBC | ROW | COLUMN ’
OPTIONS (
COLOCATE_WITH 'table_name', // Default none
PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default
REDUNDANCY '1' , // Manage HA
PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
// Empty string will map to default disk store.
OFFHEAP "true | false"
EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
…..
[AS select_statement];
25. Simple to Ingest Streams using SQL
Consume from stream
Transform raw data
Continuous Analytics
Ingest into in-memory Store
Overflow table to HDFS
Create stream table AdImpressionLog
(<Columns>) using directkafka_stream options (
<socket endpoints>
"topics 'adnetwork-topic’ “,
"rowConverter ’ AdImpressionLogAvroDecoder’ )
streamingContext.registerCQ(
"select publisher, geo, avg(bid) as avg_bid, count(*) imps,
count(distinct(cookie)) uniques from AdImpressionLog
window (duration '2' seconds, slide '2' seconds)
where geo != 'unknown' group by publisher, geo”)// Register CQ
.foreachDataFrame(df => {
df.write.format("column").mode(SaveMode.Appen
d)
.saveAsTable("adImpressions")
29. Why is ingestion, querying fast?
Linearly scale with partition pruning
Input queue,
Stream, IMDB
all share the
same
partitioning
strategy
30. How does it scale with concurrency?
● Parallel query engine skips Spark SQL
scheduling for low latency queries
● Column tables
For fast scan/aggregations
Also automatically compressed
● Row tables
Fast key based or selective queries
Can have any number of secondary indices
31. How does it scale with concurrency?
● Distributed shuffle for joins, ordering .. expensive
● Two techniques to minimize this
1) Colocation of related tables. All related records
are collocated.
For instance, ‘Users’ table partitioned across nodes and
all related ‘Ad impressions” can be collocated on same partition.
The parallel query would execute the Join locally on each partition
2) Replication of ‘Dimension’ tables
Joins to dimension tables are always localized
32. Low latency Interactive Analytic Queries
– Exact or Approximate
Select avg(volume),symbol from T1 where <time range> group by symbol
Select avg(volume), symbol from T1 where <time range> group by symbol
With error 0.1 confidence 0.8
Speed/Accuracy tradeoff
Error
30 mins
Time to Execute on
Entire Dataset
Interactive
Queries
2 sec
Execution Time (Sample Size)
32
100 secs
2 secs
33. Key feature: Synopses Data
● Maintain stratified samples
○ Intelligent sampling to keep error bounds low
● Probabilistic data
○ TopK for time series (using time aggregation CMS, item
aggregation)
○ Histograms, HyperLogLog, Bloom Filters, Wavelets
CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
BASETABLE ‘table_name’ // source column table or stream table
[ SAMPLINGMETHOD "stratified | uniform" ]
STRATA name (
QCS (“comma-separated-column-names”)
[ FRACTION “frac” ]
),+ // one or more QCS
35. How do we extend Spark for Real Time?
• Spark Executors are long
running. Driver failure
doesn’t shutdown
Executors
• Driver HA – Drivers run
“Managed” with standby
secondary
• Data HA – Consensus based
clustering integrated for
eager replication
36. How do we extend Spark for Real Time?
• By pass scheduler for low
latency SQL
• Deep integration with
Spark Catalyst(SQL) –
collocation optimizations,
indexing use, etc
• Full SQL support –
Persistent Catalog,
Transaction, DML
37. Performance – Spark vs Snappy (TPC-H)
See ACM Sigmod 2016 paper for details
Available on snappydata.io blogs
39. Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
● Much faster
○ compressed data managed in distributed memory in columnar
form reduces volume and is much more responsive
40. www.snappydata.io
SnappyData is Open Source
● Available for download on Github today
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ linkedin: www.linkedin.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
43. Table can be partitioned or replicated
Replicated
Table
Partitioned
Table
(Buckets A-H) Replicated
Table
Partitioned
Table
(Buckets I-P)
consistent replica on each node
Partition
Replica
(Buckets A-H)
Replicated
Table
Partitioned
Table
(Buckets Q-W)Partition
Replica
(Buckets I-P)
Data partitioned with one or more replicas
Use partitioned tables for large fact tables, Replicated for small dimension tables
44. Spark Core
Micro-batch
Streaming
Spark SQL Catalyst
T
X
N
OLAP
Job
Scheduler OLAP
Query
OLTP
Query
AQP
P2P Cluster Replication Svc
RowColumn
Table
Index
Sample
Table
Shared nothing logs
Spark
program
JDBC
ODBC
SnappyData Components
Notes de l'éditeur
There is a reciprocal relationship with Spark RDDs/DataFrames. any table is visible as a DataFrame and vice versa. Hence, all the spark APIs, tranformations can also be applied to snappy managed tables.
For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a snappy table like shown in the example.
One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema. i.e. turn into into a DataFrame and store it.
The SQL dialect will be Spark SQL ++. i.e. we are extending SQL to be much more compliant with standard SQL.
A number of the extensions that dictate things like HA, disk persistence, etc are all specified through OPTIONS in spark SQL.
And, of course, the whole point behind colocation is to linear scale with minimal or even no shuffling. So, for instance, when using Kafka, all three components - Kafka, native RDD in Spark and Table in Snappy can all share the same partitioning strategy.
As an example, in our telco case, all records associated with a subscriber can be colocated onto the same node - the queue, spark procesing of partitions and related reference data in Snappy store.
When it comes to interactive analytics a lot is exploratory in nature. Folks are looking trends for different time periods, studying outlier patterns, etc. Unfortunately, like pointed out before, analytic queries can take a llong time even when in-memory. We want such exploratory analytics to ultimately happen at google like speeds. Don’t break the speed of thought.
In many cases, do we really need a precise answer? like watching a trend on a visualization tool.
We are thowing linear improvements to what seems like an exponential problem - like in some IoT scenarios.
Stratified sampling allows the user to more intelligently sample so we can answer queries with a very small fraction of the data with good accuracy.
What we do is allow the user to create one or more stratified samples on some “base” table data. The base table maybe all in-memory also or more often than not, could reside in HDFS.
Manage data(mutable) in spark executors (store memory mgr works with Block mgr)
Make executors long lived
Which means, spark drivers run de-coupled .. they can fail.
- managed Drivers
- Selective scheduling
- Deeply integrate with query engine for optimizations
- Full SQL support: including transactions, DML, catalog integration
Manage data(mutable) in spark executors (store memory mgr works with Block mgr)
Make executors long lived
Which means, spark drivers run de-coupled .. they can fail.
- managed Drivers
- Selective scheduling
- Deeply integrate with query engine for optimizations
- Full SQL support: including transactions, DML, catalog integration
By default, we start the Spark cluster in an “embedded” mode. i.e. the in-memory store is fully collocated and in the same process space. We had to change the spark Block manager so both Gem as well as spark shares the same space for tables, cached RDDs, shuffle space, sorting, etc. This space can extend from JVM heap to off-heap.
GemFire proactively monitors the JVM “old gen” so never goes beyond a certain critical threshold. i.e. we do a number of things so you don’t run OOM. Hoping to contribute this back to Spark.
And, when running in the embedded mode we also make sure the executors are long lived. i.e. life cycle for these nodes are no longer tied to the Driver availability. Everything Spark does it cleaned up as expected though.
Partitioning strategy, by default, is the same as Spark. We try to do uniform random distribution of the records across all the nodes designated to host a partitioned table. Any table can have one or more replicas. Replicas are always consistent with each other - sync writes. We parallely send the write to each replica and wait for ACKs. If a ACK is not received we start SUSPECT processing.
Replicated tables, by default, are replicated to each node. Replicas are guaranteed to be consistent when failures occur i.e. the failed node rejoins. How do you recreate the state of the replica while thousands of other concurrent writes are in progress is a hard problem to solve.