SlideShare une entreprise Scribd logo
1  sur  44
SnappyData
Unified Stream, interactive analytics in
a single in-memory cluster with Spark
www.snappydata.io
Jags Ramnarayan
CTO, Co-founder SnappyData
Feb 2016
SnappyData - an EMC/Pivotal spin out
● New Spark-based open source project started by Pivotal
GemFire founders+engineers
● Decades of in-memory data management experience
● Focus on real-time, operational analytics: Spark inside an
OLTP+OLAP database
www.snappydata.io
Lambda Architecture (LA) for Analytics
Is Lambda Complex ?
Impala,
HBase
Cassandra, Redis, ..
Application
Is Lambda Complex ?
Impala,
HBase
Cassandra, Redis, ..
Application
• Application has to federate queries?
• Complex Application – deal with multiple
data models? Disparate APIs?
• Slow?
• Expensive to maintain?
Can we simplify, optimize?
Can the batch and speed layer use a single, unified serving DB?
Deeper Look – Ad Impression Analytics
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
Stream micro batches in
parallel from Kafka to each
Spark executor
Emit Key-Value pairs.
So, we can group By
Publisher, Geo
Execute GROUP BY … Expensive
Spark Shuffle …
Shuffle again in DB cluster …
data format changes …
serialization costs
Bottlenecks in the Write Path
• Aggregations –
GroupBy,
MapReduce
• Joins with other
streams,
Reference data
• Replication
(HA) in Fast
data store
Shuffle Costs
• Data models
across Spark
and FastData
store are
different
• Data flows
through too
many
processes
• In-JVM copying
Copying, Serialization Excessive copying in
Java based Scale out stores
Goal – Localize processing
• Can we Localize processing with state, avoid shuffle.
• Can Kafka, Spark partitions and the data store share the same
partitioning policy? ( partition by Advertiser,Geo )
-- Show cassandra, memSQL ingestion performance (maybe later section?)
• Embedded State : Spark’s native MapWithState, Apache Samza, Flink?
Good, scalable KV stores but IS THIS ENOUGH?
Impedance mismatch for Interactive Queries
On Ingested Ad Impressions we want to run queries like:
- Find total uniques for a certain AD grouped on geography,
month
- Impression trends for advertisers, or, detect outliers in
the bidding price
Unfortunately, in most KV oriented stores this is very inefficient
- Not suited for scans, aggregations, distributed joins on large volumes
- Memory utilization in KV store is poor
Why columnar storage?
In-memory Columns but still slow?
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB
Can we use Statistical methods to shrink data?
• It is not always possible to store the data in full
Many applications (telecoms, ISPs, search engines) can’t keep
everything
• It is inconvenient to work with data in full
Just because we can, doesn’t mean we should
• It is faster to work with a compact summary
Better to explore data on a laptop than a cluster
Ref: Graham Cormode - Sampling for Big Data
Can we use statistical techniques to understand data, synthesize
something very small but still answer Analytical queries?
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real-time
design center
- Low latency, HA,
concurrent
Vision: Drastically reduce the cost and
complexity in modern big data
Rapidly Maturing Matured over
13 years
SnappyData: A new approach
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real time operational Analytics – TBs in memory
RDB
Rows
Txn
Columnar
API
Stream processing
ODBC,
JDBC, REST
Spark -
Scala, Java,
Python, R
HDFS
AQP
First commercial project on Approximate
Query Processing(AQP)
MPP DB
Index
Tenets, Guiding Principles
● Memory is the new bottleneck for speed
● Memory densities will follow Moore’s law
● 100% Spark compatible - powerful, concise. Simplify runtime
● Aim for Google Search like speed for analytic queries
● Dramatically reduce costs associated with Analytics
Distributed systems are expensive – Esp. in production
Use Case Patterns
• Stream ingestion database for spark
• Process streams, transform, real-time scoring, store, query
• In-memory database for apps
• Highly concurrent apps, SQL cache, OLTP + OLAP
• Analytic caching pattern
• Caching for Analytics over any “Big data” store (esp MPP)
• Federate query between samples and backend
Why Spark
– Confluence of Streaming, interactive, batch
Unifies batch, streaming, interactive comp.
Easy to build sophisticated applications
Support iterative, graph-parallel algorithms
Powerful APIs in Scala, Python, Java
Spark
Spark
Streaming SQL
BlinkDB
GraphX MLlib
Streamin
g
Batch,
Interactive
Batch, Interactive
Interactiv
e
Data-parallel,
Iterative
Sophisticated algos.
Source: Spark summit presentation from Ion Stoica
Spark Cluster
Snappy Spark Cluster Deployment topologies
• Snappy store and Spark
Executor share the JVM
memory
• Reference based access –
zero copy
• SnappyStore is isolated but
use the same COLUMN
FORMAT AS SPARK for high
throughput
Unified Cluster
Split Cluster
Simple API – Spark Compatible
● Access Table as DataFrame
Catalog is automatically recovered
● Store RDD[T]/DataFrame can be
stored in SnappyData tables
● Access from Remote SQL clients
● Addtional API for updates,
inserts, deletes
//Save a dataFrame using the Snappy or spark context …
context.createExternalTable(”T1", "ROW", myDataFrame.schema,
props );
//save using DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(pro
ps).saveAsTable(”T1");
val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of DataFrame APIs … >
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
<column definition>
) USING ‘JDBC | ROW | COLUMN ’
OPTIONS (
COLOCATE_WITH 'table_name', // Default none
PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default
REDUNDANCY '1' , // Manage HA
PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
// Empty string will map to default disk store.
OFFHEAP "true | false"
EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
…..
[AS select_statement];
Simple to Ingest Streams using SQL
Consume from stream
Transform raw data
Continuous Analytics
Ingest into in-memory Store
Overflow table to HDFS
Create stream table AdImpressionLog
(<Columns>) using directkafka_stream options (
<socket endpoints>
"topics 'adnetwork-topic’ “,
"rowConverter ’ AdImpressionLogAvroDecoder’ )
streamingContext.registerCQ(
"select publisher, geo, avg(bid) as avg_bid, count(*) imps,
count(distinct(cookie)) uniques from AdImpressionLog
window (duration '2' seconds, slide '2' seconds)
where geo != 'unknown' group by publisher, geo”)// Register CQ
.foreachDataFrame(df => {
df.write.format("column").mode(SaveMode.Appen
d)
.saveAsTable("adImpressions")
AdImpression Demo
Spark, SQL Code Walkthrough, interactive SQL
AdImpression Demo
Performance comparison to Cassandra,
MemSQL - TBD
AdImpression Ingest Performance
Loading from parquet files using 4 cores Stream ingest
(Kafka + Spark streaming to store using 1 core)
Why is ingestion, querying fast?
Linearly scale with partition pruning
Input queue,
Stream, IMDB
all share the
same
partitioning
strategy
How does it scale with concurrency?
● Parallel query engine skips Spark SQL
scheduling for low latency queries
● Column tables
For fast scan/aggregations
Also automatically compressed
● Row tables
Fast key based or selective queries
Can have any number of secondary indices
How does it scale with concurrency?
● Distributed shuffle for joins, ordering .. expensive
● Two techniques to minimize this
1) Colocation of related tables. All related records
are collocated.
For instance, ‘Users’ table partitioned across nodes and
all related ‘Ad impressions” can be collocated on same partition.
The parallel query would execute the Join locally on each partition
2) Replication of ‘Dimension’ tables
Joins to dimension tables are always localized
Low latency Interactive Analytic Queries
– Exact or Approximate
Select avg(volume),symbol from T1 where <time range> group by symbol
Select avg(volume), symbol from T1 where <time range> group by symbol
With error 0.1 confidence 0.8
Speed/Accuracy tradeoff
Error
30 mins
Time to Execute on
Entire Dataset
Interactive
Queries
2 sec
Execution Time (Sample Size)
32
100 secs
2 secs
Key feature: Synopses Data
● Maintain stratified samples
○ Intelligent sampling to keep error bounds low
● Probabilistic data
○ TopK for time series (using time aggregation CMS, item
aggregation)
○ Histograms, HyperLogLog, Bloom Filters, Wavelets
CREATE SAMPLE TABLE sample-table-name USING columnar
OPTIONS (
BASETABLE ‘table_name’ // source column table or stream table
[ SAMPLINGMETHOD "stratified | uniform" ]
STRATA name (
QCS (“comma-separated-column-names”)
[ FRACTION “frac” ]
),+ // one or more QCS
Unified Cluster Architecture
How do we extend Spark for Real Time?
• Spark Executors are long
running. Driver failure
doesn’t shutdown
Executors
• Driver HA – Drivers run
“Managed” with standby
secondary
• Data HA – Consensus based
clustering integrated for
eager replication
How do we extend Spark for Real Time?
• By pass scheduler for low
latency SQL
• Deep integration with
Spark Catalyst(SQL) –
collocation optimizations,
indexing use, etc
• Full SQL support –
Persistent Catalog,
Transaction, DML
Performance – Spark vs Snappy (TPC-H)
See ACM Sigmod 2016 paper for details
Available on snappydata.io blogs
Performance – Snappy vs in-memoryDB (YCSB)
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
● Much faster
○ compressed data managed in distributed memory in columnar
form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Available for download on Github today
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ linkedin: www.linkedin.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com
EXTRAS
www.snappydata.io
Colocated row/column Tables in Spark
Row
Table
Column
Table
Spark
Executor
TASK
Spark Block Manager
Stream
processing
Row
Table
Column
Table
Spark
Executor
TASK
Spark Block Manager
Stream
processing
Row
Table
Column
Table
Spark
Executor
TASK
Spark Block Manager
Stream
processing
● Spark Executors are long lived and shared across multiple apps
● Snappy Memory Mgr and Spark Block Mgr integrated
Table can be partitioned or replicated
Replicated
Table
Partitioned
Table
(Buckets A-H) Replicated
Table
Partitioned
Table
(Buckets I-P)
consistent replica on each node
Partition
Replica
(Buckets A-H)
Replicated
Table
Partitioned
Table
(Buckets Q-W)Partition
Replica
(Buckets I-P)
Data partitioned with one or more replicas
Use partitioned tables for large fact tables, Replicated for small dimension tables
Spark Core
Micro-batch
Streaming
Spark SQL Catalyst
T
X
N
OLAP
Job
Scheduler OLAP
Query
OLTP
Query
AQP
P2P Cluster Replication Svc
RowColumn
Table
Index
Sample
Table
Shared nothing logs
Spark
program
JDBC
ODBC
SnappyData Components

Contenu connexe

Tendances

SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneDouglas Moore
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateCloudera, Inc.
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingDataWorks Summit
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 

Tendances (20)

SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 

Similaire à Intro to SnappyData Webinar

Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overviewKognitio
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Distributed caching-computing v3.8
Distributed caching-computing v3.8Distributed caching-computing v3.8
Distributed caching-computing v3.8Rahul Gupta
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseTim Vaillancourt
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Designing for Concurrency
Designing for ConcurrencyDesigning for Concurrency
Designing for ConcurrencySusan Potter
 
Building a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsBuilding a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsAlluxio, Inc.
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftLars Kamp
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Community
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 

Similaire à Intro to SnappyData Webinar (20)

Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overview
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Spark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef HabdankSpark Summit EU talk by Josef Habdank
Spark Summit EU talk by Josef Habdank
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Distributed caching-computing v3.8
Distributed caching-computing v3.8Distributed caching-computing v3.8
Distributed caching-computing v3.8
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Nike tech talk.2
Nike tech talk.2Nike tech talk.2
Nike tech talk.2
 
Designing for Concurrency
Designing for ConcurrencyDesigning for Concurrency
Designing for Concurrency
 
Building a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsBuilding a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloads
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon Redshift
 
Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data Ceph Day San Jose - Object Storage for Big Data
Ceph Day San Jose - Object Storage for Big Data
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 

Dernier

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 

Dernier (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Intro to SnappyData Webinar

  • 1. SnappyData Unified Stream, interactive analytics in a single in-memory cluster with Spark www.snappydata.io Jags Ramnarayan CTO, Co-founder SnappyData Feb 2016
  • 2. SnappyData - an EMC/Pivotal spin out ● New Spark-based open source project started by Pivotal GemFire founders+engineers ● Decades of in-memory data management experience ● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database www.snappydata.io
  • 3. Lambda Architecture (LA) for Analytics
  • 4. Is Lambda Complex ? Impala, HBase Cassandra, Redis, .. Application
  • 5. Is Lambda Complex ? Impala, HBase Cassandra, Redis, .. Application • Application has to federate queries? • Complex Application – deal with multiple data models? Disparate APIs? • Slow? • Expensive to maintain?
  • 6. Can we simplify, optimize? Can the batch and speed layer use a single, unified serving DB?
  • 7. Deeper Look – Ad Impression Analytics
  • 8. Ad Impression Analytics Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
  • 9. Bottlenecks in the write path Stream micro batches in parallel from Kafka to each Spark executor Emit Key-Value pairs. So, we can group By Publisher, Geo Execute GROUP BY … Expensive Spark Shuffle … Shuffle again in DB cluster … data format changes … serialization costs
  • 10. Bottlenecks in the Write Path • Aggregations – GroupBy, MapReduce • Joins with other streams, Reference data • Replication (HA) in Fast data store Shuffle Costs • Data models across Spark and FastData store are different • Data flows through too many processes • In-JVM copying Copying, Serialization Excessive copying in Java based Scale out stores
  • 11. Goal – Localize processing • Can we Localize processing with state, avoid shuffle. • Can Kafka, Spark partitions and the data store share the same partitioning policy? ( partition by Advertiser,Geo ) -- Show cassandra, memSQL ingestion performance (maybe later section?) • Embedded State : Spark’s native MapWithState, Apache Samza, Flink? Good, scalable KV stores but IS THIS ENOUGH?
  • 12. Impedance mismatch for Interactive Queries On Ingested Ad Impressions we want to run queries like: - Find total uniques for a certain AD grouped on geography, month - Impression trends for advertisers, or, detect outliers in the bidding price Unfortunately, in most KV oriented stores this is very inefficient - Not suited for scans, aggregations, distributed joins on large volumes - Memory utilization in KV store is poor
  • 14. In-memory Columns but still slow? SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge ; total of 342 GB
  • 15. Can we use Statistical methods to shrink data? • It is not always possible to store the data in full Many applications (telecoms, ISPs, search engines) can’t keep everything • It is inconvenient to work with data in full Just because we can, doesn’t mean we should • It is faster to work with a compact summary Better to explore data on a laptop than a cluster Ref: Graham Cormode - Sampling for Big Data Can we use statistical techniques to understand data, synthesize something very small but still answer Analytical queries?
  • 16. SnappyData: A new approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-time design center - Low latency, HA, concurrent Vision: Drastically reduce the cost and complexity in modern big data Rapidly Maturing Matured over 13 years
  • 17. SnappyData: A new approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real time operational Analytics – TBs in memory RDB Rows Txn Columnar API Stream processing ODBC, JDBC, REST Spark - Scala, Java, Python, R HDFS AQP First commercial project on Approximate Query Processing(AQP) MPP DB Index
  • 18. Tenets, Guiding Principles ● Memory is the new bottleneck for speed ● Memory densities will follow Moore’s law ● 100% Spark compatible - powerful, concise. Simplify runtime ● Aim for Google Search like speed for analytic queries ● Dramatically reduce costs associated with Analytics Distributed systems are expensive – Esp. in production
  • 19. Use Case Patterns • Stream ingestion database for spark • Process streams, transform, real-time scoring, store, query • In-memory database for apps • Highly concurrent apps, SQL cache, OLTP + OLAP • Analytic caching pattern • Caching for Analytics over any “Big data” store (esp MPP) • Federate query between samples and backend
  • 20. Why Spark – Confluence of Streaming, interactive, batch Unifies batch, streaming, interactive comp. Easy to build sophisticated applications Support iterative, graph-parallel algorithms Powerful APIs in Scala, Python, Java Spark Spark Streaming SQL BlinkDB GraphX MLlib Streamin g Batch, Interactive Batch, Interactive Interactiv e Data-parallel, Iterative Sophisticated algos. Source: Spark summit presentation from Ion Stoica
  • 22. Snappy Spark Cluster Deployment topologies • Snappy store and Spark Executor share the JVM memory • Reference based access – zero copy • SnappyStore is isolated but use the same COLUMN FORMAT AS SPARK for high throughput Unified Cluster Split Cluster
  • 23. Simple API – Spark Compatible ● Access Table as DataFrame Catalog is automatically recovered ● Store RDD[T]/DataFrame can be stored in SnappyData tables ● Access from Remote SQL clients ● Addtional API for updates, inserts, deletes //Save a dataFrame using the Snappy or spark context … context.createExternalTable(”T1", "ROW", myDataFrame.schema, props ); //save using DataFrame API dataDF.write.format("ROW").mode(SaveMode.Append).options(pro ps).saveAsTable(”T1"); val impressionLogs: DataFrame = context.table(colTable) val campaignRef: DataFrame = context.table(rowTable) val parquetData: DataFrame = context.table(parquetTable) <… Now use any of DataFrame APIs … >
  • 24. Extends Spark CREATE [Temporary] TABLE [IF NOT EXISTS] table_name ( <column definition> ) USING ‘JDBC | ROW | COLUMN ’ OPTIONS ( COLOCATE_WITH 'table_name', // Default none PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default REDUNDANCY '1' , // Manage HA PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", // Empty string will map to default disk store. OFFHEAP "true | false" EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT", ….. [AS select_statement];
  • 25. Simple to Ingest Streams using SQL Consume from stream Transform raw data Continuous Analytics Ingest into in-memory Store Overflow table to HDFS Create stream table AdImpressionLog (<Columns>) using directkafka_stream options ( <socket endpoints> "topics 'adnetwork-topic’ “, "rowConverter ’ AdImpressionLogAvroDecoder’ ) streamingContext.registerCQ( "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques from AdImpressionLog window (duration '2' seconds, slide '2' seconds) where geo != 'unknown' group by publisher, geo”)// Register CQ .foreachDataFrame(df => { df.write.format("column").mode(SaveMode.Appen d) .saveAsTable("adImpressions")
  • 26. AdImpression Demo Spark, SQL Code Walkthrough, interactive SQL
  • 27. AdImpression Demo Performance comparison to Cassandra, MemSQL - TBD
  • 28. AdImpression Ingest Performance Loading from parquet files using 4 cores Stream ingest (Kafka + Spark streaming to store using 1 core)
  • 29. Why is ingestion, querying fast? Linearly scale with partition pruning Input queue, Stream, IMDB all share the same partitioning strategy
  • 30. How does it scale with concurrency? ● Parallel query engine skips Spark SQL scheduling for low latency queries ● Column tables For fast scan/aggregations Also automatically compressed ● Row tables Fast key based or selective queries Can have any number of secondary indices
  • 31. How does it scale with concurrency? ● Distributed shuffle for joins, ordering .. expensive ● Two techniques to minimize this 1) Colocation of related tables. All related records are collocated. For instance, ‘Users’ table partitioned across nodes and all related ‘Ad impressions” can be collocated on same partition. The parallel query would execute the Join locally on each partition 2) Replication of ‘Dimension’ tables Joins to dimension tables are always localized
  • 32. Low latency Interactive Analytic Queries – Exact or Approximate Select avg(volume),symbol from T1 where <time range> group by symbol Select avg(volume), symbol from T1 where <time range> group by symbol With error 0.1 confidence 0.8 Speed/Accuracy tradeoff Error 30 mins Time to Execute on Entire Dataset Interactive Queries 2 sec Execution Time (Sample Size) 32 100 secs 2 secs
  • 33. Key feature: Synopses Data ● Maintain stratified samples ○ Intelligent sampling to keep error bounds low ● Probabilistic data ○ TopK for time series (using time aggregation CMS, item aggregation) ○ Histograms, HyperLogLog, Bloom Filters, Wavelets CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS ( BASETABLE ‘table_name’ // source column table or stream table [ SAMPLINGMETHOD "stratified | uniform" ] STRATA name ( QCS (“comma-separated-column-names”) [ FRACTION “frac” ] ),+ // one or more QCS
  • 35. How do we extend Spark for Real Time? • Spark Executors are long running. Driver failure doesn’t shutdown Executors • Driver HA – Drivers run “Managed” with standby secondary • Data HA – Consensus based clustering integrated for eager replication
  • 36. How do we extend Spark for Real Time? • By pass scheduler for low latency SQL • Deep integration with Spark Catalyst(SQL) – collocation optimizations, indexing use, etc • Full SQL support – Persistent Catalog, Transaction, DML
  • 37. Performance – Spark vs Snappy (TPC-H) See ACM Sigmod 2016 paper for details Available on snappydata.io blogs
  • 38. Performance – Snappy vs in-memoryDB (YCSB)
  • 39. Unified OLAP/OLTP streaming w/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  • 40. www.snappydata.io SnappyData is Open Source ● Available for download on Github today ● https://github.com/SnappyDataInc/snappydata ● Learn more www.snappydata.io/blog ● Connect: ○ twitter: www.twitter.com/snappydata ○ facebook: www.facebook.com/snappydata ○ linkedin: www.linkedin.com/snappydata ○ slack: http://snappydata-slackin.herokuapp.com
  • 42. Colocated row/column Tables in Spark Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing ● Spark Executors are long lived and shared across multiple apps ● Snappy Memory Mgr and Spark Block Mgr integrated
  • 43. Table can be partitioned or replicated Replicated Table Partitioned Table (Buckets A-H) Replicated Table Partitioned Table (Buckets I-P) consistent replica on each node Partition Replica (Buckets A-H) Replicated Table Partitioned Table (Buckets Q-W)Partition Replica (Buckets I-P) Data partitioned with one or more replicas Use partitioned tables for large fact tables, Replicated for small dimension tables
  • 44. Spark Core Micro-batch Streaming Spark SQL Catalyst T X N OLAP Job Scheduler OLAP Query OLTP Query AQP P2P Cluster Replication Svc RowColumn Table Index Sample Table Shared nothing logs Spark program JDBC ODBC SnappyData Components

Notes de l'éditeur

  1. There is a reciprocal relationship with Spark RDDs/DataFrames. any table is visible as a DataFrame and vice versa. Hence, all the spark APIs, tranformations can also be applied to snappy managed tables. For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a snappy table like shown in the example. One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema. i.e. turn into into a DataFrame and store it.
  2. The SQL dialect will be Spark SQL ++. i.e. we are extending SQL to be much more compliant with standard SQL. A number of the extensions that dictate things like HA, disk persistence, etc are all specified through OPTIONS in spark SQL.
  3. CREATE HDFSSTORE streamingstore NameNode 'hdfs://gfxd1:8020' HomeDir 'stream-tables' BatchSize 10 BatchTimeInterval 2000 milliseconds QueuePersistent true MaxWriteOnlyFileSize 200 WriteOnlyFileRolloverInterval 1 minute;
  4. And, of course, the whole point behind colocation is to linear scale with minimal or even no shuffling. So, for instance, when using Kafka, all three components - Kafka, native RDD in Spark and Table in Snappy can all share the same partitioning strategy. As an example, in our telco case, all records associated with a subscriber can be colocated onto the same node - the queue, spark procesing of partitions and related reference data in Snappy store.
  5. When it comes to interactive analytics a lot is exploratory in nature. Folks are looking trends for different time periods, studying outlier patterns, etc. Unfortunately, like pointed out before, analytic queries can take a llong time even when in-memory. We want such exploratory analytics to ultimately happen at google like speeds. Don’t break the speed of thought. In many cases, do we really need a precise answer? like watching a trend on a visualization tool. We are thowing linear improvements to what seems like an exponential problem - like in some IoT scenarios. Stratified sampling allows the user to more intelligently sample so we can answer queries with a very small fraction of the data with good accuracy. What we do is allow the user to create one or more stratified samples on some “base” table data. The base table maybe all in-memory also or more often than not, could reside in HDFS.
  6. Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  7. Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  8. By default, we start the Spark cluster in an “embedded” mode. i.e. the in-memory store is fully collocated and in the same process space. We had to change the spark Block manager so both Gem as well as spark shares the same space for tables, cached RDDs, shuffle space, sorting, etc. This space can extend from JVM heap to off-heap. GemFire proactively monitors the JVM “old gen” so never goes beyond a certain critical threshold. i.e. we do a number of things so you don’t run OOM. Hoping to contribute this back to Spark. And, when running in the embedded mode we also make sure the executors are long lived. i.e. life cycle for these nodes are no longer tied to the Driver availability. Everything Spark does it cleaned up as expected though.
  9. Partitioning strategy, by default, is the same as Spark. We try to do uniform random distribution of the records across all the nodes designated to host a partitioned table. Any table can have one or more replicas. Replicas are always consistent with each other - sync writes. We parallely send the write to each replica and wait for ACKs. If a ACK is not received we start SUSPECT processing. Replicated tables, by default, are replicated to each node. Replicas are guaranteed to be consistent when failures occur i.e. the failed node rejoins. How do you recreate the state of the replica while thousands of other concurrent writes are in progress is a hard problem to solve.