BlinkDB and G-OLA:
Supporting Approximate Answers in SparkSQL
Sameer Agarwal and Kai Zeng
Spark Summit | San Francisco, CA | June 15th 2015
About Us
1. Sameer Agarwal
- Software Engineer at Databricks
- PhD in Databases (UC Berkeley)
- Research on Approximate Query Processing (BlinkDB)
2. Kai Zeng
- Post-doc in AMPLab / Intern at Databricks
- PhD in Databases (UCLA)
- Research on Approximate Query Processing (ABM)
[Figure: querying 100 TB on 1000 machines. Hard disks: ½ - 1 hour. Memory: 1 - 5 minutes. 1 second: ?]
Continuous Query Execution on Samples of Data
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in the table?
Exact answer over all 12 rows: 34.6667
Answer on a sample of the rows: 35
Answers with error bars, refined as more rows are processed:
35 ± 2.1
33.83 ± 1.3
34.6667 ± 0.0
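A minimal sketch of the refinement shown above, assuming the 12-row latency table is the whole data set and rows are consumed in table order. The object name `RunningEstimateSketch` and the error bar (a 95% normal approximation with a finite-population correction) are illustrative assumptions, not the BlinkDB or G-OLA implementation. Only the averages 35, 33.83 and 34.6667 are reproduced; the slide's error bars are not.

```scala
object RunningEstimateSketch {
  // Latency column of the example table, in row order.
  val latencies: Seq[Double] =
    Seq(30, 38, 34, 36, 37, 28, 32, 38, 36, 35, 38, 34).map(_.toDouble)

  final case class Estimate(rowsSeen: Int, mean: Double, halfWidth: Double)

  // Estimate the average from the first `rowsSeen` rows (rowsSeen >= 1).
  // Assumption: a 95% normal-approximation interval with a finite-population
  // correction, so the error shrinks to zero once every row has been seen.
  def estimate(all: Seq[Double], rowsSeen: Int): Estimate = {
    val sample = all.take(rowsSeen)
    val n = sample.size
    val mean = sample.sum / n
    val variance =
      if (n > 1) sample.map(x => (x - mean) * (x - mean)).sum / (n - 1) else 0.0
    val unseenFraction =
      if (all.size > 1) (all.size - n).toDouble / (all.size - 1) else 0.0
    Estimate(n, mean, 1.96 * math.sqrt(variance / n * unseenFraction))
  }
}
```

Calling `estimate(latencies, 5)`, `estimate(latencies, 6)` and `estimate(latencies, 12)` gives the three averages from the slides, with a zero-width interval once all 12 rows are included.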
Demo
[Figure: a query SELECT foo(*) FROM TABLE returns an approximate answer A ± ε. System layers: Error Estimation, Query Execution, Data Storage.]
Continuous Query Execution on Samples
G-OLA
Interface
val dataFrame = sqlCtx.sql("select avg(latency) from log")

// batch processing
val result = dataFrame.collect() // 34.6667

// online processing
val onlineDataFrame = dataFrame.online
onlineDataFrame.collectNext() // 35 ± 2.1
onlineDataFrame.collectNext() // 33.83 ± 1.3
Interface
val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext()) {
  onlineDataFrame.collectNext()
}
Interface
val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing: stop after a response-time budget
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() && responseTime <= 10.seconds) {
  onlineDataFrame.collectNext()
}
Interface
val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing: stop once the error bound is small enough
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() && errorBound >= 0.01) {
  onlineDataFrame.collectNext()
}
Interface
val dataFrame = sqlCtx.sql("select avg(latency) from log")

// online processing: keep refining until the user cancels
val onlineDataFrame = dataFrame.online
while (onlineDataFrame.hasNext() && !userEvent.cancelled()) {
  onlineDataFrame.collectNext()
}
Interface
AGGREGATES / UDAFs
JOINS / GROUP BYs
NESTED QUERIES
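The stopping rules in the loops above can be sketched without Spark. This is a minimal sketch, assuming each call to `collectNext()` yields an estimate together with its error bound. `Partial` and `refineUntil` are hypothetical names introduced here for illustration, not part of the previewed API.

```scala
object OnlineLoopSketch {
  // One partial answer: an estimate and its error bound (hypothetical type).
  final case class Partial(estimate: Double, errorBound: Double)

  // Pull partial answers until the error bound drops below `maxError`
  // or the input is exhausted. Returns the last answer seen, if any.
  def refineUntil(partials: Iterator[Partial], maxError: Double): Option[Partial] = {
    var last: Option[Partial] = None
    // Mirrors `hasNext() && errorBound >= bound`: before the first pull
    // there is no answer yet, so the loop always tries at least once.
    while (partials.hasNext && last.forall(_.errorBound >= maxError)) {
      last = Some(partials.next())
    }
    last
  }
}
```

With the answers from the earlier slides (35 ± 2.1, 33.83 ± 1.3, 34.6667 ± 0.0) and `maxError = 1.5`, the loop stops at the second answer and leaves the third unconsumed.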
Interface
[Figure: a query SELECT foo(*) FROM TABLE returns an approximate answer A ± ε. System layers: Query Interface, Error Estimation, Query Execution, Data Storage.]
Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013.
Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013.
Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014.
Continuous Query Execution on Samples
Focused on estimating aggregate errors given representative samples:
- Central Limit Theorem (CLT): HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96
- Error Estimation using Bootstrap: EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93
Error Estimation on a Sample of Data
d
predicate for the query)
The following results are true asymptotically in the sample size, but are not directly useful, since they depend on unknown properties of the underlying distribution. In all cases we just plug in the sample values. For example, instead of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i$ is the $i$th sample value.
Note that for estimators other than sum and count, I assume no filtering ($p = 1$). Filtering will increase variance a bit, or potentially a lot for extremely selective queries ($p = 0$). I can compute the filtering-adjusted values if you like.
1. Count: $N(np,\ n(1-p)p)$
2. Sum: $N(np\mu,\ np(\sigma^2 + (1-p)\mu^2))$
3. Mean: $N(\mu,\ \sigma^2/n)$
4. Variance: $N(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
5. Stddev: $N(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$
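The plug-in step for the first three estimators can be sketched as follows. This is a minimal sketch under stated assumptions: `CltPlugInSketch` and its method names are hypothetical, $p$ is taken to be the fraction of sampled rows that pass the filter, and $\mu$ and $\sigma^2$ in the sum formula are taken over the passing rows. Counts and sums refer to the $n$ sampled rows and are not scaled up to the full table.

```scala
object CltPlugInSketch {
  // Normal approximation N(mean, variance) of an estimator.
  final case class Normal(mean: Double, variance: Double) {
    def halfWidth95: Double = 1.96 * math.sqrt(variance)
  }

  private def meanOf(xs: Seq[Double]): Double = xs.sum / xs.size

  // Plug-in (divide-by-n) variance of the sample values.
  private def varianceOf(xs: Seq[Double]): Double = {
    val m = meanOf(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Count of sampled rows passing the filter: N(np, n(1 - p)p).
  def count(sample: Seq[Double], keep: Double => Boolean): Normal = {
    val n = sample.size
    val p = sample.count(keep).toDouble / n
    Normal(n * p, n * (1 - p) * p)
  }

  // Sum over passing rows: N(np*mu, np(sigma^2 + (1 - p)mu^2)).
  // Assumes at least one sampled row passes the filter.
  def sum(sample: Seq[Double], keep: Double => Boolean): Normal = {
    val n = sample.size
    val passing = sample.filter(keep)
    val p = passing.size.toDouble / n
    val mu = meanOf(passing)
    Normal(n * p * mu, n * p * (varianceOf(passing) + (1 - p) * mu * mu))
  }

  // Mean without filtering (p = 1): N(mu, sigma^2 / n).
  def mean(sample: Seq[Double]): Normal =
    Normal(meanOf(sample), varianceOf(sample) / sample.size)
}
```

On the 12 latency values with the filter `latency > 35`, six rows pass, so `count` returns a mean of 6 with variance 3, and `sum` returns a mean of 223.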
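[Figure: bootstrap. Sampling draws a sample S from the data D. Resampling draws S1 ... S100 from S. The query θ is evaluated on each resample, and the spread of θ(S1) ... θ(S100) around θ(S) gives a 95% confidence interval.]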
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
What is the average latency in
the table?
Resample 1:
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
θ1 = 34
Resample 2:
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
θ2 = 32.5
...
Resample 100:
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
θ100 = 35
Result: 34.5 ± 2
Resample 1:
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
θ1 = 34.6
Resample 2:
ID City Latency
1 SLC 37
2 NYC 30
3 SLC 34
4 LA 36
5 NYC 30
θ2 = 33.4
...
Resample 100:
ID City Latency
1 SLC 34
2 SLC 37
3 SLC 34
4 LA 36
5 LA 36
θ100 = 35.4
Result: 35 ± 1.6
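The bootstrap procedure on these slides can be sketched as below. This is a minimal sketch, assuming the query is a plain average and that the error bar is reported as 1.96 standard deviations of the resampled answers. That reporting rule and the names in `BootstrapSketch` are assumptions for illustration. The slides do not say how the ± value is derived from θ1 ... θ100.

```scala
import scala.util.Random

object BootstrapSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Draw one resample of the same size as `sample`, with replacement.
  def resample(sample: IndexedSeq[Double], rng: Random): IndexedSeq[Double] =
    IndexedSeq.fill(sample.size)(sample(rng.nextInt(sample.size)))

  // Evaluate the query (here: the mean) on `numResamples` resamples.
  def thetas(sample: IndexedSeq[Double], numResamples: Int, rng: Random): IndexedSeq[Double] =
    IndexedSeq.fill(numResamples)(mean(resample(sample, rng)))

  // Assumed reporting rule: 1.96 standard deviations of the resampled answers.
  def errorBar(thetas: Seq[Double]): Double = {
    val m = mean(thetas)
    1.96 * math.sqrt(thetas.map(t => (t - m) * (t - m)).sum / thetas.size)
  }
}
```

For the five-row sample (30, 38, 34, 36, 37), `mean` gives 35 and `errorBar(thetas(sample, 100, new Random(42)))` gives one possible error bar. The exact value depends on the random seed.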
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
Error Estimation in BlinkDB
Leverage Poissonized Resampling to generate samples with replacement
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
Column #1: each row's count in resample 1, sampled from a Poisson(1) distribution
θ1 = 33.8
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
6 SF 28 2
Incremental Error
Estimation
ID City Latency #1 #2
1 NYC 30 2 1
2 NYC 38 1 0
3 SLC 34 0 2
4 SLC 34 1 2
5 SLC 37 1 0
6 SF 28 2 1
Construct all Resamples in a Single Pass
Error Estimation in BlinkDB
0.2-0.5% additional overhead
High Level Take-away:
Bootstrap and Poissonized Resampling techniques are the key to achieving quick and continuous error bars for a general set of queries
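Poissonized resampling can be sketched in one pass over the rows. This is a minimal sketch, assuming an average query. `PoissonizedResamplingSketch`, `ResampledAvg` and `weightedMean` are hypothetical names, and the Poisson sampler uses Knuth's multiplication method, which is a choice made here rather than something stated in the slides.

```scala
import scala.util.Random

object PoissonizedResamplingSketch {
  // Knuth's multiplication method; adequate for small lambda such as 1.0.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var prod = rng.nextDouble()
    while (prod > limit) {
      k += 1
      prod *= rng.nextDouble()
    }
    k
  }

  // Running state for `numResamples` resamples of an AVG query.
  // Each incoming row gets one Poisson(1) count per resample, so all
  // resamples are built in a single pass and updated incrementally.
  final class ResampledAvg(numResamples: Int, rng: Random) {
    private val weightedSum = Array.fill(numResamples)(0.0)
    private val weightTotal = Array.fill(numResamples)(0L)

    def add(value: Double): Unit = {
      var i = 0
      while (i < numResamples) {
        val w = poisson(1.0, rng)
        weightedSum(i) += w * value
        weightTotal(i) += w
        i += 1
      }
    }

    // One answer per resample; resamples that are still empty are skipped.
    def thetas: Seq[Double] =
      weightedSum.indices.collect {
        case i if weightTotal(i) > 0 => weightedSum(i) / weightTotal(i)
      }
  }

  // Average of `values` where each row is repeated `weights(i)` times.
  def weightedMean(values: Seq[Double], weights: Seq[Int]): Double =
    values.zip(weights).map { case (v, w) => v * w }.sum / weights.sum
}
```

With the counts from column #1 above (2, 1, 0, 1, 1) on the latencies (30, 38, 34, 34, 37), `weightedMean` returns 33.8, matching θ1 on the slide.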
G-OLA
Kai Zeng, Sameer Agarwal, Ankur Dave,
Michael Armbrust and Ion Stoica.
G-OLA: Generalized Online Aggregation
for Interactive Analysis on Big Data. In
SIGMOD 2015.
Continuous Query Execution on Samples
Query Execution: Under The Hood
[Figure: Data, Query, Answer ± ε. Re-running the query as more data arrives yields A (10 sec), A' (10+10 sec) and A'' (10+10+10 sec).]
Overall Quadratic Cost!
Query Execution: Under The Hood
[Figure: Data, Query, Answer ± ε, with A (10 sec) and A' (10+10 sec). The previous answer A is carried forward into the computation of A'.]
Delta Update Query
Delta Update Queries
[Figure: Data, Query, Answer ± ε, shown for successive batches of data.]
Delta Update: Simple Queries
SELECT avg(latency)
FROM log
[Figure: query plan, AVG over SCAN, producing answer A. The plan is shown again for the next batch of data.]
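For a simple average, a delta update only needs a small running state. This is a minimal sketch, assuming the state is a running sum and row count. `DeltaAvgSketch` and `AvgState` are hypothetical names, not taken from the G-OLA code.

```scala
object DeltaAvgSketch {
  // State carried between mini-batches; enough to answer AVG at any point.
  final case class AvgState(sum: Double, count: Long) {
    // Delta update: fold in only the newly arrived rows.
    def update(newRows: Seq[Double]): AvgState =
      AvgState(sum + newRows.sum, count + newRows.size)

    // Current answer; only meaningful once count > 0.
    def value: Double = sum / count
  }

  val empty: AvgState = AvgState(0.0, 0L)
}
```

Feeding the latency table in three batches (rows 1-5, row 6, rows 7-12) yields 35, then about 33.83, then about 34.67, and each step costs time proportional to the new batch rather than to all rows seen so far.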
Delta Update: Nested Queries
SELECT avg(latency)
FROM log
WHERE latency >
(
  SELECT avg(latency)
  FROM log
)
[Figure: query plan in two parts, labeled (I) and (II). The inner query (AVG over SCAN) produces A. The outer query joins A with a second SCAN, applies FILTER latency > A, and computes AVG. When new data changes A to A', the filter result changes too.]
The inner answer is itself approximate: A ± ε, for example 10 ± 2. The filter latency > A then splits the rows into three ranges:
- latency < 8: fails the filter for every value of A in the interval
- latency > 12: passes the filter for every value of A in the interval
- 8 < latency < 12: the outcome depends on the final value of A
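The three ranges can be computed directly from the inner estimate and its error. This is a minimal sketch, assuming rows are plain latency values. `UncertainFilterSketch` and `Partition` are hypothetical names, and treating the boundary values (exactly 8 or 12 in the example) as uncertain is a conservative choice made here, since the slide leaves them unspecified.

```scala
object UncertainFilterSketch {
  final case class Partition(pass: Seq[Double], fail: Seq[Double], uncertain: Seq[Double])

  // Split rows for the filter `latency > A` when A is only known
  // to lie in [estimate - error, estimate + error].
  def partition(rows: Seq[Double], estimate: Double, error: Double): Partition = {
    val lo = estimate - error
    val hi = estimate + error
    Partition(
      pass = rows.filter(_ > hi),                     // passes for any A in the interval
      fail = rows.filter(_ < lo),                     // fails for any A in the interval
      uncertain = rows.filter(r => r >= lo && r <= hi) // depends on the final A
    )
  }
}
```

With the inner answer 10 ± 2, only the rows between 8 and 12 would need to be revisited when A is refined. The other rows already have a settled filter outcome.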
High Level Take-away:
Introduce Delta Update Queries as a First Class
Citizen in Query Execution
Check out our code!
1. Code Preview: http://github.com/amplab/bootstrap-sql.
Send us an email at kaizeng@cs.berkeley.edu and sameer@databricks.com to get access!
2. Spark Package in July '15
3. Gradual Native SparkSQL Integration in 1.5, 1.6 and beyond
Conclusion
1. Continuous Query Execution on Samples of Data is an important means to achieve interactivity in processing large datasets
2. New SparkSQL Libraries:
- BlinkDB for Continuous Error Bars
- G-OLA for Continuous Partial Answers
Thank you.
Sameer Agarwal (sameer@databricks.com)
Kai Zeng (kaizeng@cs.berkeley.edu)

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Dernier

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 

Dernier (20)

Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL (Sameer Agarwal and Kai Zeng, Databricks and AMPLab, UC Berkeley)

  • 1. BlinkDB and G-OLA: Supporting Approximate Answers in SparkSQL Sameer Agarwal and Kai Zeng Spark Summit | San Francisco, CA | June 15th 2015
  • 2. About Us
        1. Sameer Agarwal: Software Engineer at Databricks; PhD in Databases (UC Berkeley); research on Approximate Query Processing (BlinkDB)
        2. Kai Zeng: Post-doc in AMP Lab / intern at Databricks; PhD in Databases (UCLA); research on Approximate Query Processing (ABM)
  • 3. 100 TB on 1000 machines. Hard disks: ½ to 1 hour. Memory: 1 to 5 minutes. 1 second: ? Continuous Query Execution on Samples of Data
  • 4. Continuous Query Execution on Samples. What is the average latency in the table? 34.6667
        ID  City  Latency
        1   NYC   30
        2   NYC   38
        3   SLC   34
        4   LA    36
        5   SLC   37
        6   SF    28
        7   NYC   32
        8   NYC   38
        9   LA    36
        10  SF    35
        11  NYC   38
        12  LA    34
  • 5. Continuous Query Execution on Samples (latency table from slide 4). What is the average latency in the table? 35
  • 6. Continuous Query Execution on Samples (latency table from slide 4). What is the average latency in the table? 35 ± 2.1
  • 7. Continuous Query Execution on Samples (latency table from slide 4). What is the average latency in the table? 35 ± 2.1, then 33.83 ± 1.3
  • 8. Continuous Query Execution on Samples (latency table from slide 4). What is the average latency in the table? 35 ± 2.1, then 33.83 ± 1.3, then 34.6667 ± 0.0
  • 10. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE; answer: A ± ε. Stack: Error Estimation, Query Execution, Data Storage
  • 11. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE; answer: A ± ε. Stack: Error Estimation, Query Execution, Data Storage. G-OLA
  • 12. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // batch processing
        val result = dataFrame.collect() // 34.6667
  • 13. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // online processing
        val onlineDataFrame = dataFrame.online
        onlineDataFrame.collectNext() // 35 ± 2.1
        onlineDataFrame.collectNext() // 33.83 ± 1.3
  • 14. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // online processing
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext()) {
          onlineDataFrame.collectNext()
        }
  • 15. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // online processing: stop once the response-time budget is spent
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && responseTime <= 10.seconds) {
          onlineDataFrame.collectNext()
        }
  • 16. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // online processing: stop once the error bound drops below 0.01
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && errorBound >= 0.01) {
          onlineDataFrame.collectNext()
        }
  • 17. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // online processing: stop once the user cancels
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && !userEvent.cancelled()) {
          onlineDataFrame.collectNext()
        }
  • 18. Interface
        val dataFrame = sqlCtx.sql("select avg(latency) from log")
        // online processing: stop once the user cancels
        val onlineDataFrame = dataFrame.online
        while (onlineDataFrame.hasNext() && !userEvent.cancelled()) {
          onlineDataFrame.collectNext()
        }
        AGGREGATES/UDAFs, JOINS/GROUP BYs, NESTED QUERIES
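The stopping conditions on the Interface slides can be combined in one loop. The sketch below is plain Scala and does not use the `online` / `collectNext` calls from the slides; `Estimate`, `OnlineLoop.refine`, `maxError` and `maxBatches` are hypothetical names introduced only for illustration, and a batch budget stands in for the response-time bound.

```scala
// Hypothetical stand-in for one partial answer: a value and its error bar.
final case class Estimate(value: Double, error: Double)

object OnlineLoop {
  // Pulls partial answers until they run out, the batch budget is spent,
  // or the latest error bar is at most maxError. Returns the last answer seen.
  def refine(answers: Iterator[Estimate], maxError: Double, maxBatches: Int): Option[Estimate] = {
    var latest: Option[Estimate] = None
    var batches = 0
    // latest.forall(...) is true while no answer has been pulled yet.
    while (answers.hasNext && batches < maxBatches && latest.forall(_.error > maxError)) {
      latest = Some(answers.next())
      batches += 1
    }
    latest
  }
}
```

Fed the three answers from slide 8, a call such as `OnlineLoop.refine(answers, 1.5, 10)` would stop at the second answer, because its error bar of 1.3 is already within the requested 1.5.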
  • 19. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE; answer: A ± ε. Stack: Error Estimation, Query Execution, Data Storage
  • 20. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE; answer: A ± ε. Stack: Query Interface, Error Estimation, Query Execution, Data Storage. References:
        - Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013.
        - Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013.
        - Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014.
  • 21. Error Estimation on a Sample of Data. Focused on estimating aggregate errors given representative samples.
        Central Limit Theorem (CLT) [HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96]:
        [...]d predicate for the query) The following results are (asymptotically in sample size) true, but not directly useful, since they depend on unknown properties of the underlying distribution. In all cases we just plug in the sample values. For example, instead of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i$ is the $i$th sample value. Note that for estimators other than sum and count, I assume no filtering ($p = 1$). Filtering will increase variance a bit, or potentially a lot for extremely selective queries ($p = 0$). I can compute the filtering-adjusted values if you like.
        1. Count: $\mathcal{N}(np,\ n(1-p)p)$
        2. Sum: $\mathcal{N}(np\mu,\ np(\sigma^2 + (1-p)\mu^2))$
        3. Mean: $\mathcal{N}(\mu,\ \sigma^2/n)$
        4. Variance: $\mathcal{N}(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
        5. Stddev: $\mathcal{N}(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$
        Error Estimation using Bootstrap [EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93]:
        Sampling: D → S. Resampling: S → S1, ..., S100. θ(S), θ(S1), ..., θ(S100) → 95% confidence interval.
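As a small illustration of the CLT result for the mean, the sketch below plugs sample values into $\mathcal{N}(\mu, \sigma^2/n)$ and reports a 95% half-width of $1.96\sqrt{\sigma^2/n}$. The object name `CltMean`, the 1.96 multiplier for a 95% interval, and the $n - 1$ divisor in the variance estimate are choices made for this sketch, not details taken from the slides.

```scala
object CltMean {
  // Returns (sample mean, 95% half-width) under the normal approximation.
  def estimate(sample: Seq[Double]): (Double, Double) = {
    require(sample.size >= 2, "need at least two values to estimate the variance")
    val n = sample.size
    val mean = sample.sum / n
    // Plug-in estimate of sigma^2 (n - 1 divisor is an assumption of this sketch).
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val halfWidth = 1.96 * math.sqrt(variance / n)
    (mean, halfWidth)
  }
}
```

Calling `CltMean.estimate(Seq(30.0, 38.0, 34.0, 36.0))` on the first four latencies of the table gives a mean of 34.5; the half-width comes from this sketch's own formula and is not expected to reproduce the error bars printed on the slides.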
  • 22. Error Estimation using Bootstrap. What is the average latency in the table? (latency table from slide 4)
  • 23-24. Error Estimation using Bootstrap (latency table from slide 4). Resamples of the first 4 rows, shown as City Latency:
        Resample 1:   NYC 30, NYC 38, SLC 34, SLC 34; θ1 = 34
        Resample 2:   NYC 30, NYC 30, SLC 34, LA 36; θ2 = 32.5
        ...
        Resample 100: SLC 34, LA 36, SLC 34, LA 36; θ100 = 35
        Result: 34.5 ± 2
  • 25. Error Estimation using Bootstrap (latency table from slide 4). Resamples of the first 5 rows, shown as City Latency:
        Resample 1:   NYC 30, NYC 38, SLC 34, SLC 34, SLC 37; θ1 = 34.6
        Resample 2:   SLC 37, NYC 30, SLC 34, LA 36, NYC 30; θ2 = 33.4
        ...
        Resample 100: SLC 34, SLC 37, SLC 34, LA 36, LA 36; θ100 = 35.4
        Result: 35 ± 1.6
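The bootstrap slides can be sketched in a few lines: draw resamples with replacement, compute the mean of each, and read an error bar off the spread of those means. This is a minimal sketch, not BlinkDB's implementation; `Bootstrap`, `resampleMeans` and `errorBar` are hypothetical names, and using 1.96 times the standard deviation of the resample means is just one common way to turn them into a 95% error bar.

```scala
import scala.util.Random

object Bootstrap {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Draws `count` resamples with replacement, each as large as the sample,
  // and returns the mean of every resample (theta_1 ... theta_count).
  def resampleMeans(sample: IndexedSeq[Double], count: Int, rng: Random): IndexedSeq[Double] =
    IndexedSeq.fill(count) {
      mean(IndexedSeq.fill(sample.size)(sample(rng.nextInt(sample.size))))
    }

  // Error bar as 1.96 standard deviations of the resample means
  // (a normal-approximation choice made for this sketch).
  def errorBar(thetas: Seq[Double]): Double = {
    val m = mean(thetas)
    val variance = thetas.map(t => (t - m) * (t - m)).sum / thetas.size
    1.96 * math.sqrt(variance)
  }
}
```

For example, `Bootstrap.resampleMeans(IndexedSeq(30.0, 38.0, 34.0, 36.0), 100, new Random(42))` produces 100 values playing the role of θ1 through θ100 on the slides; the exact numbers depend on the random seed.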
  • 26. Error Estimation in BlinkDB. What is the average latency in the table? (latency table from slide 4) Leverage Poissonized Resampling to generate samples with replacement. Sample:
        ID  City  Latency
        1   NYC   30
        2   NYC   38
        3   SLC   34
        4   SLC   34
        5   SLC   37
  • 27. Error Estimation in BlinkDB. What is the average latency in the table? (latency table from slide 4) Sample from a Poisson(1) Distribution:
        ID  City  Latency  #1
        1   NYC   30       2
        2   NYC   38       1
        3   SLC   34       0
        4   SLC   34       1
        5   SLC   37       1
        θ1 = 33.8
  • 28. Error Estimation in BlinkDB. What is the average latency in the table? (latency table from slide 4) Incremental Error Estimation:
        ID  City  Latency  #1
        1   NYC   30       2
        2   NYC   38       1
        3   SLC   34       0
        4   SLC   34       1
        5   SLC   37       1
        6   SF    28       2
  • 29. Error Estimation in BlinkDB. What is the average latency in the table? (latency table from slide 4) Construct all Resamples in a Single Pass (0.2-0.5% additional overhead):
        ID  City  Latency  #1  #2
        1   NYC   30       2   1
        2   NYC   38       1   0
        3   SLC   34       0   2
        4   SLC   34       1   2
        5   SLC   37       1   0
        6   SF    28       2   1
  • 30. High Level Take-away: Bootstrap and Poissonized Resampling techniques are the key to quick and continuous error bars for a general set of queries
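Poissonized resampling, as the slides describe it, attaches one Poisson(1) weight per resample to each row, so every resample can be maintained in a single pass and extended as new rows arrive. The sketch below is an illustration under that reading, not BlinkDB's code: `PoissonResampling`, `PoissonBootstrap` and their methods are hypothetical names, and the Poisson draw uses Knuth's multiplication method because the Scala standard library has no Poisson sampler.

```scala
import scala.util.Random

object PoissonResampling {
  // Knuth's method for Poisson(1): multiply uniform draws until the product
  // drops to e^-1 or below; the number of draws needed, minus one, is the sample.
  def poisson1(rng: Random): Int = {
    val limit = math.exp(-1.0)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Mean of one resample, given each row's weight (how often it was drawn).
  def weightedMean(values: Seq[Double], weights: Seq[Int]): Double = {
    val total = weights.sum
    require(total > 0, "all weights are zero")
    values.zip(weights).map { case (v, w) => v * w }.sum / total
  }
}

// Keeps a weighted sum and weight total per resample, updated row by row.
final class PoissonBootstrap(resamples: Int, rng: Random) {
  private val weightedSum = Array.fill(resamples)(0.0)
  private val weightTotal = Array.fill(resamples)(0L)

  def add(x: Double): Unit = {
    var i = 0
    while (i < resamples) {
      val w = PoissonResampling.poisson1(rng)
      weightedSum(i) += w * x
      weightTotal(i) += w
      i += 1
    }
  }

  // Means of all resamples that received at least one row so far.
  def means: IndexedSeq[Double] =
    (0 until resamples).collect {
      case i if weightTotal(i) > 0 => weightedSum(i) / weightTotal(i)
    }
}
```

With the weights from slide 27, `PoissonResampling.weightedMean(Seq(30.0, 38.0, 34.0, 34.0, 37.0), Seq(2, 1, 0, 1, 1))` is 169 / 5 = 33.8, matching θ1 on that slide.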
  • 31. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE; answer: A ± ε. Stack: Query Interface, Error Estimation, Query Execution, Data Storage. G-OLA. Reference: Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust and Ion Stoica. G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015.
  • 32. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec)
  • 33. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec), then A' (10+10 sec)
  • 34. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec), then A' (10+10 sec), then A'' (10+10+10 sec)
  • 35. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec), then A' (10+10 sec), then A'' (10+10+10 sec). Overall Quadratic Cost!
  • 36-37. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec), then A' (10+10 sec)
  • 38-39. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec), then A' (10+10 sec); the previous answer A reappears in the diagram
  • 40. Query Execution: Under The Hood. Data → Query → Answer A ± ε (10 sec), then A' (10+10 sec); the previous answer A reappears in the diagram. Delta Update Query
  • 47-48. Delta Update: Simple Queries. Query: SELECT avg(latency) FROM log. Plan: SCAN → AVG → A
  • 49. Delta Update: Simple Queries. Query: SELECT avg(latency) FROM log. Plan: SCAN → AVG → A, shown twice
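For a simple AVG, a delta update only needs a running sum and count: each new batch is folded into that state instead of rescanning earlier data. The sketch below illustrates this idea and the row counts behind the "quadratic cost" remark; it is not G-OLA's implementation, and `AvgState`, `DeltaAvg` and their methods are hypothetical names.

```scala
// Running state for AVG: enough to fold in a new batch without rescanning.
final case class AvgState(sum: Double, count: Long) {
  def merge(batch: Seq[Double]): AvgState = AvgState(sum + batch.sum, count + batch.size)
  def value: Double = if (count == 0) Double.NaN else sum / count
}

object DeltaAvg {
  val empty: AvgState = AvgState(0.0, 0L)

  // Rows read if every refinement re-reads all batches seen so far.
  def rowsScannedRecompute(batchSizes: Seq[Int]): Long =
    batchSizes.scanLeft(0L)(_ + _).tail.sum

  // Rows read if every refinement reads only its new batch.
  def rowsScannedDelta(batchSizes: Seq[Int]): Long =
    batchSizes.map(_.toLong).sum
}
```

Splitting the 12-row latency table into three batches of four, merging them one by one ends at 416 / 12, the same average as a full scan, while reading 12 rows in total instead of 4 + 8 + 12 = 24.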
  • 50. Delta Update: Nested Queries. Query: SELECT avg(latency) FROM log WHERE latency > (SELECT avg(latency) FROM log). Plan operators: FILTER, JOIN, AVG, SCAN, SCAN, AVG. Inner result: A. Filter: latency > A. (I)
  • 51. Delta Update: Nested Queries (same query and plan operators). Inner result: A, then A'. Filter: latency > A. (I) (II)
  • 52. Delta Update: Nested Queries (same query and plan operators). Inner result: A. Filter: latency > A. (I)
  • 53. Delta Update: Nested Queries (same query and plan operators). Inner result: A ± ε. Filter: latency > A. (I)
  • 54. Delta Update: Nested Queries (same query and plan operators). Inner result: 10±2. Filter: latency > A. (I)
  • 55. Delta Update: Nested Queries (same query and plan operators). Inner result: 10±2. Filter: latency > A. Rows: latency < 8. (I)
  • 56. Delta Update: Nested Queries (same query and plan operators). Inner result: 10±2. Filter: latency > A. Rows: latency > 12. (I)
  • 57. Delta Update: Nested Queries (same query and plan operators). Inner result: 10±2. Filter: latency > A. Rows: 8 < latency < 12. (I)
  • 58. Delta Update: Nested Queries (same query and plan operators). Inner result: 10±2. Filter: latency > A. Rows: 8 < latency < 12. (I) (II)
  • 59. High Level Take-away: Introduce Delta Update Queries as a First Class Citizen in Query Execution
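One reading of the nested-query slides is that, when the inner average is only known as 10 ± 2, the filter `latency > A` is already decided for rows below 8 or above 12, and only rows in between may need to be revisited when A is refined. The sketch below illustrates that three-way split; it is an interpretation of the slides rather than G-OLA's implementation, and `Verdict`, `UncertainFilter` and `classify` are hypothetical names. Rows exactly on a boundary are treated as uncertain, which the slides do not specify.

```scala
sealed trait Verdict
case object Passes extends Verdict
case object Fails extends Verdict
case object Uncertain extends Verdict

object UncertainFilter {
  // Classifies `latency > A` when A is only known as estimate ± error.
  def classify(latency: Double, estimate: Double, error: Double): Verdict =
    if (latency > estimate + error) Passes
    else if (latency < estimate - error) Fails
    else Uncertain

  // Splits rows into (decided, uncertain); only the second group would
  // need another look once the inner estimate changes.
  def split(latencies: Seq[Double], estimate: Double, error: Double): (Seq[Double], Seq[Double]) =
    latencies.partition(l => classify(l, estimate, error) != Uncertain)
}
```

With the slide's numbers, a latency of 13 passes, 7 fails, and 9 stays uncertain until the error bar on the inner average shrinks.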
  • 60. Check out our code! 1. Code Preview: http://github.com/amplab/bootstrap-sql. Send us an email at kaizeng@cs.berkeley.edu and sameer@databricks.com to get access! 2. Spark Package in July '15 3. Gradual Native SparkSQL Integration in 1.5, 1.6 and beyond
  • 61. Conclusion 1. Continuous Query Execution on Samples of Data is an important means to achieve interactivity in processing large datasets 2. New SparkSQL Libraries: - BlinkDB for Continuous Error Bars - G-OLA for Continuous Partial Answers