Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, depending on the chosen sample size, approximate computing can make a systematic trade-off between output accuracy and computation efficiency. Unfortunately, state-of-the-art systems for approximate computing, such as BlinkDB and ApproxHadoop, primarily target batch analytics, where the input data remains unchanged during the course of sampling. Thus, they are not well suited for stream analytics. In this talk, we will present the design of StreamApprox, a Flink-based stream analytics system for approximate computing. StreamApprox implements an online stratified reservoir sampling algorithm in Apache Flink to produce approximate output with rigorous error bounds.
4. Approximate Computing
Many applications:
Approximate output is good enough!
E.g., Google Trends: Big Data vs Machine Learning (Sep/2012 – Sep/2017)
The trend of data is more important than the precise numbers
5. Approximate Computing
Take a sample → Compute → Approximate output ± Error bound
Approximate computing
Idea: To achieve low latency, compute over a subset of data items instead of the entire dataset
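The sample, compute, and error-bound pipeline on this slide can be illustrated with a minimal sketch. It assumes simple random sampling without replacement and a normal-approximation confidence interval; the function name `approximate_sum` and the parameter `z` are hypothetical and do not come from the talk.

```python
import math
import random
import statistics


def approximate_sum(items, sample_size, z=1.96, seed=None):
    """Estimate sum(items) from a simple random sample.

    Returns (estimate, error_bound), where error_bound is the half-width
    of an approximate confidence interval (z=1.96 is roughly 95%).
    Assumption: a normal approximation is acceptable for the sample mean.
    """
    rng = random.Random(seed)
    n = len(items)
    k = min(sample_size, n)
    if k < 2:
        raise ValueError("need at least two sampled items to estimate variance")

    # Take a sample (without replacement).
    sample = rng.sample(items, k)

    # Compute over the sample only, then scale up to the full input size.
    estimate = n * statistics.fmean(sample)

    # Error bound: standard error of the scaled mean, with the finite
    # population correction (n - k) / n, which is 0 when k == n.
    variance = statistics.variance(sample)
    std_err = n * math.sqrt(variance / k * (n - k) / n)
    return estimate, z * std_err
```

Calling `approximate_sum(data, 100)` touches only 100 items regardless of `len(data)`; a larger `sample_size` tightens the bound at the cost of more computation, which is the accuracy versus efficiency trade-off described above.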
12. Spark-based Sampling
Step #1: Create strata using groupByKey()
Step #2: Apply SRS to each stratum Si
Step #3: Synchronize between worker nodes to select a sample of size k
These steps are very expensive
Spark-based Stratified Sampling (Spark-based STS)
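To make the cost of these three steps concrete, here is a single-process analogue in plain Python. It is not Spark code: the function name `batch_stratified_sample` is hypothetical, SRS is taken to mean simple random sampling, and proportional allocation of the k slots across strata is an assumption of this sketch. The point is that both the grouping and the sizing need the whole batch before any item can be sampled.

```python
import random
from collections import defaultdict


def batch_stratified_sample(records, k, seed=None):
    """Stratified sample of total size k from (stratum_key, value) records."""
    rng = random.Random(seed)

    # Step 1: create strata. Like groupByKey(), this needs the whole batch.
    strata = defaultdict(list)
    for key, value in records:
        strata[key].append(value)

    # Global step (stands in for worker synchronization): the total count
    # must be known before per-stratum sample sizes can be fixed.
    total = sum(len(values) for values in strata.values())
    k = min(k, total)

    # Assumption: proportional allocation, rounded with the
    # largest-remainder method so the sizes add up to exactly k.
    quotas = {key: k * len(values) / total for key, values in strata.items()}
    sizes = {key: int(quota) for key, quota in quotas.items()}
    leftover = k - sum(sizes.values())
    by_remainder = sorted(strata, key=lambda key: quotas[key] - sizes[key], reverse=True)
    for key in by_remainder[:leftover]:
        sizes[key] += 1

    # Step 2: simple random sampling within each stratum.
    return {key: rng.sample(values, sizes[key]) for key, values in strata.items()}
```

In a distributed setting the grouping becomes a shuffle and the global count becomes a synchronization barrier, which is why the slide calls these steps expensive.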
13. StreamApprox: Core idea
Easy to parallelize; doesn't need any synchronization between workers
S1: RS, Weight = #items/k = 8/4
S2: RS, Weight = #items/k = 6/4
S3: Weight = 1
Size of reservoir = k, with k = 4
RS: Reservoir Sampling
Online Adaptive Stratified Reservoir Sampling (OASRS)
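The per-stratum reservoirs and weights on this slide can be sketched as follows. This is a minimal single-process illustration, not the StreamApprox implementation: the class name `StratifiedReservoirSampler` and its methods are hypothetical, the adaptive and Flink-specific parts are omitted, and the rule that a stratum with at most k items gets weight 1 is inferred from the slide's "Weight = 1" case.

```python
import random


class StratifiedReservoirSampler:
    """One fixed-size reservoir per stratum, filled in a single pass."""

    def __init__(self, k, seed=None):
        self.k = k                  # size of each reservoir
        self.rng = random.Random(seed)
        self.reservoirs = {}        # stratum -> sampled values
        self.counts = {}            # stratum -> number of items seen so far

    def add(self, stratum, value):
        count = self.counts.get(stratum, 0) + 1
        self.counts[stratum] = count
        reservoir = self.reservoirs.setdefault(stratum, [])
        if len(reservoir) < self.k:
            # Reservoir not full yet: keep every item.
            reservoir.append(value)
        else:
            # Classic reservoir sampling: the new item replaces a random
            # slot with probability k / count.
            j = self.rng.randrange(count)
            if j < self.k:
                reservoir[j] = value

    def weight(self, stratum):
        # Weight = #items / k once the stratum has overflowed its
        # reservoir; otherwise every item was kept and the weight is 1.
        count = self.counts[stratum]
        return count / self.k if count > self.k else 1.0

    def estimate_sum(self):
        # Scale each stratum's sampled sum by its weight.
        return sum(self.weight(s) * sum(r) for s, r in self.reservoirs.items())
```

Each call to `add` touches only the state of its own stratum, so separate workers could keep separate samplers without coordinating, in line with the "no synchronization" claim on the slide. Feeding 8, 6, and 3 items into strata S1, S2, and S3 with k = 4 gives weights 8/4, 6/4, and 1.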