User Defined Aggregation in Apache Spark: A Love Story
Erik Erlandson
Principal Software Engineer
All Love Stories Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish The Plot
Spark's Scale-Out World
[Figure: one logical column of values (2, 3, 2, 5, 3, 5, 2, 3, 5) stored physically as three partitions: (2 3 2), (5 3 5), (2 3 5)]
Scale-Out Sum
Within one partition (2 3 5):
s = 0
s = s + 2 (2)
s = s + 3 (5)
s = s + 5 (10)
Partial sums per partition:
2 3 5 -> 10
5 3 5 -> 13
2 3 2 -> 7
Merge:
13 + 7 = 20
10 + 20 = 30
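The scale-out sum can be sketched in plain Scala collections, with no Spark involved. This is only an illustration: the object and method names (`ScaleOutSum`, `partialSum`) are assumptions introduced here, and the nested `Seq` stands in for physical partitions.

```scala
object ScaleOutSum {
  // The three partitions shown on the slide (a plain Seq stands in for Spark partitions).
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

  // Each partition starts from s = 0 and applies s = s + x to every element.
  def partialSum(partition: Seq[Int]): Int =
    partition.foldLeft(0)((s, x) => s + x)

  // One partial sum per partition.
  val partials: Seq[Int] = partitions.map(partialSum)

  // The partial sums are then merged pairwise into a single result.
  val total: Int = partials.reduce((a1, a2) => a1 + a2)

  def main(args: Array[String]): Unit =
    println(s"partials = $partials, total = $total")
}
```

The per-partition fold and the final reduce correspond to the update and merge steps that the aggregator table introduces next.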
Spark Aggregators
Operation  Data     Accumulator   Zero    Update                Merge
Sum        Numbers  Number        0       a + x                 a1 + a2
Max        Numbers  Number        -∞      max(a, x)             max(a1, a2)
Average    Numbers  (sum, count)  (0, 0)  (sum + x, count + 1)  (s1 + s2, c1 + c2)
Present (Average): sum / count
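The table's columns can be read as one small interface. The following is a minimal sketch in plain Scala: the trait `Agg` and the helper `run` are hypothetical names made up for this illustration and are not Spark's `Aggregator` API.

```scala
object AggregatorSketch {
  // Hypothetical trait (not Spark's API) with one member per table column.
  trait Agg[D, A, R] {
    def zero: A
    def update(a: A, x: D): A
    def merge(a1: A, a2: A): A
    def present(a: A): R
  }

  val sum: Agg[Double, Double, Double] = new Agg[Double, Double, Double] {
    def zero: Double = 0.0
    def update(a: Double, x: Double): Double = a + x
    def merge(a1: Double, a2: Double): Double = a1 + a2
    def present(a: Double): Double = a
  }

  val max: Agg[Double, Double, Double] = new Agg[Double, Double, Double] {
    def zero: Double = Double.NegativeInfinity
    def update(a: Double, x: Double): Double = math.max(a, x)
    def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
    def present(a: Double): Double = a
  }

  // Average carries a (sum, count) accumulator and divides only when presenting.
  val average: Agg[Double, (Double, Long), Double] = new Agg[Double, (Double, Long), Double] {
    def zero: (Double, Long) = (0.0, 0L)
    def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
    def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
      (a1._1 + a2._1, a1._2 + a2._2)
    def present(a: (Double, Long)): Double = a._1 / a._2
  }

  // Fold each partition from zero with update, merge the partials, then present.
  def run[D, A, R](agg: Agg[D, A, R], partitions: Seq[Seq[D]]): R = {
    val partials = partitions.map(p => p.foldLeft(agg.zero)((a, x) => agg.update(a, x)))
    agg.present(partials.reduce((a1, a2) => agg.merge(a1, a2)))
  }

  val data: Seq[Seq[Double]] = Seq(Seq(2.0, 3.0, 2.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 5.0))
}
```

For example, `AggregatorSketch.run(AggregatorSketch.average, AggregatorSketch.data)` folds the nine values partition by partition and divides the merged sum by the merged count.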
Love Interest
Data Sketching: T-Digest
[Figure: a CDF curve rising from 0 to 1, with the point (x, q) marked at q = 0.9; x is the 90th percentile]
Is T-Digest an Aggregator?
Data Type:         Numeric
Accumulator Type:  T-Digest Sketch
Zero:              Empty T-Digest
Update:            tdigest + x
Merge:             tdigest1 + tdigest2
Present:           tdigest.cdfInverse(quantile)
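To make the zero / update / merge / present shape concrete, here is a toy stand-in written for this illustration. `ExactSketch` is hypothetical and is not a T-Digest: it stores every value exactly instead of compressing them into clusters, so it only shows the interface a sketch needs in order to act as an aggregator. The operators `+` and `++` follow the spelling used in the deck's code slides.

```scala
object ExactSketchDemo {
  // Hypothetical stand-in for a T-Digest: same interface shape, no compression.
  final case class ExactSketch(values: Vector[Double]) {
    // Update: add one data point.
    def +(x: Double): ExactSketch = ExactSketch(values :+ x)
    // Merge: combine two sketches.
    def ++(that: ExactSketch): ExactSketch = ExactSketch(values ++ that.values)
    // Present: smallest stored value whose empirical CDF is at least q.
    def cdfInverse(q: Double): Double =
      if (values.isEmpty) Double.NaN
      else {
        val sorted = values.sorted
        val idx = math.ceil(q * sorted.length).toInt - 1
        sorted(math.min(sorted.length - 1, math.max(0, idx)))
      }
  }

  // Zero: the empty sketch.
  val empty: ExactSketch = ExactSketch(Vector.empty)

  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 2.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 5.0))

  // Build one sketch per partition, then merge the partial sketches.
  val merged: ExactSketch =
    partitions.map(p => p.foldLeft(empty)((s, x) => s + x)).reduce((s1, s2) => s1 ++ s2)

  def main(args: Array[String]): Unit =
    println(s"p50 = ${merged.cdfInverse(0.5)}, p90 = ${merged.cdfInverse(0.9)}")
}
```

A real T-Digest offers the same four operations while keeping memory bounded, which is what makes it practical as an accumulator.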
Romantic Chemistry
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
Romantic Chemistry
val query = records
.writeStream //...
+---------+
|wordcount|
+---------+
|       12|
|        5|
|        9|
|       18|
|       12|
+---------+
val r = records.withColumn("time", current_timestamp())
  .groupBy(window($"time", "30 seconds"))
  .agg(sketchCDF($"wordcount").alias("CDF"))
  .select(callUDF("p50", $"CDF").alias("p50"),
          callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream //...
+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy
class TDigestUDAF(deltaV: Double, maxDiscreteV: Int)
    extends UserDefinedAggregateFunction {
  def initialize(buf: MutableAggregationBuffer): Unit =
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
      buf2.getAs[TDigestSQL](0).tdigest)
  def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
  // yada yada yada ...
}
User Defined Type Anatomy
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def sqlType: DataType = StructType(
    StructField("delta", DoubleType, false) ::
    StructField("maxDiscrete", IntegerType, false) ::
    StructField("nclusters", IntegerType, false) ::
    StructField("clustX", ArrayType(DoubleType, false), false) ::
    StructField("clustM", ArrayType(DoubleType, false), false) ::
    Nil)
  def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
  def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
  // yada yada yada ...
}
Expensive: serialize and deserialize
What Could Go Wrong?
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def serialize(tdsql: TDigestSQL): Any = {
    print("In serialize")
    // ...
  }
  def deserialize(datum: Any): TDigestSQL = {
    print("In deserialize")
    // ...
  }
  // yada yada yada ...
}
What Could Go Wrong?
[Figure: each partition, (2 3 2), (5 3 5) and (2 3 5), runs Init -> Updates -> Serialize once; the three serialized results then feed a single Merge]
Wait What?
val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
In deserialize
In serialize
In deserialize
In serialize
... 997 more times!
In deserialize
In serialize
Oh No
def update(buf: MutableAggregationBuffer, input: Row): Unit =
  buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
    input.getDouble(0))
// is equivalent to ...
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val tdigest = buf.getAs[TDigestSQL](0).tdigest  // deserialize
  val updated = tdigest + input.getDouble(0)      // do the actual update
  buf(0) = TDigestSQL(updated)                    // re-serialize
}
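The cost pattern can be imitated without Spark. The sketch below is a simulation under stated assumptions, not Spark's internals: `SerializedBuffer` is a hypothetical buffer that keeps its accumulator in serialized form, so every read counts as a deserialize and every write as a serialize. It contrasts the read-modify-write update above with an approach that keeps a live accumulator and serializes once per partition.

```scala
object SerdeCount {
  // Hypothetical buffer that only holds serialized state (an Array here);
  // reading it deserializes, writing it serializes.
  final class SerializedBuffer {
    var serializeCalls = 0
    var deserializeCalls = 0
    private var stored: Array[Double] = Array.empty
    def get: Vector[Double] = { deserializeCalls += 1; stored.toVector }
    def set(v: Vector[Double]): Unit = { serializeCalls += 1; stored = v.toArray }
  }

  // Read-modify-write on every row, as in the update method above.
  // Returns (deserialize calls, serialize calls).
  def perRow(rows: Seq[Double]): (Int, Int) = {
    val buf = new SerializedBuffer
    rows.foreach { x => buf.set(buf.get :+ x) }
    (buf.deserializeCalls, buf.serializeCalls)
  }

  // Keep a live accumulator for the whole partition and serialize it once at the end.
  def oncePerPartition(rows: Seq[Double]): (Int, Int) = {
    val buf = new SerializedBuffer
    val acc = rows.foldLeft(Vector.empty[Double])((a, x) => a :+ x)
    buf.set(acc)
    (buf.deserializeCalls, buf.serializeCalls)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq.fill(1000)(1.0)
    println(s"per row:            ${perRow(rows)}")
    println(s"once per partition: ${oncePerPartition(rows)}")
  }
}
```

With 1000 rows the per-row variant performs one deserialize and one serialize per row, which matches the 1000 printed pairs on the previous slide.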
SPARK-27296
Resolution
#25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int)
    extends Aggregator[Double, TDigestSQL, TDigestSQL] {
  def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
  def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
    TDigestSQL(b1.tdigest ++ b2.tdigest)
  def finish(b: TDigestSQL): TDigestSQL = b
  val serde = ExpressionEncoder[TDigestSQL]()
  def bufferEncoder: Encoder[TDigestSQL] = serde
  def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization
[Figure: each partition, (2 3 2), (5 3 5) and (2 3 5), runs Init -> Updates -> Serialize once; the three serialized results then feed a single Merge]
Custom Aggregation in Spark 3.0
import org.apache.spark.sql.functions.udaf
import org.apache.spark.sql.expressions.UserDefinedFunction
val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
70x Faster
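The 70x figure can be checked against the two benchmark arrays. The slides do not say how the ratio was derived, so comparing the means of the five samples is an assumption made here; `Speedup` and `mean` are names introduced for this sketch.

```scala
object Speedup {
  // Sample timings copied from the benchmark output on the Performance slide.
  val oldRuns: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
  val newRuns: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)

  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  // Ratio of mean timings (assumed method; the slides only state the result).
  val ratio: Double = mean(oldRuns) / mean(newRuns)

  def main(args: Array[String]): Unit =
    println(f"mean old = ${mean(oldRuns)}%.3f, mean new = ${mean(newRuns)}%.4f, ratio = $ratio%.1f")
}
```

The means are about 8.369 and 0.1196, so the ratio comes out just under 70, in line with the headline number.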
Epilogue
Don’t Give Up
Patience
Respect
Erik Erlandson
Principal Software Engineer
eje@redhat.com
@ManyAngled

Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...HyderabadDolls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 

Dernier (20)

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service AvailableVastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 

User Defined Aggregation in Apache Spark: A Love Story

  • 1. User Defined Aggregation In Apache Spark: A Love Story. Erik Erlandson, Principal Software Engineer
  • 2. All Love Stories Are The Same: Hero Meets Aggregators; Hero Files Spark JIRA; Hero Merges Spark PR
  • 5. Spark’s Scale-Out World. Physical: three partitions (2 3 2), (5 3 5), (2 3 5). Logical: one data set 2 3 2 5 3 5 2 3 5
  • 7. Scale-Out Sum: partition (2 3 5), s = s + 2 (s is 2)
  • 8. Scale-Out Sum: partition (2 3 5), s = s + 3 (s is 5)
  • 9. Scale-Out Sum: partition (2 3 5), s = s + 5 (s is 10)
  • 11. Scale-Out Sum: partial sums (2 3 5): 10; (5 3 5): 13
  • 12. Scale-Out Sum: partial sums (2 3 5): 10; (5 3 5): 13; (2 3 2): 7
  • 13. Scale-Out Sum: merge partial sums: 13 + 7 = 20, with (2 3 5): 10 still to merge
  • 14. Scale-Out Sum: merge partial sums: 10 + 20 = 30
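The scale-out sum walked through in slides 7 to 14 can be sketched in plain Scala. This is a toy model, not Spark code: ordinary collections stand in for partitions, and the object and method names (`ScaleOutSum`, `partial`) are made up for illustration.

```scala
// Toy model of scale-out sum: plain Scala collections stand in for Spark partitions.
// ScaleOutSum and partial are hypothetical names, not part of any Spark API.
object ScaleOutSum {
  // The three partitions used on the Scale-Out Sum slides.
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

  // Fold one partition, starting from the zero value s = 0.
  def partial(part: Seq[Int]): Int = part.foldLeft(0)((s, x) => s + x)

  // Each partition is summed independently; the partial sums are then merged.
  val partials: Seq[Int] = partitions.map(partial)
  val total: Int = partials.reduce(_ + _)
}
```

Because addition is associative, the order in which the partial sums are merged does not change the total, which is what lets each partition be processed on its own.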
  • 17. Spark Aggregators
      Operation | Data    | Accumulator  | Zero   | Update               | Merge
      Sum       | Numbers | Number       | 0      | a + x                | a1 + a2
      Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)
      Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2)
      Present (Average): sum / count
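The zero / update / merge / present pattern from the Spark Aggregators table can be written down as a small trait. The sketch below is plain Scala under stated assumptions: `SimpleAgg`, `run`, `MaxAgg` and `AverageAgg` are hypothetical names for illustration and are not Spark's `Aggregator` class.

```scala
// Hypothetical trait for illustration only; this is not Spark's Aggregator class.
trait SimpleAgg[X, A, R] {
  def zero: A                  // empty accumulator
  def update(a: A, x: X): A    // fold one data element into an accumulator
  def merge(a1: A, a2: A): A   // combine two partial accumulators
  def present(a: A): R         // turn the final accumulator into a result

  // Fold each partition from zero, merge the partial results, then present.
  def run(partitions: Seq[Seq[X]]): R =
    present(partitions.map(_.foldLeft(zero)(update)).foldLeft(zero)(merge))
}

// The "Max" row of the table.
object MaxAgg extends SimpleAgg[Double, Double, Double] {
  def zero: Double = Double.NegativeInfinity
  def update(a: Double, x: Double): Double = math.max(a, x)
  def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
  def present(a: Double): Double = a
}

// The "Average" row: the accumulator is a (sum, count) pair.
object AverageAgg extends SimpleAgg[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
  def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
    (a1._1 + a2._1, a1._2 + a2._2)
  def present(a: (Double, Long)): Double = a._1 / a._2
}
```

Average shows why a separate present step is useful: the accumulator has to carry both sum and count so that partial results can be merged, and the division happens only once at the end.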
  • 19. Data Sketching: T-Digest. [Figure: CDF plotted from 0 to 1 with the point (x, q) marked; for q = 0.9, x is the 90th percentile]
  • 21. Is T-Digest an Aggregator?
      Data Type: Numeric
      Accumulator Type: T-Digest Sketch
      Zero: Empty T-Digest
      Update: tdigest + x
      Merge: tdigest1 + tdigest2
      Present: tdigest.cdfInverse(quantile)
  • 23. Romantic Chemistry
      val sketchCDF = tdigestUDAF[Double]
      spark.udf.register("p50",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
      spark.udf.register("p90",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  • 24. Romantic Chemistry
      val query = records
        .writeStream //...
      +---------+
      |wordcount|
      +---------+
      |       12|
      |        5|
      |        9|
      |       18|
      |       12|
      +---------+
      val r = records.withColumn("time", current_timestamp())
        .groupBy(window($"time", "30 seconds"))
        .agg(sketchCDF($"wordcount").alias("CDF"))
        .select(callUDF("p50", $"CDF").alias("p50"),
                callUDF("p90", $"CDF").alias("p90"))
      val query = r.writeStream //...
      +----+----+
      | p50| p90|
      +----+----+
      |15.6|31.0|
      |16.0|30.8|
      |15.8|30.0|
      |15.7|31.0|
      |16.0|31.0|
      +----+----+
  • 25. Romantic Montage: Sketching Data with T-Digest In Apache Spark; Smart Scalable Feature Reduction With Random Forests; One-Pass Data Science In Apache Spark With Generative T-Digests; Apache Spark for Library Developers; Extending Structured Streaming Made Easy with Algebra
  • 27. UDAF Anatomy
      class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends UserDefinedAggregateFunction {
        def initialize(buf: MutableAggregationBuffer): Unit =
          buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def update(buf: MutableAggregationBuffer, input: Row): Unit =
          buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
        def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
          buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
                               buf2.getAs[TDigestSQL](0).tdigest)
        def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
        // yada yada yada ...
      }
  • 30. User Defined Type Anatomy
      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def sqlType: DataType = StructType(
          StructField("delta", DoubleType, false) ::
          StructField("maxDiscrete", IntegerType, false) ::
          StructField("nclusters", IntegerType, false) ::
          StructField("clustX", ArrayType(DoubleType, false), false) ::
          StructField("clustM", ArrayType(DoubleType, false), false) ::
          Nil)
        def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
        def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
        // yada yada yada ...
      }
      (slide annotation: Expensive)
  • 31. What Could Go Wrong?
      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def serialize(tdsql: TDigestSQL): Any = {
          print("In serialize")
          // ...
        }
        def deserialize(datum: Any): TDigestSQL = {
          print("In deserialize")
          // ...
        }
        // yada yada yada ...
      }
  • 32. What Could Go Wrong? Each partition, (2 3 2), (5 3 5) and (2 3 5), runs Init, Updates, Serialize; the serialized results then go to Merge
  • 33. Wait What?
      val sketchCDF = tdigestUDAF[Double]
      val data = /* data frame with 1000 rows of data */
      val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
      In deserialize
      In serialize
      In deserialize
      In serialize
      … 997 more times!
      In deserialize
      In serialize
  • 37. Oh No
      def update(buf: MutableAggregationBuffer, input: Row): Unit =
        buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
      // is equivalent to ...
      def update(buf: MutableAggregationBuffer, input: Row): Unit = {
        val tdigest = buf.getAs[TDigestSQL](0).tdigest  // deserialize
        val updated = tdigest + input.getDouble(0)      // do the actual update
        buf(0) = TDigestSQL(updated)                    // re-serialize
      }
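The cost of the per-row round trip described on the "Oh No" slides can be illustrated with a toy model that simply counts serialize and deserialize calls. This is plain Scala and does not reproduce Spark internals: the accumulator is a (sum, count) pair standing in for a T-Digest, the serialized form is a string, and all names (`SerdeCount`, `perRowSerde`, `oncePerPartition`) are hypothetical.

```scala
// Toy model only: counts serialize/deserialize calls and does not use Spark.
// A (sum, count) pair stands in for the T-Digest; all names are hypothetical.
object SerdeCount {
  var serCalls = 0
  var deCalls = 0
  def reset(): Unit = { serCalls = 0; deCalls = 0 }

  def serialize(acc: (Double, Long)): String = {
    serCalls += 1
    s"${acc._1},${acc._2}"
  }

  def deserialize(s: String): (Double, Long) = {
    deCalls += 1
    val parts = s.split(",")
    (parts(0).toDouble, parts(1).toLong)
  }

  // UDAF-style: the buffer holds the serialized form, so every update
  // deserializes, applies the update, and re-serializes.
  def perRowSerde(data: Seq[Double]): (Double, Long) = {
    var buf = serialize((0.0, 0L))
    for (x <- data) {
      val acc = deserialize(buf)
      buf = serialize((acc._1 + x, acc._2 + 1))
    }
    deserialize(buf)
  }

  // Aggregator-style: update the in-memory accumulator and serialize once.
  def oncePerPartition(data: Seq[Double]): String =
    serialize(data.foldLeft((0.0, 0L)) { case ((s, c), x) => (s + x, c + 1) })
}
```

In this model the number of serialize and deserialize calls in `perRowSerde` grows with the number of rows, while `oncePerPartition` serializes exactly once regardless of input size. When packing and unpacking the accumulator is expensive, as with the T-Digest UDT, that difference dominates the run time.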
  • 41. Aggregator Anatomy
      class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
          Aggregator[Double, TDigestSQL, TDigestSQL] {
        def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
        def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
          TDigestSQL(b1.tdigest ++ b2.tdigest)
        def finish(b: TDigestSQL): TDigestSQL = b
        val serde = ExpressionEncoder[TDigestSQL]()
        def bufferEncoder: Encoder[TDigestSQL] = serde
        def outputEncoder: Encoder[TDigestSQL] = serde
      }
  • 42. Intuitive Serialization: each partition, (2 3 2), (5 3 5) and (2 3 5), runs Init, Updates, Serialize; the serialized results then go to Merge
  • 43. Custom Aggregation in Spark 3.0
      import org.apache.spark.sql.functions.udaf
      val sketchAgg = TDigestAggregator(0.5, 0)
      val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
      val sketch = data.agg(sketchCDF($"column")).first
  • 45. Performance
      scala> val sketchOld = TDigestUDAF(0.5, 0)
      sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
      scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
      res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
      scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
      sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
      scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
      res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
      70x Faster
  • 50. Erik Erlandson, Principal Software Engineer. eje@redhat.com, @ManyAngled