SlideShare une entreprise Scribd logo
1  sur  48
Télécharger pour lire hors ligne
Deep Dive Into Catalyst:
Apache Spark 2.0’s Optimizer
Yin Huai
Spark Summit 2016
2
SELECT count(*)
FROM (
SELECT t1.id
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Write Programs Using RDD API
Solution 1
3
val count = t1.cartesian(t2).filter {
case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2
}.filter {
case (id1FromT1, id2FromT2) => id1FromT2 > 50 * 1000
}.map {
case (id1FromT1, id2FromT2) => id1FromT1
}.count
println("Count: " + count)
t1 join t2
t1.id = t2.id
t2.id > 50 * 1000
Solution 2
4
val filteredT2 =
t2.filter(id1FromT2 => id1FromT2 > 50 * 1000)
val preparedT1 =
t1.map(id1FromT1 => (id1FromT1, id1FromT1))
val preparedT2 =
filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2))
val count = preparedT1.join(preparedT2).map {
case (id1FromT1, _) => id1FromT1
}.count
println("Count: " + count)
t2.id > 50 * 1000
t1 join t2
WHERE t1.id = t2.id
Solution 1 vs. Solution 2
0.26
53.92
Executiontime(s) Solution 1 Solution 2
5
~ 200x
Solution 1
6
val count = t1.cartesian(t2).filter {
case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2
}.filter {
case (id1FromT1, id2FromT2) => id1FromT2 > 50 * 1000
}.map {
case (id1FromT1, id2FromT2) => id1FromT1
}.count
println("Count: " + count)
t1 join t2
t1.id = t2.id
t2.id > 50 * 1000
Solution 2
7
val filteredT2 =
t2.filter(id1FromT2 => id1FromT2 > 50 * 1000)
val preparedT1 =
t1.map(id1FromT1 => (id1FromT1, id1FromT1))
val preparedT2 =
filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2))
val count = preparedT1.join(preparedT2).map {
case (id1FromT1, _) => id1FromT1
}.count
println("Count: " + count)
t2.id > 50 * 1000
t1 join t2
WHERE t1.id = t2.id
Write Programs Using RDD API
• Users’ functions are black boxes
• Opaque computation
• Opaque data type
• Programs built using RDD API have total control on
how to executeevery data operation
• Developers have to write efficient programs for
different kinds of workloads
8
9
Is there an easy way to write efficient programs?
10
The easiest way to write efficient
programs is to not worry about it and get
your programs automatically optimized
How?
• Write programs usinghigh level programminginterfaces
• Programs are used to describe what data operations are needed
without specifying how to executethoseoperations
• High level programming interfaces: SQL, DataFrame, and Dataset
• Get an optimizer thatautomatically finds outthe most
efficientplan to execute data operationsspecifiedin the
user’sprogram
11
12
Catalyst:
Apache Spark’s Optimizer
13
How Catalyst Works: An Overview
SQL AST
DataFrame
Dataset
(Java/Scala)
Query Plan
Optimized
Query Plan
RDDs
CodeGenerationTransformations
Catalyst
Abstractionsof users’programs
(Trees)
14
How Catalyst Works: An Overview
SQL AST
DataFrame
Dataset
(Java/Scala)
Query Plan
Optimized
Query Plan
RDDs
CodeGenerationTransformations
Catalyst
Abstractions of users’ programs
(Trees)
15
Trees: Abstractions of Users’ Programs
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
16
Trees: Abstractions of Users’ Programs
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Expression
• An expressionrepresentsa
new value, computed based
on input values
• e.g. 1 + 2 + t1.value
• Attribute: A column of a
dataset (e.g. t1.id) or a
column generatedby a
specific data operation (e.g.
v)
17
Trees: Abstractions of Users’ Programs
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Query Plan
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
Logical Plan
• A Logical Plan describescomputation
on datasets without defining how to
conductthe computation
• output: a list of attributes generated
by this Logical Plan, e.g. [id, v]
• constraints: a setof invariants about
the rows generatedby this plan, e.g.
t2.id > 50 * 1000
18
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
Physical Plan
• A Physical Plan describescomputation
on datasets with specific definitions on
how to conductthe computation
• A Physical Plan is executable
19
Parquet Scan
(t1)
JSONScan
(t2)
Sort-Merge
Join
Filter
Project
Hash-
Aggregate
sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
20
How Catalyst Works: An Overview
SQL AST
DataFrame
Dataset
(Java/Scala)
Query Plan
Optimized
Query Plan
RDDs
CodeGenerationTransformations
Catalyst
Abstractionsof users’programs
(Trees)
Transformations
• Transformations without changing the tree type
(Transformand Rule Executor)
• Expression => Expression
• Logical Plan => Logical Plan
• Physical Plan => Physical Plan
• Transforming a tree to another kind of tree
• Logical Plan => Physical Plan
21
• A function associated with everytree used to
implement a single rule
Transform
22
Attribute
(t1.value)
Add
Add
Literal(1) Literal(2)
1 + 2 + t1.value
Attribute
(t1.value)
Add
Literal(3)
3+ t1.valueEvaluate 1 + 2 onceEvaluate 1 + 2
for every row
Transform
• A transformation is defined as a Partial Function
• Partial Function: A function that is defined for a subset
of its possible arguments
23
val expression: Expression = ...
expression.transform {
case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
Literal(x + y)
}
Case statement determineifthe partialfunction is definedfora given input
val expression: Expression = ...
expression.transform {
case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
Literal(x + y)
}
Transform
24
Attribute
(t1.value)
Add
Add
Literal(1) Literal(2)
1 + 2 + t1.value
val expression: Expression = ...
expression.transform {
case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
Literal(x + y)
}
Transform
25
Attribute
(t1.value)
Add
Add
Literal(1) Literal(2)
1 + 2 + t1.value
val expression: Expression = ...
expression.transform {
case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
Literal(x + y)
}
Transform
26
Attribute
(t1.value)
Add
Add
Literal(1) Literal(2)
1 + 2 + t1.value
val expression: Expression = ...
expression.transform {
case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
Literal(x + y)
}
Transform
27
Attribute
(t1.value)
Add
Add
Literal(1) Literal(2)
1 + 2 + t1.value
Attribute
(t1.value)
Add
Literal(3)
3+ t1.value
Combining Multiple Rules
28
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
Predicate Pushdown
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t2.id>50*1000
t1.id=t2.id
Combining Multiple Rules
29
Constant Folding
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t2.id>50*1000
t1.id=t2.id
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Combining Multiple Rules
30
Column Pruning
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Project Project
t1.id
t1.value t2.id
Combining Multiple Rules
31
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Project Projectt1.id
t1.value
t2.id
Before transformations
After transformations
Combining Multiple Rules: Rule Executor
32
Batch1 Batch2 Batchn…
Rule 1
Rule 2
…
Rule 1
Rule 2
…
A Rule Executor transforms a Tree to anothersame type Tree by
applying many rulesdefined in batches
Every rule is implemented
based on Transform
Approaches of
applying rules
1. Fixed point
2. Once
Transformations
• Transformations without changing the tree type
(Transformand Rule Executor)
• Expression => Expression
• Logical Plan => Logical Plan
• Physical Plan => Physical Plan
• Transforming a tree to another kind of tree
• Logical Plan => Physical Plan
33
From Logical Plan to Physical Plan
• A Logical Plan is transformedto a Physical Plan by applying
a set of Strategies
• Every Strategy uses patternmatchingto converta Tree to
anotherkind of Tree
34
object BasicOperators extends Strategy {
def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
…
case logical.Project(projectList, child) =>
execution.ProjectExec(projectList, planLater(child)) :: Nil
case logical.Filter(condition, child) =>
execution.FilterExec(condition, planLater(child)) :: Nil
…
}
} Triggers other Strategies
35
SQL AST
DataFrame
Dataset
(Java/Scala)
Query Plan
Optimized
Query Plan
RDDs
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
Selected
Physical Plan
CostModel
Physical
Plans
Catalog
Analysis
Logical
Optimization
Physical
Planning
Catalyst
36
Unresolved
Logical Plan
Logical Plan
Optimized
Logical Plan
Selected
Physical Plan
CostModel
Physical
Plans
Catalog
Analysis
Logical
Optimization
Physical
Planning
• Analysis(Rule Executor):Transformsan UnresolvedLogical
Plan to a ResolvedLogical Plan
• Unresolved => Resolved: Use Catalog to find where datasets and
columns are coming from and types of columns
• Logical Optimization (Rule Executor): Transformsa Resolved
Logical Plan to an Optimized Logical Plan
• Physical Planning (Strategies+ Rule Executor):Transformsa
Optimized Logical Plan to a Physical Plan
Spark’s Planner
• 1st Phase: Transformsthe Logical Plan to the Physical
Plan using Strategies
• 2nd Phase: Use a Rule Executor to make the Physical
Plan readyfor execution
• Prepare Scalar sub-queries
• Ensure requirementson inputrows
• Apply physical optimizations
37
Ensure Requirements on Input Rows
38
Sort-MergeJoin
[t1.id = t2.id]
t1
Requirements:
Rows Sorted by t1.id
t2
Requirements:
Rows Sorted by t2.id
Ensure Requirements on Input Rows
39
Sort-MergeJoin
[t1.id = t2.id]
t1
Requirements:
Rows Sorted by t1.id
Requirements:
Rows Sorted by t2.id
Sort
[t1.id]
t2
Sort
[t2.id]
Ensure Requirements on Input Rows
40
Sort-MergeJoin
[t1.id = t2.id]
t1
[sortedby t1.id]
Requirements:
Rows Sorted by t1.id
Requirements:
Rows Sorted by t2.id
Sort
[t1.id]
t2
Sort
[t2.id]What if t1 has been
sorted by t1.id?
Ensure Requirements on Input Rows
41
Sort-MergeJoin
[t1.id = t2.id]
t1
[sortedby t1.id]
Requirements:
Rows Sorted by t1.id
Requirements:
Rows Sorted by t2.id
t2
Sort
[t2.id]
42
Catalyst in Apache Spark
43
Spark Core (RDD)
Catalyst
DataFrame/DatasetSQL
ML Pipelines
Structured
Streaming
{ JSON }
JDBC
and more…
With Spark 2.0, we expect
most users to migrate to
high level APIs (SQL,
DataFrame, and Dataset)
Spark SQL
GraphFrames
Where to Start
• SourceCode:
• Trees: TreeNode,Expression, Logical Plan, and Physical Plan
• Transformations: Analyzer,Optimizer, and Planner
• Checkout previous pull requests
• Start to write code using Catalyst
44
45
~ 200 lines of changes
(95 lines are for tests)
Certain workloads can
run hundreds times
faster
SPARK-12032
46
~ 250 lines of changes
(99 lines are for tests)
Pivot table support
SPARK-8992
• Try latestversion of ApacheSpark and preview of Spark 2.0
Try Apache Spark with Databricks
47
http://databricks.com/try
Thank you.
Officehour:2:45pm–3:30pm @ ExpoHall

Contenu connexe

Tendances

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemDatabricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 

Tendances (20)

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Clipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving SystemClipper: A Low-Latency Online Prediction Serving System
Clipper: A Low-Latency Online Prediction Serving System
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
Use r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkrUse r tutorial part1, introduction to sparkr
Use r tutorial part1, introduction to sparkr
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 

En vedette

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalystTakuya UESHIN
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuSpark Summit
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesDatabricks
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2datamantra
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalystdatamantra
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionDatabricks
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Performance in Spark 2.0, PDX Spark Meetup 8/18/16
Performance in Spark 2.0, PDX Spark Meetup 8/18/16Performance in Spark 2.0, PDX Spark Meetup 8/18/16
Performance in Spark 2.0, PDX Spark Meetup 8/18/16pdx_spark
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 

En vedette (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Performance in Spark 2.0, PDX Spark Meetup 8/18/16
Performance in Spark 2.0, PDX Spark Meetup 8/18/16Performance in Spark 2.0, PDX Spark Meetup 8/18/16
Performance in Spark 2.0, PDX Spark Meetup 8/18/16
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 

Similaire à Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer

Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Qbeast
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Model Transformation Reuse
Model Transformation ReuseModel Transformation Reuse
Model Transformation Reusemiso_uam
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streamingTao Li
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareESUG
 
SQL Server Deep Dive, Denis Reznik
SQL Server Deep Dive, Denis ReznikSQL Server Deep Dive, Denis Reznik
SQL Server Deep Dive, Denis ReznikSigma Software
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep DiveVasia Kalavri
 
Introduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-finalIntroduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-finalM Malai
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerMydbops
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Mark Smith
 

Similaire à Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer (20)

Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Meetup talk
Meetup talkMeetup talk
Meetup talk
 
Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Model Transformation Reuse
Model Transformation ReuseModel Transformation Reuse
Model Transformation Reuse
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Deep dive into spark streaming
Deep dive into spark streamingDeep dive into spark streaming
Deep dive into spark streaming
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable Hardware
 
SQL Server Deep Dive, Denis Reznik
SQL Server Deep Dive, Denis ReznikSQL Server Deep Dive, Denis Reznik
SQL Server Deep Dive, Denis Reznik
 
Analytics with Spark
Analytics with SparkAnalytics with Spark
Analytics with Spark
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Introduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-finalIntroduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-final
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizer
 
Javascript.ppt
Javascript.pptJavascript.ppt
Javascript.ppt
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 

Dernier (20)

ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 

Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer

  • 1. Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer Yin Huai Spark Summit 2016
  • 2. 2 SELECT count(*) FROM ( SELECT t1.id FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp Write Programs Using RDD API
  • 3. Solution 1 3 val count = t1.cartesian(t2).filter { case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2 }.filter { case (id1FromT1, id2FromT2) => id1FromT2 > 50 * 1000 }.map { case (id1FromT1, id2FromT2) => id1FromT1 }.count println("Count: " + count) t1 join t2 t1.id = t2.id t2.id > 50 * 1000
  • 4. Solution 2 4 val filteredT2 = t2.filter(id1FromT2 => id1FromT2 > 50 * 1000) val preparedT1 = t1.map(id1FromT1 => (id1FromT1, id1FromT1)) val preparedT2 = filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2)) val count = preparedT1.join(preparedT2).map { case (id1FromT1, _) => id1FromT1 }.count println("Count: " + count) t2.id > 50 * 1000 t1 join t2 WHERE t1.id = t2.id
  • 5. Solution 1 vs. Solution 2 0.26 53.92 Executiontime(s) Solution 1 Solution 2 5 ~ 200x
  • 6. Solution 1 6 val count = t1.cartesian(t2).filter { case (id1FromT1, id2FromT2) => id1FromT1 == id2FromT2 }.filter { case (id1FromT1, id2FromT2) => id1FromT2 > 50 * 1000 }.map { case (id1FromT1, id2FromT2) => id1FromT1 }.count println("Count: " + count) t1 join t2 t1.id = t2.id t2.id > 50 * 1000
  • 7. Solution 2 7 val filteredT2 = t2.filter(id1FromT2 => id1FromT2 > 50 * 1000) val preparedT1 = t1.map(id1FromT1 => (id1FromT1, id1FromT1)) val preparedT2 = filteredT2.map(id1FromT2 => (id1FromT2, id1FromT2)) val count = preparedT1.join(preparedT2).map { case (id1FromT1, _) => id1FromT1 }.count println("Count: " + count) t2.id > 50 * 1000 t1 join t2 WHERE t1.id = t2.id
  • 8. Write Programs Using RDD API • Users’ functions are black boxes • Opaque computation • Opaque data type • Programs built using RDD API have total control on how to executeevery data operation • Developers have to write efficient programs for different kinds of workloads 8
  • 9. 9 Is there an easy way to write efficient programs?
  • 10. 10 The easiest way to write efficient programs is to not worry about it and get your programs automatically optimized
  • 11. How? • Write programs usinghigh level programminginterfaces • Programs are used to describe what data operations are needed without specifying how to executethoseoperations • High level programming interfaces: SQL, DataFrame, and Dataset • Get an optimizer thatautomatically finds outthe most efficientplan to execute data operationsspecifiedin the user’sprogram 11
  • 13. 13 How Catalyst Works: An Overview SQL AST DataFrame Dataset (Java/Scala) Query Plan Optimized Query Plan RDDs CodeGenerationTransformations Catalyst Abstractionsof users’programs (Trees)
  • 14. 14 How Catalyst Works: An Overview SQL AST DataFrame Dataset (Java/Scala) Query Plan Optimized Query Plan RDDs CodeGenerationTransformations Catalyst Abstractions of users’ programs (Trees)
  • 15. 15 Trees: Abstractions of Users’ Programs SELECT sum(v) FROM ( SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
  • 16. 16 Trees: Abstractions of Users’ Programs SELECT sum(v) FROM ( SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp Expression • An expressionrepresentsa new value, computed based on input values • e.g. 1 + 2 + t1.value • Attribute: A column of a dataset (e.g. t1.id) or a column generatedby a specific data operation (e.g. v)
  • 17. 17 Trees: Abstractions of Users’ Programs SELECT sum(v) FROM ( SELECT t1.id, 1 + 2 + t1.value AS v FROM t1 JOIN t2 WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp Query Plan Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 1+2+t1.value as v t1.id=t2.id t2.id>50*1000
  • 18. Logical Plan • A Logical Plan describescomputation on datasets without defining how to conductthe computation • output: a list of attributes generated by this Logical Plan, e.g. [id, v] • constraints: a setof invariants about the rows generatedby this plan, e.g. t2.id > 50 * 1000 18 Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 1+2+t1.value as v t1.id=t2.id t2.id>50*1000
  • 19. Physical Plan • A Physical Plan describescomputation on datasets with specific definitions on how to conductthe computation • A Physical Plan is executable 19 Parquet Scan (t1) JSONScan (t2) Sort-Merge Join Filter Project Hash- Aggregate sum(v) t1.id, 1+2+t1.value as v t1.id=t2.id t2.id>50*1000
  • 20. 20 How Catalyst Works: An Overview SQL AST DataFrame Dataset (Java/Scala) Query Plan Optimized Query Plan RDDs CodeGenerationTransformations Catalyst Abstractionsof users’programs (Trees)
  • 21. Transformations • Transformations without changing the tree type (Transformand Rule Executor) • Expression => Expression • Logical Plan => Logical Plan • Physical Plan => Physical Plan • Transforming a tree to another kind of tree • Logical Plan => Physical Plan 21
  • 22. • A function associated with everytree used to implement a single rule Transform 22 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value Attribute (t1.value) Add Literal(3) 3+ t1.valueEvaluate 1 + 2 onceEvaluate 1 + 2 for every row
  • 23. Transform • A transformation is defined as a Partial Function • Partial Function: A function that is defined for a subset of its possible arguments 23 val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Case statement determineifthe partialfunction is definedfora given input
  • 24. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 24 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value
  • 25. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 25 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value
  • 26. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 26 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value
  • 27. val expression: Expression = ... expression.transform { case Add(Literal(x, IntegerType), Literal(y, IntegerType)) => Literal(x + y) } Transform 27 Attribute (t1.value) Add Add Literal(1) Literal(2) 1 + 2 + t1.value Attribute (t1.value) Add Literal(3) 3+ t1.value
  • 28. Combining Multiple Rules 28 Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 1+2+t1.value as v t1.id=t2.id t2.id>50*1000 Predicate Pushdown Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 1+2+t1.value as v t2.id>50*1000 t1.id=t2.id
  • 29. Combining Multiple Rules 29 Constant Folding Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 1+2+t1.value as v t2.id>50*1000 t1.id=t2.id Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 3+t1.value as v t2.id>50000 t1.id=t2.id
  • 30. Combining Multiple Rules 30 Column Pruning Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 3+t1.value as v t2.id>50000 t1.id=t2.id Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 3+t1.value as v t2.id>50000 t1.id=t2.id Project Project t1.id t1.value t2.id
  • 31. Combining Multiple Rules 31 Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 1+2+t1.value as v t1.id=t2.id t2.id>50*1000 Scan (t1) Scan (t2) Join Filter Project Aggregate sum(v) t1.id, 3+t1.value as v t2.id>50000 t1.id=t2.id Project Projectt1.id t1.value t2.id Before transformations After transformations
  • 32. Combining Multiple Rules: Rule Executor 32 Batch1 Batch2 Batchn… Rule 1 Rule 2 … Rule 1 Rule 2 … A Rule Executor transforms a Tree to anothersame type Tree by applying many rulesdefined in batches Every rule is implemented based on Transform Approaches of applying rules 1. Fixed point 2. Once
  • 33. Transformations • Transformations without changing the tree type (Transformand Rule Executor) • Expression => Expression • Logical Plan => Logical Plan • Physical Plan => Physical Plan • Transforming a tree to another kind of tree • Logical Plan => Physical Plan 33
  • 34. From Logical Plan to Physical Plan • A Logical Plan is transformedto a Physical Plan by applying a set of Strategies • Every Strategy uses patternmatchingto converta Tree to anotherkind of Tree 34 object BasicOperators extends Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { … case logical.Project(projectList, child) => execution.ProjectExec(projectList, planLater(child)) :: Nil case logical.Filter(condition, child) => execution.FilterExec(condition, planLater(child)) :: Nil … } } Triggers other Strategies
  • 35. 35 SQL AST DataFrame Dataset (Java/Scala) Query Plan Optimized Query Plan RDDs Unresolved Logical Plan Logical Plan Optimized Logical Plan Selected Physical Plan CostModel Physical Plans Catalog Analysis Logical Optimization Physical Planning Catalyst
  • 36. 36 Unresolved Logical Plan Logical Plan Optimized Logical Plan Selected Physical Plan CostModel Physical Plans Catalog Analysis Logical Optimization Physical Planning • Analysis(Rule Executor):Transformsan UnresolvedLogical Plan to a ResolvedLogical Plan • Unresolved => Resolved: Use Catalog to find where datasets and columns are coming from and types of columns • Logical Optimization (Rule Executor): Transformsa Resolved Logical Plan to an Optimized Logical Plan • Physical Planning (Strategies+ Rule Executor):Transformsa Optimized Logical Plan to a Physical Plan
  • 37. Spark’s Planner • 1st Phase: Transformsthe Logical Plan to the Physical Plan using Strategies • 2nd Phase: Use a Rule Executor to make the Physical Plan readyfor execution • Prepare Scalar sub-queries • Ensure requirementson inputrows • Apply physical optimizations 37
  • 38. Ensure Requirements on Input Rows 38 Sort-MergeJoin [t1.id = t2.id] t1 Requirements: Rows Sorted by t1.id t2 Requirements: Rows Sorted by t2.id
  • 39. Ensure Requirements on Input Rows 39 Sort-MergeJoin [t1.id = t2.id] t1 Requirements: Rows Sorted by t1.id Requirements: Rows Sorted by t2.id Sort [t1.id] t2 Sort [t2.id]
  • 40. Ensure Requirements on Input Rows 40 Sort-MergeJoin [t1.id = t2.id] t1 [sortedby t1.id] Requirements: Rows Sorted by t1.id Requirements: Rows Sorted by t2.id Sort [t1.id] t2 Sort [t2.id]What if t1 has been sorted by t1.id?
  • 41. Ensure Requirements on Input Rows 41 Sort-MergeJoin [t1.id = t2.id] t1 [sortedby t1.id] Requirements: Rows Sorted by t1.id Requirements: Rows Sorted by t2.id t2 Sort [t2.id]
  • 43. 43 Spark Core (RDD) Catalyst DataFrame/DatasetSQL ML Pipelines Structured Streaming { JSON } JDBC and more… With Spark 2.0, we expect most users to migrate to high level APIs (SQL, DataFrame, and Dataset) Spark SQL GraphFrames
  • 44. Where to Start • SourceCode: • Trees: TreeNode,Expression, Logical Plan, and Physical Plan • Transformations: Analyzer,Optimizer, and Planner • Checkout previous pull requests • Start to write code using Catalyst 44
  • 45. 45 ~ 200 lines of changes (95 lines are for tests) Certain workloads can run hundreds times faster SPARK-12032
  • 46. 46 ~ 250 lines of changes (99 lines are for tests) Pivot table support SPARK-8992
  • 47. • Try latestversion of ApacheSpark and preview of Spark 2.0 Try Apache Spark with Databricks 47 http://databricks.com/try