2. Who are we?
Stephan Kessler
Developer @ SAP, Walldorf
– SAP HANA Vora team
– Integration of the Vora query engine with Apache Spark
– Bringing new features and performance improvements to Apache Spark
– Before joining SAP: PhD and M.Sc. at the Karlsruhe Institute of Technology
– Research on privacy in databases and sensor networks
Santiago Mola
Developer @ Stratio, Madrid
– Working with the SAP HANA Vora team
– Focus on Apache Spark SQL extensions and data sources implementation
– Bootstrapped Stratio Sparkta, worked on Stratio Ingestion and helped customers build stream processing solutions
– Previously: CTO at Bitsnbrains, M.Sc. at the Polytechnic University of Valencia
3. SAP HANA Vora
• SAP HANA Vora is a SQL-on-Hadoop solution based on:
– In-memory columnar query execution engine with built-in query compilation
– Spark SQL extensions (will be Open Source soon!):
• OLAP extensions
• Hierarchy queries
• Extended Data Sources API (‘Push Down Everything’)
4. Spark SQL
Data Sources API
[Diagram: the Spark Core Engine (with MLlib, Streaming, …) connects through the Data Sources API to data sources such as CSV, HANA, and HANA Vora]
5. Motivation
• “The fastest way of processing data is not processing it at all!”
• The Data Sources API allows deferring the computation of filters and projections to the 'source'
– Less I/O spent reading
– Less memory spent
• But: data sources can also be full-blown databases
– Deferring parts of the logical plan brings additional benefits
→ The Pushdown of Everything
[Diagram: operations pushed down to the source: Project (Column1), Filter (Column2 > 20), Average (Column2)]
6. Implementing a Data Source
1. Create a 'DefaultSource' class that implements the trait (Schema)RelationProvider
trait SchemaRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation
}
2. The returned BaseRelation can implement the following traits
– TableScan
– PrunedScan
– PrunedFilteredScan
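A minimal sketch of such a source, assuming Spark 1.x-era APIs (the AttendeesRelation class and its constructor are illustrative assumptions; scan behavior is added by the traits on the next slides):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

// Entry point that Spark SQL looks up when this package is used as a format.
class DefaultSource extends SchemaRelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation =
    new AttendeesRelation(sqlContext, schema)
}

// A BaseRelation only has to expose its SQLContext and schema; scan
// behavior comes from mixing in TableScan, PrunedScan or PrunedFilteredScan.
class AttendeesRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation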
7. Full Scan
• The most basic form of reading data: read it all, sequentially.
• Implement the trait TableScan (a sketch follows below)
trait TableScan {
  def buildScan(): RDD[Row]
}
• SQL: SELECT * FROM table
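A hedged sketch, mixing TableScan into the illustrative AttendeesRelation from the previous slide (the hard-coded rows are assumptions for the example):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

class AttendeesRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with TableScan {

  // Full scan: hand every row with every column back to Spark;
  // the rows here are hard-coded purely for illustration.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(
      Row("Peter", 23, "London"),
      Row("John", 30, "New York"),
      Row("Stephan", 72, "Karlsruhe")))
}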
8. Pruned Scan
• Read all rows, but only a subset of columns
• Implement the trait PrunedScan (sketch below)
trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}
• SQL: SELECT <column list> FROM table
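Continuing the same illustrative relation, a PrunedScan sketch that returns only the requested columns, in the requested order (the Map-based data layout is an assumption for the example):

// Inside AttendeesRelation, now `extends BaseRelation with PrunedScan`:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

private val data = Seq(
  Map("name" -> "Peter", "age" -> 23, "hometown" -> "London"),
  Map("name" -> "John", "age" -> 30, "hometown" -> "New York"),
  Map("name" -> "Stephan", "age" -> 72, "hometown" -> "Karlsruhe"))

// Only materialize the columns Spark asked for, in the order it asked.
override def buildScan(requiredColumns: Array[String]): RDD[Row] =
  sqlContext.sparkContext.parallelize(
    data.map(row => Row.fromSeq(requiredColumns.map(row).toSeq)))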
9. Pruned Filtered Scan
• Can filter which rows are fetched (predicate push down).
• Implement the trait PrunedFilteredScan (sketch below)
trait PrunedFilteredScan {
  def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row]
}
• SQL: SELECT <column list> FROM table WHERE <predicate>
• Spark SQL allows basic predicates here (e.g. EqualTo, GreaterThan).
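A hedged PrunedFilteredScan sketch over the same illustrative data. Spark re-evaluates all filters on the returned rows, so handling a filter here only reduces the data fetched; ignoring one is safe:

// Inside AttendeesRelation, now `extends BaseRelation with PrunedFilteredScan`:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

// Evaluate the filters this source understands; anything else passes through.
private def matches(row: Map[String, Any], f: Filter): Boolean = f match {
  case EqualTo(attr, value)          => row(attr) == value
  case GreaterThan(attr, value: Int) => row(attr).asInstanceOf[Int] > value
  case _                             => true // leave unhandled filters to Spark
}

override def buildScan(
    requiredColumns: Array[String],
    filters: Array[Filter]): RDD[Row] = {
  val selected = data.filter(row => filters.forall(matches(row, _)))
  sqlContext.sparkContext.parallelize(
    selected.map(row => Row.fromSeq(requiredColumns.map(row).toSeq)))
}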
10. How does it work?
Assume the following table attendees:

Name    | Age | Hometown
--------+-----+----------
Peter   | 23  | London
John    | 30  | New York
Stephan | 72  | Karlsruhe
…       | …   | …

Query:
SELECT hometown, AVG(age) FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown
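For context, a hedged sketch of how a user would wire up such a source and run this query from Spark (the format string com.example.attendees is a hypothetical package containing a DefaultSource like the one sketched earlier; Spark 1.4+-style API):

import org.apache.spark.sql.types._

// Schema matching the attendees table above.
val attendeesSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("hometown", StringType)))

val df = sqlContext.read
  .format("com.example.attendees") // hypothetical package with the DefaultSource sketch
  .schema(attendeesSchema)
  .load()
df.registerTempTable("attendees")
sqlContext.sql(
  """SELECT hometown, AVG(age) FROM attendees
    |WHERE hometown = 'Amsterdam'
    |GROUP BY hometown""".stripMargin).show()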
11. How does it work?
Query:
SELECT hometown, AVG(age) FROM attendees
WHERE hometown = 'Amsterdam'
GROUP BY hometown

The query is parsed into this logical plan:

Aggregate (hometown, AVG(age))
  └ Filter (hometown = 'Amsterdam')
      └ Relation (data source): attendees
12. Example with TableScan

Logical plan:
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ Relation (data source): attendees

Physical plan (after planning):
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ PhysicalRDD (full scan)

SQL representation:
  Executed in the data source:
    SELECT name, age, hometown
    FROM attendees
  Executed in Spark SQL:
    SELECT hometown, AVG(age)
    FROM source
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown
14. Example with PrunedScan

Logical plan:
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ Relation (data source): attendees

Physical plan (after planning):
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ PhysicalRDD (pruned: age, hometown)

SQL representation:
  Executed in the data source:
    SELECT age, hometown
    FROM attendees
  Executed in Spark SQL:
    SELECT hometown, AVG(age)
    FROM source
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown
15. Example with PrunedFilteredScan

Logical plan:
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ Relation (data source): attendees

Physical plan (after planning):
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ PhysicalRDD (pruned: age, hometown; filtered: hometown = 'Amsterdam')

SQL representation:
  Executed in the data source:
    SELECT age, hometown
    FROM attendees
    WHERE hometown = 'Amsterdam'
  Executed in Spark SQL:
    SELECT hometown, AVG(age)
    FROM source
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown
16. How can we improve this?
• There are sources that can do more than filtering and pruning
– aggregation, joins, ...
• Some sources can execute more complex filters and functions
– Example: SELECT col1 + 1 WHERE col2 + col3 < col4
• The default Data Sources API cannot push these down
– even though they might be trivial for the data source to execute
• This leads to unnecessary work
– fetching more data
– not using the optimizations of the source
17. Enter the Catalyst Source API
• We implemented a new interface that data sources can implement to signal that they can push down complex queries.
• The complexity of pushed-down queries is arbitrary
– functions, set operators, joins, deeply nested subqueries, …
– even data source UDFs that are not supported in Spark
trait CatalystSource {
  def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean
  def supportsLogicalPlan(plan: LogicalPlan): Boolean
  def supportsExpression(expr: Expression): Boolean
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
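A hedged sketch of how a holistic, SQL-speaking source might implement this trait. CatalystSource is the trait shown above; the class name, the set of supported nodes, and the SQL-generation step are all assumptions for illustration:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical._

class SqlCatalystSource extends CatalystSource {

  // One holistic engine: Spark should push the whole plan, not per-partition parts.
  override def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean =
    false

  // This sketch advertises support for projects, filters and aggregates only.
  private def supportedNode(node: LogicalPlan): Boolean = node match {
    case _: Project | _: Filter | _: Aggregate | _: LeafNode => true
    case _ => false // e.g. joins, set operators: declined by this sketch
  }

  // A plan is supported only if every node in the tree is supported.
  override def supportsLogicalPlan(plan: LogicalPlan): Boolean =
    plan.collect { case n if !supportedNode(n) => n }.isEmpty

  override def supportsExpression(expr: Expression): Boolean = true

  // Translate the plan to the source's SQL dialect and execute it there.
  override def logicalPlanToRDD(plan: LogicalPlan): RDD[Row] =
    ??? // e.g. generate SQL from `plan` and wrap the result set in an RDD
}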
18. Partitioned and Holistic sources
• Holistic sources can compute queries that operate on the data set as a whole
– HANA, Cassandra, PostgreSQL, MongoDB
• Partitioned sources can compute queries that operate only over each partition
– Vora, Parquet, ORC, PostgreSQL instances in Postgres-XL
• Some can do both (to some degree)
• Our planner extensions optimize the push down for both cases if the data source implements the CatalystSource API.
19. Partitioned vs. Holistic Sources
[Diagram: a partitioned setup, where each physical node co-locates an HDFS data node, a Vora engine, and a Spark worker, next to holistic sources (SAP HANA, PostgreSQL, …) that Spark workers access as a whole]
20. Example with CatalystSource (partitioned execution)

Logical plan:
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ Relation (data source): attendees

Physical plan (after planning):
  Aggregate (hometown, SUM(PartialSum) / SUM(PartialCount))
    └ PhysicalRDD (CatalystSource)

SQL representation:
  Executed in the data source (per partition):
    SELECT hometown,
           SUM(age) AS PartialSum,
           COUNT(age) AS PartialCount
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown
  Executed in Spark SQL:
    SELECT hometown, SUM(PartialSum) / SUM(PartialCount)
    FROM source
    GROUP BY hometown
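To see why the split is correct, consider a hypothetical two-partition example: partition 1 holds Amsterdam ages {20, 30} (PartialSum = 50, PartialCount = 2) and partition 2 holds {40} (PartialSum = 40, PartialCount = 1). Spark then computes SUM(PartialSum) / SUM(PartialCount) = 90 / 3 = 30, the same result as AVG(age) over all rows.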
21. Example with CatalystSource (holistic source)

Logical plan:
  Aggregate (hometown, AVG(age))
    └ Filter (hometown = 'Amsterdam')
        └ Relation (data source): attendees

Physical plan (after planning):
  PhysicalRDD (CatalystSource)

SQL representation:
  Executed entirely in the data source:
    SELECT hometown, AVG(age)
    FROM attendees
    WHERE hometown = 'Amsterdam'
    GROUP BY hometown
22. Returned Rows
Assumption: the table has n rows.

TableScan / PrunedScan – returns n rows:
  SELECT name, age, hometown
  FROM attendees

PrunedFilteredScan – returns < n rows:
  SELECT age, hometown
  FROM attendees
  WHERE hometown = 'Amsterdam'

CatalystSource – returns << n rows (the number of distinct hometowns):
  SELECT hometown,
         SUM(age) AS PartialSum,
         COUNT(age) AS PartialCount
  FROM attendees
  WHERE hometown = 'Amsterdam'
  GROUP BY hometown
23. Advantages
• A single interface covers all queries.
• CatalystSource subsumes TableScan, PrunedScan and PrunedFilteredScan.
• Fine-grained control over which features the data source supports.
• Incremental implementation of a data source is possible
– start by supporting projects and filters, then add more.
• Opens the door to tighter integration with all kinds of databases.
– Dramatic performance improvements possible.
24. Current disadvantages and limitations
• Implementing CatalystSource for a rich data source (e.g., one supporting full SQL) is a complex task.
• The current implementation relies on (some) Spark APIs that are unstable.
– Backwards compatibility is not guaranteed.
• Pushing down a complex query can be slower than not pushing it down
– Examples:
• it overloads the data source
• it generates a result larger than its input tables
– CatalystSource implementors can work around this by marking such queries as unsupported.
25. What are the next steps?
• Improve the API to make it simpler for implementors
– add utilities to generate SQL
– add matchers that simplify working with logical plans
• Provide a stable API
– CatalystSource implementations should work with different Spark versions without modification.
• Provide a common trait to reduce boilerplate code
– Example: a data source implementing CatalystSource should not need to implement TableScan, PrunedScan or PrunedFilteredScan.
26. Summary
• Extension of the Data Sources API to push down arbitrary logical plans
• Leverages the functionality of the source to process less data
• Part of SAP HANA Vora
• Will be open source soon
Notes: 30 seconds about Data Sources API intro:
Data Sources API defines how Spark SQL can interact with an external source of data.
The Data Source can represent a file format on HDFS, a relational database, a web service…
With TableScan, everything is pulled from the data source: every row with every column. Then all further steps are performed in Spark.
Clarification: Here are three columns:
Logical plan.
Physical plan.
A SQL representation with the query that is executed in the data source and the query that is executed in Spark SQL. This is just an idealization, it does not mean that the data source actually uses SQL or that Spark SQL uses it internally.
With PrunedScan, we fetch all rows with a subset of columns. This can reduce I/O considerably.
PrunedFilteredScan works like PrunedScan, but adds a filter on rows according to a condition. This is equivalent to adding a WHERE clause.