SlideShare une entreprise Scribd logo
1  sur  48
Télécharger pour lire hors ligne
SPATIAL QUERY ON VANILLA DATABASES
Julian Hyde (Calcite PMC)
Spatial Query on Vanilla Databases
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like
r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people
are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as
HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements
much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes.
Calcite is bringing spatial query to the masses!
@julianhyde
SQL
Query planning
Query federation
BI & OLAP
Streaming
Hadoop
ASF member
Original author of Apache Calcite
PMC Apache Arrow, Calcite, Drill, Eagle, Kylin
Architect at Looker
Apache Calcite
Apache top-level project since 2015
Query planning framework used in many
projects and products
Also works standalone: embedded federated
query engine with SQL / JDBC front end
Apache community development model
https://calcite.apache.org
https://github.com/apache/calcite
SELECT d.name, COUNT(*) AS c
FROM Emps AS e
JOIN Depts AS d USING (deptno)
WHERE e.age < 40
GROUP BY d.deptno
HAVING COUNT(*) > 5
ORDER BY c DESC
Relational algebra
Based on set theory, plus operators:
Project, Filter, Aggregate, Union, Join,
Sort
Requires: declarative language (SQL),
query planner
Original goal: data independence
Enables: query optimization, new
algorithms and data structures
Scan [Emps] Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 30]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
SELECT d.name, COUNT(*) AS c
FROM (SELECT * FROM Emps
WHERE e.age < 40) AS e
JOIN Depts AS d USING (deptno)
GROUP BY d.deptno
HAVING COUNT(*) > 5
ORDER BY c DESC
Algebraic rewrite
Optimize by applying rewrite rules that
preserve semantics
Hopefully the result is less expensive;
but it’s OK if it’s not (planner keeps
“before” and “after”)
Planner uses dynamic programming,
seeking the lowest total cost
Scan [Emps] Scan [Depts]
Join [e.deptno = d.deptno]
Filter [e.age < 30]
Aggregate [deptno, COUNT(*) AS c]
Filter [c > 5]
Project [name, c]
Sort [c DESC]
Relational Spatial
A spatial query
Find all restaurants within 1.5 distance units of
my location (6, 7)
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
A spatial query
Find all restaurants within 1.5 distance units of
my location (6, 7)
Using OpenGIS SQL extensions:
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
Simple implementation
Using ESRI’s geometry-api-java library,
almost all ST_ functions were easy to
implement in Calcite.
Slow – one row at a time.
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
package org.apache.calcite.runtime;
import com.esri.core.geometry.*;
/** Simple implementations of built-in geospatial functions. */
public class GeoFunctions {
/** Returns the distance between g1 and g2. */
public static double ST_Distance(Geom g1, Geom g2) {
return GeometryEngine.distance(g1.g(), g2.g(), g1.sr());
}
/** Constructs a 2D point from coordinates. */
public static Geom ST_MakePoint(double x, double y) {
final Geometry g = new Point(x, y);
return new SimpleGeom(g);
}
/** Geometry. It may or may not have a spatial reference
* associated with it. */
public interface Geom {
Geometry g();
SpatialReference sr();
Geom transform(int srid);
Geom wrap(Geometry g);
}
static class SimpleGeom implements Geom { … }
}
Traditional DB indexing
techniques don’t work
Sort
Hash
CREATE /* b-tree */ INDEX
I_Restaurants
ON Restaurants(x, y);
CREATE TABLE Restaurants(
restaurant VARCHAR(20),
x INTEGER,
y INTEGER)
PARTITION BY (MOD(x + 5279 * y, 1024));
•
•
•
•
A scan over a two-dimensional index only
has locality in one dimension
A “vanilla database”
Master
Region
server
[A - Gh]
Region
server
[Gi - Ts]
Region
server
[Tr - Z]
Spatial data structures and algorithms
The challenge: Reduce dimensionality while preserving locality
● Reduce dimensionality – We want to warp the information space so that
we can access on one composite attribute rather than several
● Preserve locality – If two items are close in 2D, we want them to be close
in the information space (and in the same cache line or disk block)
Two main approaches to spatial data structures:
● Data-oriented
● Space-oriented
R-tree (a data-oriented structure)
R-tree (split vertically into 2)
R-tree (split horizontally into 4)
R-tree (split vertically into 8)
R-tree (split horizontally into 16)
R-tree (split vertically into 32)
Grid (a space-oriented structure)
Grid (a space-oriented structure)
Spatial query
Find all restaurants within 1.5 distance units of
where I am:
restaurant x y
Zachary’s pizza 3 1
King Yen 7 7
Filippo’s 7 4
Station burger 5 6
SELECT *
FROM Restaurants AS r
WHERE ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
•
•
•
•
Zachary’s
pizza
Filippo’s
King
Yen
Station
burger
Hilbert space-filling curve
● A space-filling curve invented by mathematician David Hilbert
● Every (x, y) point has a unique position on the curve
● Points near to each other typically have Hilbert indexes close together
•
•
•
•
Add restriction based on h, a restaurant’s distance
along the Hilbert curve
Must keep original restriction due to false positives
Using Hilbert index
restaurant x y h
Zachary’s pizza 3 1 5
King Yen 7 7 41
Filippo’s 7 4 52
Station burger 5 6 36
Zachary’s
pizza
Filippo’s
SELECT *
FROM Restaurants AS r
WHERE (r.h BETWEEN 35 AND 42
OR r.h BETWEEN 46 AND 46)
AND ST_Distance(
ST_MakePoint(r.x, r.y),
ST_MakePoint(6, 7)) < 1.5
King
Yen
Station
burger
Telling the optimizer
1. Declare h as a generated column
2. Sort table by h
Planner can now convert spatial range
queries into a range scan
Does not require specialized spatial
index such as r-tree
Very efficient on a sorted table such as
HBase
CREATE TABLE Restaurants (
restaurant VARCHAR(20),
x DOUBLE,
y DOUBLE,
h DOUBLE GENERATED ALWAYS AS
ST_Hilbert(x, y) STORED)
SORT KEY (h);
restaurant x y h
Zachary’s pizza 3 1 5
Station burger 5 6 36
King Yen 7 7 41
Filippo’s 7 4 52
Algebraic rewrite
Scan [T]
Filter [ST_Distance(
ST_Point(T.X, T.Y),
ST_Point(x, y)) < d]
Scan [T]
Filter [(T.H BETWEEN h0 AND h1
OR T.H BETWEEN h2 AND h3)
AND ST_Distance(
ST_Point(T.X, T.Y),
ST_Point(x, y)) < d]
Constraint: Table T has a
column H such that:
H = Hilbert(X, Y)
FilterHilbertRule
x, y, d, hi
– constants
T – table
T.X, T.Y, T.H – columns
Variations on a theme
Several ways to say the same thing using OpenGIS functions:
● ST_Distance(ST_Point(X, Y), ST_Point(x, y)) < d
● ST_Distance(ST_Point(x, y), ST_Point(X, Y)) < d
● ST_DWithin(ST_Point(x, y), ST_Point(X, Y), d)
● ST_Contains(ST_Buffer(ST_Point(x, y), d), ST_Point(X, Y))
Other patterns can use Hilbert functions:
● ST_DWithin(ST_MakeLine(ST_Point(x1, y1), ST_Point(x2, y2)),
ST_Point(X, Y), d)
● ST_Contains(ST_PolyFromText('POLYGON((0 0,20 0,20 20,0 20,0 0))'),
ST_Point(X, Y), d)
More spatial queries
What state am I in? (1-point-to-1-polygon)
Which states does Yellowstone NP intersect?
(1-polygon-to-many-polygons)
Which US national park intersects with the most
states? (many-polygons-to-many-polygons,
followed by sort/limit)
More spatial queries
What state am I in? (point-to-polygon)
Which states does Yellowstone NP intersect?
(polygon-to-polygon)
SELECT *
FROM States AS s
WHERE ST_Intersects(s.geometry,
ST_MakePoint(6, 7))
SELECT *
FROM States AS s
WHERE ST_Intersects(s.geometry,
ST_GeomFromText('LINESTRING(...)'))
Tile index
Idaho
Montana
Nevada Utah Colorado
Wyoming
We cannot use space-filling curves, because
each region (state or park) is a set of points and
not known as planning time.
Divide regions into a (coarse) set of tiles. They
intersect only if some of their tiles intersect.
Tile index 6 7 8
3 4 5
0 1 2
Idaho
Montana
Nevada Utah Colorado
Wyoming
tileId state
0 Nevada
0 Utah
1 Utah
2 Colorado
2 Utah
3 Idaho
3 Nevada
3 Utah
4 Idaho
tileId state
4 Utah
4 Wyoming
5 Wyoming
6 Idaho
6 Montana
7 Montana
7 Wyoming
8 Montana
8 Wyoming
Aside: Materialized views
CREATE MATERIALIZED
VIEW EmpSummary AS
SELECT deptno, COUNT(*) AS c
FROM Emp
GROUP BY deptno;
Scan [Emps]
Scan
[EmpSummary]
Aggregate [deptno, count(*)]
empno name deptno
100 Fred 20
110 Barney 10
120 Wilma 30
130 Dino 10
deptno c
10 2
20 1
30 1
A materialized view is a table
that is defined by a query
The planner knows about the
mapping and can
transparently rewrite queries
to use it
Building the tile index
Use the ST_MakeGrid function to decompose each
state into a series of tiles
Store the results in a table, sorted by tile id
A materialized view is a table that remembers how
it was computed, so the planner can rewrite queries
to use it
CREATE MATERIALIZED VIEW StateTiles AS
SELECT s.stateId, t.tileId
FROM States AS s,
LATERAL TABLE(ST_MakeGrid(s.geometry, 4, 4)) AS t
6 7 8
3 4 5
0 1 2
Idaho
Montana
Nevada Utah Colorado
Wyoming
Point-to-polygon query
What state am I in? (point-to-polygon)
1. Divide the plane into tiles, and pre-compute
the state-tile intersections
2. Use this ‘tile index’ to narrow list of states
SELECT s.*
FROM States AS s
WHERE s.stateId IN (SELECT stateId
FROM StateTiles AS t
WHERE t.tileId = 8)
AND ST_Intersects(s.geometry, ST_MakePoint(6, 7))
6 7 8
3 4 5
0 1 2
Idaho
Montana
Nevada Utah Colorado
Wyoming
Algebraic rewrite
Scan [S]
Filter [ST_Intersects(
S.geometry,
ST_Point(x, y)]
SemiJoin [S.stateId = T.stateId]
Constraint #1: There is a table “Tiles” defined by
SELECT s.stateId, t.tileId FROM States AS s,
LATERAL TABLE(ST_MakeGrid(s.geometry, x, y)) AS t
Constraint #2: stateId is primary key of S
TileSemiJoinRule
Filter [ST_Intersects(
S.geometry,
ST_Point(x, y)]
Scan [S] Filter [T.tileId = 8]
Scan [T]
Streaming + spatial
Example query: Every minute, emit the number of journeys that have intersected
each city. (Some journeys intersect multiple cities.)
(Efficient implementation is left as an exercise to the reader. Probably involves
splitting journeys into tiles, partitioning by tile hash-code, intersecting with cities
in those tiles, then rolling up cities.)
SELECT STREAM c.name, COUNT(*)
FROM Journeys AS j
CROSS JOIN Cities AS c
ON ST_Intersects(c.geometry, j.geometry)
GROUP BY c.name, FLOOR(j.rowtime TO HOUR)
Summary
Traditional DB techniques (sort, hash) don’t work for 2-dimensional data
Spatial presents tough design choices:
● Space-oriented vs data-oriented algorithms
● General-purpose vs specialized data structures
Relational algebra unifies traditional and spatial:
● Use general-purpose structures
● Compose techniques (transactions, analytics, spatial, streaming)
● Must use space-oriented algorithms, because their dimensionality-reducing
mapping is known at planning time
Thank you! Questions?
@ApacheCalcite | @julianhyde | https://calcite.apache.org
Resources & credits
● [CALCITE-1616] Data profiler
● [CALCITE-1870] Lattice suggester
● [CALCITE-1861] Spatial indexes
● [CALCITE-1968] OpenGIS
● [CALCITE-1991] Generated columns
● Talk: “Data profiling with Apache Calcite” (Hadoop Summit, 2017)
● Talk: “SQL on everything, in memory” (Strata, 2014)
● Zhang, Qi, Stradling, Huang (2014). “Towards a Painless Index for Spatial
Objects”
● Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes
efficiently”
● https://www.census.gov/geo/maps-data/maps/2000popdistribution.html
● https://www.nasa.gov/mission_pages/NPP/news/earth-at-night.html
Extra slides
Architecture
Conventional database Calcite
Planning queries
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Table: splunk
Optimized query
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: splunk
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Calcite framework
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniquensss
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• FilterMergeRule
• AggregateUnionTransposeRule
• 100+ more
Global transformations
• Unification (materialized view)
• Column trimming
• De-correlation
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• RelDistribution (partitioning)
RelBuilder
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice

Contenu connexe

Tendances

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat SheetHortonworks
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Julian Hyde
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
All you need to know about CREATE STATISTICS
All you need to know about CREATE STATISTICSAll you need to know about CREATE STATISTICS
All you need to know about CREATE STATISTICSEDB
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON Padma shree. T
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamZheng Shao
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 

Tendances (20)

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat Sheet
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
All you need to know about CREATE STATISTICS
All you need to know about CREATE STATISTICSAll you need to know about CREATE STATISTICS
All you need to know about CREATE STATISTICS
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 

Similaire à Spatial query on vanilla databases

U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)Michael Rys
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Big Data Spain
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介Masayuki Matsushita
 
Best Practices for Migrating Legacy Data Warehouses into Amazon Redshift
Best Practices for Migrating Legacy Data Warehouses into Amazon RedshiftBest Practices for Migrating Legacy Data Warehouses into Amazon Redshift
Best Practices for Migrating Legacy Data Warehouses into Amazon RedshiftAmazon Web Services
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumertirlukachaitanya
 
Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Aman Sinha
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciencesalexstorer
 
Advanced SQL For Data Scientists
Advanced SQL For Data ScientistsAdvanced SQL For Data Scientists
Advanced SQL For Data ScientistsDatabricks
 
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...EDB
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Performing Data Science with HBase
Performing Data Science with HBasePerforming Data Science with HBase
Performing Data Science with HBaseWibiData
 
Love Your Database Railsconf 2017
Love Your Database Railsconf 2017Love Your Database Railsconf 2017
Love Your Database Railsconf 2017gisborne
 
U-SQL Query Execution and Performance Tuning
U-SQL Query Execution and Performance TuningU-SQL Query Execution and Performance Tuning
U-SQL Query Execution and Performance TuningMichael Rys
 
MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014
MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014
MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014Dave Stokes
 

Similaire à Spatial query on vanilla databases (20)

U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
 
Best Practices for Migrating Legacy Data Warehouses into Amazon Redshift
Best Practices for Migrating Legacy Data Warehouses into Amazon RedshiftBest Practices for Migrating Legacy Data Warehouses into Amazon Redshift
Best Practices for Migrating Legacy Data Warehouses into Amazon Redshift
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
 
Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciences
 
Advanced SQL For Data Scientists
Advanced SQL For Data ScientistsAdvanced SQL For Data Scientists
Advanced SQL For Data Scientists
 
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Performing Data Science with HBase
Performing Data Science with HBasePerforming Data Science with HBase
Performing Data Science with HBase
 
Love Your Database Railsconf 2017
Love Your Database Railsconf 2017Love Your Database Railsconf 2017
Love Your Database Railsconf 2017
 
Hive
HiveHive
Hive
 
U-SQL Query Execution and Performance Tuning
U-SQL Query Execution and Performance TuningU-SQL Query Execution and Performance Tuning
U-SQL Query Execution and Performance Tuning
 
MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014
MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014
MySQL Query Tuning for the Squeemish -- Fossetcon Orlando Sep 2014
 

Plus de Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're IncubatingJulian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 

Plus de Julian Hyde (15)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 

Dernier

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 

Dernier (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

Spatial query on vanilla databases

  • 1. SPATIAL QUERY ON VANILLA DATABASES Julian Hyde (Calcite PMC)
  • 2. Spatial Query on Vanilla Databases Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them. In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
  • 3. @julianhyde SQL Query planning Query federation BI & OLAP Streaming Hadoop ASF member Original author of Apache Calcite PMC Apache Arrow, Calcite, Drill, Eagle, Kylin Architect at Looker
  • 4.
  • 5. Apache Calcite Apache top-level project since 2015 Query planning framework used in many projects and products Also works standalone: embedded federated query engine with SQL / JDBC front end Apache community development model https://calcite.apache.org https://github.com/apache/calcite
  • 6. SELECT d.name, COUNT(*) AS c FROM Emps AS e JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC Relational algebra Based on set theory, plus operators: Project, Filter, Aggregate, Union, Join, Sort Requires: declarative language (SQL), query planner Original goal: data independence Enables: query optimization, new algorithms and data structures Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age < 30] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]
  • 7. SELECT d.name, COUNT(*) AS c FROM (SELECT * FROM Emps WHERE e.age < 40) AS e JOIN Depts AS d USING (deptno) GROUP BY d.deptno HAVING COUNT(*) > 5 ORDER BY c DESC Algebraic rewrite Optimize by applying rewrite rules that preserve semantics Hopefully the result is less expensive; but it’s OK if it’s not (planner keeps “before” and “after”) Planner uses dynamic programming, seeking the lowest total cost Scan [Emps] Scan [Depts] Join [e.deptno = d.deptno] Filter [e.age < 30] Aggregate [deptno, COUNT(*) AS c] Filter [c > 5] Project [name, c] Sort [c DESC]
  • 9. A spatial query Find all restaurants within 1.5 distance units of my location (6, 7) restaurant x y Zachary’s pizza 3 1 King Yen 7 7 Filippo’s 7 4 Station burger 5 6 • • • • Zachary’s pizza Filippo’s King Yen Station burger
  • 10. A spatial query Find all restaurants within 1.5 distance units of my location (6, 7) Using OpenGIS SQL extensions: restaurant x y Zachary’s pizza 3 1 King Yen 7 7 Filippo’s 7 4 Station burger 5 6 SELECT * FROM Restaurants AS r WHERE ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 • • • • Zachary’s pizza Filippo’s King Yen Station burger
  • 11. Simple implementation Using ESRI’s geometry-api-java library, almost all ST_ functions were easy to implement in Calcite. Slow – one row at a time. SELECT * FROM Restaurants AS r WHERE ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 package org.apache.calcite.runtime; import com.esri.core.geometry.*; /** Simple implementations of built-in geospatial functions. */ public class GeoFunctions { /** Returns the distance between g1 and g2. */ public static double ST_Distance(Geom g1, Geom g2) { return GeometryEngine.distance(g1.g(), g2.g(), g1.sr()); } /** Constructs a 2D point from coordinates. */ public static Geom ST_MakePoint(double x, double y) { final Geometry g = new Point(x, y); return new SimpleGeom(g); } /** Geometry. It may or may not have a spatial reference * associated with it. */ public interface Geom { Geometry g(); SpatialReference sr(); Geom transform(int srid); Geom wrap(Geometry g); } static class SimpleGeom implements Geom { … } }
  • 12. Traditional DB indexing techniques don’t work Sort Hash CREATE /* b-tree */ INDEX I_Restaurants ON Restaurants(x, y); CREATE TABLE Restaurants( restaurant VARCHAR(20), x INTEGER, y INTEGER) PARTITION BY (MOD(x + 5279 * y, 1024)); • • • • A scan over a two-dimensional index only has locality in one dimension
  • 13. A “vanilla database” Master Region server [A - Gh] Region server [Gi - Ts] Region server [Tr - Z]
  • 14. Spatial data structures and algorithms The challenge: Reduce dimensionality while preserving locality ● Reduce dimensionality – We want to warp the information space so that we can access on one composite attribute rather than several ● Preserve locality – If two items are close in 2D, we want them to be close in the information space (and in the same cache line or disk block) Two main approaches to spatial data structures: ● Data-oriented ● Space-oriented
  • 15.
  • 16.
  • 17.
  • 26. Spatial query Find all restaurants within 1.5 distance units of where I am: restaurant x y Zachary’s pizza 3 1 King Yen 7 7 Filippo’s 7 4 Station burger 5 6 SELECT * FROM Restaurants AS r WHERE ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 • • • • Zachary’s pizza Filippo’s King Yen Station burger
  • 27. Hilbert space-filling curve ● A space-filling curve invented by mathematician David Hilbert ● Every (x, y) point has a unique position on the curve ● Points near to each other typically have Hilbert indexes close together
  • 28. • • • • Add restriction based on h, a restaurant’s distance along the Hilbert curve Must keep original restriction due to false positives Using Hilbert index restaurant x y h Zachary’s pizza 3 1 5 King Yen 7 7 41 Filippo’s 7 4 52 Station burger 5 6 36 Zachary’s pizza Filippo’s SELECT * FROM Restaurants AS r WHERE (r.h BETWEEN 35 AND 42 OR r.h BETWEEN 46 AND 46) AND ST_Distance( ST_MakePoint(r.x, r.y), ST_MakePoint(6, 7)) < 1.5 King Yen Station burger
  • 29. Telling the optimizer 1. Declare h as a generated column 2. Sort table by h Planner can now convert spatial range queries into a range scan Does not require specialized spatial index such as r-tree Very efficient on a sorted table such as HBase CREATE TABLE Restaurants ( restaurant VARCHAR(20), x DOUBLE, y DOUBLE, h DOUBLE GENERATED ALWAYS AS ST_Hilbert(x, y) STORED) SORT KEY (h); restaurant x y h Zachary’s pizza 3 1 5 Station burger 5 6 36 King Yen 7 7 41 Filippo’s 7 4 52
  • 30. Algebraic rewrite Scan [T] Filter [ST_Distance( ST_Point(T.X, T.Y), ST_Point(x, y)) < d] Scan [T] Filter [(T.H BETWEEN h0 AND h1 OR T.H BETWEEN h2 AND h3) AND ST_Distance( ST_Point(T.X, T.Y), ST_Point(x, y)) < d] Constraint: Table T has a column H such that: H = Hilbert(X, Y) FilterHilbertRule x, y, d, hi – constants T – table T.X, T.Y, T.H – columns
  • 31. Variations on a theme Several ways to say the same thing using OpenGIS functions: ● ST_Distance(ST_Point(X, Y), ST_Point(x, y)) < d ● ST_Distance(ST_Point(x, y), ST_Point(X, Y)) < d ● ST_DWithin(ST_Point(x, y), ST_Point(X, Y), d) ● ST_Contains(ST_Buffer(ST_Point(x, y), d), ST_Point(X, Y)) Other patterns can use Hilbert functions: ● ST_DWithin(ST_MakeLine(ST_Point(x1, y1), ST_Point(x2, y2)), ST_Point(X, Y), d) ● ST_Contains(ST_PolyFromText('POLYGON((0 0,20 0,20 20,0 20,0 0))'), ST_Point(X, Y), d)
  • 32. More spatial queries What state am I in? (1-point-to-1-polygon) Which states does Yellowstone NP intersect? (1-polygon-to-many-polygons) Which US national park intersects with the most states? (many-polygons-to-many-polygons, followed by sort/limit)
  • 33. More spatial queries What state am I in? (point-to-polygon) Which states does Yellowstone NP intersect? (polygon-to-polygon) SELECT * FROM States AS s WHERE ST_Intersects(s.geometry, ST_MakePoint(6, 7)) SELECT * FROM States AS s WHERE ST_Intersects(s.geometry, ST_GeomFromText('LINESTRING(...)'))
  • 34. Tile index Idaho Montana Nevada Utah Colorado Wyoming We cannot use space-filling curves, because each region (state or park) is a set of points and not known as planning time. Divide regions into a (coarse) set of tiles. They intersect only if some of their tiles intersect.
  • 35. Tile index 6 7 8 3 4 5 0 1 2 Idaho Montana Nevada Utah Colorado Wyoming tileId state 0 Nevada 0 Utah 1 Utah 2 Colorado 2 Utah 3 Idaho 3 Nevada 3 Utah 4 Idaho tileId state 4 Utah 4 Wyoming 5 Wyoming 6 Idaho 6 Montana 7 Montana 7 Wyoming 8 Montana 8 Wyoming
  • 36. Aside: Materialized views CREATE MATERIALIZED VIEW EmpSummary AS SELECT deptno, COUNT(*) AS c FROM Emp GROUP BY deptno; Scan [Emps] Scan [EmpSummary] Aggregate [deptno, count(*)] empno name deptno 100 Fred 20 110 Barney 10 120 Wilma 30 130 Dino 10 deptno c 10 2 20 1 30 1 A materialized view is a table that is defined by a query The planner knows about the mapping and can transparently rewrite queries to use it
  • 37. Building the tile index Use the ST_MakeGrid function to decompose each state into a series of tiles Store the results in a table, sorted by tile id A materialized view is a table that remembers how it was computed, so the planner can rewrite queries to use it CREATE MATERIALIZED VIEW StateTiles AS SELECT s.stateId, t.tileId FROM States AS s, LATERAL TABLE(ST_MakeGrid(s.geometry, 4, 4)) AS t 6 7 8 3 4 5 0 1 2 Idaho Montana Nevada Utah Colorado Wyoming
  • 38. Point-to-polygon query What state am I in? (point-to-polygon) 1. Divide the plane into tiles, and pre-compute the state-tile intersections 2. Use this ‘tile index’ to narrow list of states SELECT s.* FROM States AS s WHERE s.stateId IN (SELECT stateId FROM StateTiles AS t WHERE t.tileId = 8) AND ST_Intersects(s.geometry, ST_MakePoint(6, 7)) 6 7 8 3 4 5 0 1 2 Idaho Montana Nevada Utah Colorado Wyoming
  • 39. Algebraic rewrite Scan [S] Filter [ST_Intersects( S.geometry, ST_Point(x, y)] SemiJoin [S.stateId = T.stateId] Constraint #1: There is a table “Tiles” defined by SELECT s.stateId, t.tileId FROM States AS s, LATERAL TABLE(ST_MakeGrid(s.geometry, x, y)) AS t Constraint #2: stateId is primary key of S TileSemiJoinRule Filter [ST_Intersects( S.geometry, ST_Point(x, y)] Scan [S] Filter [T.tileId = 8] Scan [T]
  • 40. Streaming + spatial Example query: Every minute, emit the number of journeys that have intersected each city. (Some journeys intersect multiple cities.) (Efficient implementation is left as an exercise to the reader. Probably involves splitting journeys into tiles, partitioning by tile hash-code, intersecting with cities in those tiles, then rolling up cities.) SELECT STREAM c.name, COUNT(*) FROM Journeys AS j CROSS JOIN Cities AS c ON ST_Intersects(c.geometry, j.geometry) GROUP BY c.name, FLOOR(j.rowtime TO HOUR)
  • 41. Summary Traditional DB techniques (sort, hash) don’t work for 2-dimensional data Spatial presents tough design choices: ● Space-oriented vs data-oriented algorithms ● General-purpose vs specialized data structures Relational algebra unifies traditional and spatial: ● Use general-purpose structures ● Compose techniques (transactions, analytics, spatial, streaming) ● Must use space-oriented algorithms, because their dimensionality-reducing mapping is known at planning time
  • 42. Thank you! Questions? @ApacheCalcite | @julianhyde | https://calcite.apache.org Resources & credits ● [CALCITE-1616] Data profiler ● [CALCITE-1870] Lattice suggester ● [CALCITE-1861] Spatial indexes ● [CALCITE-1968] OpenGIS ● [CALCITE-1991] Generated columns ● Talk: “Data profiling with Apache Calcite” (Hadoop Summit, 2017) ● Talk: “SQL on everything, in memory” (Strata, 2014) ● Zhang, Qi, Stradling, Huang (2014). “Towards a Painless Index for Spatial Objects” ● Harinarayan, Rajaraman, Ullman (1996). “Implementing data cubes efficiently” ● https://www.census.gov/geo/maps-data/maps/2000popdistribution.html ● https://www.nasa.gov/mission_pages/NPP/news/earth-at-night.html
  • 43.
  • 46. Planning queries MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc Table: splunk
  • 47. Optimized query MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: splunk Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc
  • 48. Calcite framework Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • FilterMergeRule • AggregateUnionTransposeRule • 100+ more Global transformations • Unification (materialized view) • Column trimming • De-correlation Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • RelDistribution (partitioning) RelBuilder JDBC driver Metadata Schema Table Function • TableFunction • TableMacro Lattice