SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Ioana Delaney, Suresh Thalamati
Spark Technology Center, IBM
Informational Referential
Integrity Constraints Support in
Apache Spark
About the Speakers
• Ioana Delaney
– Spark Technology Center, IBM
– DB2 Optimizer developer working in the areas of query semantics, rewrite, and
optimizer.
– Worked on various releases of DB2 LUW and DB2 with BLU Acceleration
– Apache Spark SQL Contributor
• Suresh Thalamati
– Spark Technology Center, IBM
– Apache Derby Committer and PMC Member, Apache Spark Contributor
– Worked on various releases of IBM BigInsights, Apache Derby, Informix
Database.
IBM Spark Technology Center
• Founded in 2015
• Location: 505 Howard St., San Francisco
• Web: http://spark.tc
• Twitter: @apachespark_tc
• Mission:
– Contribute intellectual and technical capital to
the Apache Spark community.
– Make the core technology enterprise and
cloud-ready.
– Build data science skills to drive intelligence
into business applications
http://bigdatauniversity.com
Motivation
• Open up an area of query optimization techniques that rely on referential
integrity (RI) constraints semantics
• Support for informational primary key and foreign key (referential integrity)
constraints
• Not enforced by the Spark SQL engine; rather used by Catalyst to optimize
the query processing
• Targeted to applications that load and analyze data that originated from a
Data Warehouse for which the conditions for a given constraint are known
to be true
• Improvement of up to 8x for some of the TPC-DS queries
What is Data Warehouse?
• Relational database that integrates data from multiple heterogeneous sources e.g.
transactional data, files, other sources
• Designed for data modelling and analysis
• Provides information around a subject of a business e.g. product, customers,
suppliers, etc.
• The most important requirements are query performance and data simplicity
• Based on a dimensional, or Star Schema model
– Consists of a fact table referencing a number of dimension tables
– Fact table contains the main data, or measurements, of a business
– Dimension tables, usually smaller tables, describe the different characteristics, or
dimensions, of a business
TPC-DS Benchmark
• Proxy of a real organization data warehouse
• De-facto industry standard benchmark for measuring the performance of decision
support solutions such as RDBMS and Hadoop/Spark based systems
• The underlying business model is a retail product supplier e.g. retail sales, web,
catalog data, inventory, demographics, etc
• Examines large volumes of data e.g. 1TB to 100TB
• Executes SQL queries of various operational requirements and complexities e.g.
ad-hoc, reporting, data mining
Excerpt from store_sales fact table diagram:
Integrity Constraints in Data Warehouse
• Typical constraints:
– Unique
– Not Null
– Primary key and foreign key (referential integrity)
• Used for:
– Data cleanliness
– Query optimizations
§ Constraint states:
§ Enforced
§ Validated
§ Informational
• Implement powerful optimizations based on RI semantics e.g. Join Elimination
• Example using a typical user scenario: queries against views
create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as
select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity
from store_sales, date_dim, customer, item, store
where d_date_sk = ss_sold_date_sk and
c_customer_sk = ss_customer_sk and
i_item_sk = ss_item_sk and
s_store_sk = ss_store_sk and
d_year = 2002
select id, first, last, product, quantity
from customer_purchases _2002
where product like ‘bicycle%’ and
month between 1 and 2
User view: User query:
How do Optimizers use RI Constraints?
Internal optimizer query processing:
Selects only a subset
of columns from view
Join between store and
store_sales removed
based on RI analysis
Let’s make Spark do this too!
• Introduce query optimization techniques in Spark that rely on RI
semantics
§ Star schema detection/Star-join optimizations
§ Existential subquery to Inner join transformation
§ Group by push down through joins
§ Many others
About Catalyst
• Apache Spark’s Optimizer
• Queries are represented internally by trees of operators e.g. logical trees and physical trees
• Trees are manipulated by rules
• Each compiler phase applies a different set of rules
• For example, in the Logical Plan Optimization phase:
– Rules rewrite logical plans into semantically equivalent ones for better performance
– Rules perform natural heuristics:
• e.g. merge query blocks, push down predicates, remove unreferenced columns, etc.
– Earlier rules enable later rules
• e.g. merge query blocks à globaljoin reorder
Star Schema Optimizations
• Queries against star schema are expected to run fast based on RI
relationships among the tables
• In a query, star schema detection algorithm:
– Observes RI relationships based on the join predicates
– Finds the tables connected in a star-join
– Lets the Optimizer plan the star-join tables in an optimal way
• SPARK-17791 implements star schema detection based on heuristics
• Instead, use RI information to make the algorithm more robust
Star Schema Optimizations
Executionplan transformation:
select i_item_id, s_store_id, avg(ss_net_profit)as store_sales_profit,
avg(sr_net_loss) as store_returns_loss
from
store_sales, store_returns, date_dim d1, date_dim d2, store, item
where
i_item_sk = ss_item_sk and
s_store_sk = ss_store_sk and
ss_customer_sk = sr_customer_sk and
ss_item_sk = sr_item_sk and
ss_ticket_number = sr_ticket_number and
sr_returned_date_sk = d2.d_date_sk and
d1.d_moy = 4 and d1.d_year = 1998 and. . .
group by i_item_id, s_store_id
order by i_item_id, s_store_id
Simplified TPC-DS Query 25
Star schema diagram:
item
store
date_dim
store_sales store_returns
date_dim
N:1
1 : N1: N
1:N
1:N
• Query execution drops from 421 secs to 147
secs (1TB TPC-DS setup), ~ 3x improvement
Star Schema Optimizations
• TPC-DS query speedup: 2x – 8x
• By observing RI relationships among the
tables, Optimizer makes better planning
decisions
• Reduce the data early in the execution plan
• Reduce, or eliminate Sort Merge joins in
favor of more efficient Broadcast Hash joins
TPC-DS 1TB performance results with star schema detection:
Existential Subquery to Inner Join
• Applied to certain types of EXISTS/IN subqueries
• Spark uses Left Semi-join
– Returns a row of the outer if there is at least one match in the inner
– Imposes a certain order of join execution
• Instead, use more flexible Inner joins
– If the subquery produces at most one row, or by introducing a Distinct on the outer table’s
primary key to remove the duplicate rows after the join
– Allows subquery tables to be merged into the outer query block
Ø Enable global join reorder, star-schema detection, other rewrites
EXISTS with
correlated subquery
EXISTS à INNER JOIN
Existential Subquery to Inner Join
• Query execution drops from 167 secs to 22 secs
(1TB TPC-DS setup), 7x improvement
Execution plan transformation:
select cd_gender, cd_marital_status, cd_education_status, count(*) cnt1, ...
from customer c, customer_address ca, customer_demographics
where
c.c_current_addr_sk = ca.ca_address_sk and
ca_county in ('Woods County','Madison County') and
cd_demo_sk = c.c_current_cdemo_sk and
EXISTS (select *
from store_sales, date_dim
where c.c_customer_sk = ss_customer_sk and
ss_sold_date_sk = d_date_sk and
d_year = 2002 and d_moy between 4 and 4+3)
group by cd_gender, cd_marital_status, cd_education_status, …
order by cd_gender, cd_marital_status, cd_education_status, …
limit 100
Simplified TPC-DS query 10
Star schema diagram:
customer
date_dim
store_sales
customer_address
customer_demographic
N: 1
1:N
1:N
N : 1
Existential Subquery to Inner Join
• Inner joins provide better alternative plan
choices for the Optimizer
– Avoids joins with large inners, and thus
favors Broadcast Hash Joins
• Tables in the subquery are merged into
the outer query block
– Enables other rewrites
– Benefits from star schema detection
• May introduce a Distinct/HashAggregate
TPC-DS 1TB performance results with subquery to join:
Group By Push Down Through Join
• In general, applied based on the cost/selectivity estimates
• If the join is an RI join, heuristically push down Group By to the fact table
– The input to the Group By remains the same before and after the join
– The input to the join is reduced
– Overall reduction of the execution time
Aggregate functions
on fact table columns
Grouping columns
is a superset of join
columns
PK – FK joins
Group By Push Down Through Join
Execution plan transformation:
• Query execution drops from 70 secs to 30 secs (1TB
TPC-DS setup), 2x improvement
select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name,
sum(ss.ss_quantity) as store_sales_quantity
from store_sales ss, date_dim, customer, store
where d_date_sk = ss_sold_date_sk and
c_customer_sk = ss_customer_sk and
s_store_sk = ss_store_sk and
d_year between 2000and 2002
group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
order by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name
limit 100;
Star schema:
customer
date_dim
store_sales
store
N : 1
1:NN:1
More Optimizations
• RI join elimination
– Eliminates dimension tables if none of their columns, other than the PK columns,
are referenced in the query
• Redundant join elimination
– Eliminates self-joins on unique keys; self-joins may be introduced after view
expansion
• Distinct elimination
– Distinct can be removed if it is proved that the operation produces unique output
• Proving maximal cardinality
– Used by query rewrite
Informational RI Implementation in Spark
• DDL Support for constraint lifecycle
• Metadata Storage
• Constraint Maintenance
DDL Support
• Support Informational Primary Key and Foreign Key constraint
• New DDLs for create, alter, drop constraint
• Create Constraint:
– Constraints can be added as part of CREATE TABLE and ALTER TABLE
– Create table DDL supports both inline and out_of_line syntax
• Syntax is similar to Hive 2.1.0 DDL
CREATE TABLE customer (id int CONSTRAINT pk1 PRIMARY KEY RELY,
name string, address string, demo int)
CREATE TABLE customer_demographics(id int, gender string, credit_rating string,
CONSTRAINT pk1 PRIMARY KEY (id) RELY);
ALTER TABLE customer
ADD CONSTRAINT fk1 FOREIGN KEY (demo) REFERENCES customer_demographics (id) VALIDATE RELY;
Constraint States and Options
• Two states: VALIDATE and NOVALIDATE
– Specify if the existing data conforms to the constraint
• Two options: RELY and NORELY
– Specify whether a constraint is to be taken into account for query rewrite
• Used by Catalyst as follows:
– NORELY: constraint cannot be used by Catalyst
– NOVALIDATE RELY: constraint used for join order optimizations
– VALIDATE RELY: constraint used for query rewrite and join order optimizations
• Constraints are not enforced, they are informational only.
• Namespace for a constraint is at the table level
Metadata Storage
• Constraint definitions are stored in the table properties
• Constraint information is stored in JSON format
• Describe formatted includes constraint information
# ConstraintInformation
Primary Key:
ConstraintName: pk1
Key Columns: [id]
Foreign Keys:
ConstraintName: fk1
Key Columns: [demo]
Reference Table: customer_demographics
Reference Columns:[id]
spark.sql.constraint={"id":"fk1","type":"fk","keyCols”["demo"],"referenceTable":"customer_demographics",
"referenceCols":["id"],"rely":true,"validate":true}
Constraint Maintenance
• Dropping a constraint
• Validating Constraints
– Internally Spark SQL queries are run to perform the validation.
• Data cache invalidation
– Entries in the cache that reference the altered table are invalidated
ALTER TABLE customer_demographics DROP CONSTRAINTpk1 CASCADE
DROP TABLE IF EXISTS customer_demographicsCASCADE CONSTRAINT
ALTER TABLE customer VALIDATE CONSTRAINT pk1
1TB TPC-DS Performance Results
Cluster: 4-node cluster, each node having:
12 2 TB disks,
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM, 10 Gigabit Ethernet
Number of cores: 48
Apache Hadoop 2.7.3, Apache Spark 2.2 main ( Mar 31, 2017)
Database info:
Schema: TPCDS
Scale factor: 1TB total space
Storage format: Parquet with Snappy compression
TPC-DS
Query
spark-2.2
(secs)
spark-2.2-ri
(secs)
Speedup
Q06 106 19 5x
Q10 355 190 2x
Q13 296 98 3x
Q15 147 17 8x
Q16 1394 706 2x
Q17 398 146 2x
Q24 485 249 2x
Q25 421 147 2x
Q29 380 126 3x
Q35 462 285 1.5x
Q45 93 17 5x
Q69 327 173 1.8x
Q74 237 119 2x
Q85 104 42 2x
Q94 603 307 2x
Call to Action
• Read the Spec in SPARK-19842
https://issues.apache.org/jira/browse/SPARK-19842
• Prototype code is under internal review
• Watch out for the upcoming PRs
• Meet us at the IBM demo booth #407
Thank You.
Ioana Delaney ursu@us.ibm.com
Suresh Thalamati tsuresh@us.ibm.com
Visit http://spark.tc

Contenu connexe

Tendances

Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...Splunk
 
State of the Dolphin - May 2022
State of the Dolphin - May 2022State of the Dolphin - May 2022
State of the Dolphin - May 2022Frederic Descamps
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Elastic stack Presentation
Elastic stack PresentationElastic stack Presentation
Elastic stack PresentationAmr Alaa Yassen
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Node.js and the MySQL Document Store
Node.js and the MySQL Document StoreNode.js and the MySQL Document Store
Node.js and the MySQL Document StoreRui Quelhas
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 

Tendances (20)

Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
Splunk conf2014 - Lesser Known Commands in Splunk Search Processing Language ...
 
State of the Dolphin - May 2022
State of the Dolphin - May 2022State of the Dolphin - May 2022
State of the Dolphin - May 2022
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Elastic stack Presentation
Elastic stack PresentationElastic stack Presentation
Elastic stack Presentation
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Node.js and the MySQL Document Store
Node.js and the MySQL Document StoreNode.js and the MySQL Document Store
Node.js and the MySQL Document Store
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Elk
Elk Elk
Elk
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 

Similaire à Informational Referential Integrity Constraints Support in Apache Spark with Ioana Delaney and Suresh Thalamati

2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in sparkChester Chen
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
BI Apps Architecture
BI Apps ArchitectureBI Apps Architecture
BI Apps ArchitectureDylan Wan
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunk
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cRonald Francisco Vargas Quesada
 
Lightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataLightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataPavel Hardak
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Databricks
 
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...European SharePoint Conference
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuSpark Summit
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsKellyn Pot'Vin-Gorman
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesCidar Mendizabal
 
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...Datavail
 
Beginners guide to_optimizer
Beginners guide to_optimizerBeginners guide to_optimizer
Beginners guide to_optimizerMaria Colgan
 
Designing high performance datawarehouse
Designing high performance datawarehouseDesigning high performance datawarehouse
Designing high performance datawarehouseUday Kothari
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersAdam Hutson
 

Similaire à Informational Referential Integrity Constraints Support in Apache Spark with Ioana Delaney and Suresh Thalamati (20)

2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
BI Apps Architecture
BI Apps ArchitectureBI Apps Architecture
BI Apps Architecture
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Taming the shrew Power BI
Taming the shrew Power BITaming the shrew Power BI
Taming the shrew Power BI
 
SplunkLive! Advanced Session
SplunkLive! Advanced SessionSplunkLive! Advanced Session
SplunkLive! Advanced Session
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
 
Lightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataLightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional data
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
 
Enhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min QiuEnhancements on Spark SQL optimizer by Min Qiu
Enhancements on Spark SQL optimizer by Min Qiu
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Taming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI OptionsTaming the shrew, Optimizing Power BI Options
Taming the shrew, Optimizing Power BI Options
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiences
 
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...
 
Beginners guide to_optimizer
Beginners guide to_optimizerBeginners guide to_optimizer
Beginners guide to_optimizer
 
Designing high performance datawarehouse
Designing high performance datawarehouseDesigning high performance datawarehouse
Designing high performance datawarehouse
 
SQL Server 2008 Development for Programmers
SQL Server 2008 Development for ProgrammersSQL Server 2008 Development for Programmers
SQL Server 2008 Development for Programmers
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 

Dernier (20)

Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Informational Referential Integrity Constraints Support in Apache Spark with Ioana Delaney and Suresh Thalamati

  • 1. Ioana Delaney, Suresh Thalamati Spark Technology Center, IBM Informational Referential Integrity Constraints Support in Apache Spark
  • 2. About the Speakers • Ioana Delaney – Spark Technology Center, IBM – DB2 Optimizer developer working in the areas of query semantics, rewrite, and optimizer. – Worked on various releases of DB2 LUW and DB2 with BLU Acceleration – Apache Spark SQL Contributor • Suresh Thalamati – Spark Technology Center, IBM – Apache Derby Committer and PMC Member, Apache Spark Contributor – Worked on various releases of IBM BigInsights, Apache Derby, Informix Database.
  • 3. IBM Spark Technology Center • Founded in 2015 • Location: 505 Howard St., San Francisco • Web: http://spark.tc • Twitter: @apachespark_tc • Mission: – Contribute intellectual and technical capital to the Apache Spark community. – Make the core technology enterprise and cloud-ready. – Build data science skills to drive intelligence into business applications http://bigdatauniversity.com
  • 4. Motivation • Open up an area of query optimization techniques that rely on referential integrity (RI) constraints semantics • Support for informational primary key and foreign key (referential integrity) constraints • Not enforced by the Spark SQL engine; rather used by Catalyst to optimize the query processing • Targeted to applications that load and analyze data that originated from a Data Warehouse for which the conditions for a given constraint are known to be true • Improvement of up to 8x for some of the TPC-DS queries
  • 5. What is Data Warehouse? • Relational database that integrates data from multiple heterogeneous sources e.g. transactional data, files, other sources • Designed for data modelling and analysis • Provides information around a subject of a business e.g. product, customers, suppliers, etc. • The most important requirements are query performance and data simplicity • Based on a dimensional, or Star Schema model – Consists of a fact table referencing a number of dimension tables – Fact table contains the main data, or measurements, of a business – Dimension tables, usually smaller tables, describe the different characteristics, or dimensions, of a business
  • 6. TPC-DS Benchmark • Proxy of a real organization data warehouse • De-facto industry standard benchmark for measuring the performance of decision support solutions such as RDBMS and Hadoop/Spark based systems • The underlying business model is a retail product supplier e.g. retail sales, web, catalog data, inventory, demographics, etc • Examines large volumes of data e.g. 1TB to 100TB • Executes SQL queries of various operational requirements and complexities e.g. ad-hoc, reporting, data mining Excerpt from store_sales fact table diagram:
  • 7. Integrity Constraints in Data Warehouse • Typical constraints: – Unique – Not Null – Primary key and foreign key (referential integrity) • Used for: – Data cleanliness – Query optimizations § Constraint states: § Enforced § Validated § Informational
  • 8. • Implement powerful optimizations based on RI semantics e.g. Join Elimination • Example using a typical user scenario: queries against views create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity from store_sales, date_dim, customer, item, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and d_year = 2002 select id, first, last, product, quantity from customer_purchases _2002 where product like ‘bicycle%’ and month between 1 and 2 User view: User query: How do Optimizers use RI Constraints? Internal optimizer query processing: Selects only a subset of columns from view Join between store and store_sales removed based on RI analysis
  • 9. Let’s make Spark do this too! • Introduce query optimization techniques in Spark that rely on RI semantics § Star schema detection/Star-join optimizations § Existential subquery to Inner join transformation § Group by push down through joins § Many others
  • 10. About Catalyst • Apache Spark’s Optimizer • Queries are represented internally by trees of operators e.g. logical trees and physical trees • Trees are manipulated by rules • Each compiler phase applies a different set of rules • For example, in the Logical Plan Optimization phase: – Rules rewrite logical plans into semantically equivalent ones for better performance – Rules perform natural heuristics: • e.g. merge query blocks, push down predicates, remove unreferenced columns, etc. – Earlier rules enable later rules • e.g. merge query blocks à globaljoin reorder
  • 11. Star Schema Optimizations • Queries against star schema are expected to run fast based on RI relationships among the tables • In a query, star schema detection algorithm: – Observes RI relationships based on the join predicates – Finds the tables connected in a star-join – Lets the Optimizer plan the star-join tables in an optimal way • SPARK-17791 implements star schema detection based on heuristics • Instead, use RI information to make the algorithm more robust
  • 12. Star Schema Optimizations Executionplan transformation: select i_item_id, s_store_id, avg(ss_net_profit)as store_sales_profit, avg(sr_net_loss) as store_returns_loss from store_sales, store_returns, date_dim d1, date_dim d2, store, item where i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and ss_customer_sk = sr_customer_sk and ss_item_sk = sr_item_sk and ss_ticket_number = sr_ticket_number and sr_returned_date_sk = d2.d_date_sk and d1.d_moy = 4 and d1.d_year = 1998 and. . . group by i_item_id, s_store_id order by i_item_id, s_store_id Simplified TPC-DS Query 25 Star schema diagram: item store date_dim store_sales store_returns date_dim N:1 1 : N1: N 1:N 1:N • Query execution drops from 421 secs to 147 secs (1TB TPC-DS setup), ~ 3x improvement
  • 13. Star Schema Optimizations • TPC-DS query speedup: 2x – 8x • By observing RI relationships among the tables, Optimizer makes better planning decisions • Reduce the data early in the execution plan • Reduce, or eliminate Sort Merge joins in favor of more efficient Broadcast Hash joins TPC-DS 1TB performance results with star schema detection:
  • 14. Existential Subquery to Inner Join • Applied to certain types of EXISTS/IN subqueries • Spark uses Left Semi-join – Returns a row of the outer if there is at least one match in the inner – Imposes a certain order of join execution • Instead, use more flexible Inner joins – If the subquery produces at most one row, or by introducing a Distinct on the outer table’s primary key to remove the duplicate rows after the join – Allows subquery tables to be merged into the outer query block Ø Enable global join reorder, star-schema detection, other rewrites EXISTS with correlated subquery EXISTS à INNER JOIN
  • 15. Existential Subquery to Inner Join • Query execution drops from 167 secs to 22 secs (1TB TPC-DS setup), 7x improvement Execution plan transformation: select cd_gender, cd_marital_status, cd_education_status, count(*) cnt1, ... from customer c, customer_address ca, customer_demographics where c.c_current_addr_sk = ca.ca_address_sk and ca_county in ('Woods County','Madison County') and cd_demo_sk = c.c_current_cdemo_sk and EXISTS (select * from store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3) group by cd_gender, cd_marital_status, cd_education_status, … order by cd_gender, cd_marital_status, cd_education_status, … limit 100 Simplified TPC-DS query 10 Star schema diagram: customer date_dim store_sales customer_address customer_demographic N: 1 1:N 1:N N : 1
  • 16. Existential Subquery to Inner Join • Inner joins provide better alternative plan choices for the Optimizer – Avoids joins with large inners, and thus favors Broadcast Hash Joins • Tables in the subquery are merged into the outer query block – Enables other rewrites – Benefits from star schema detection • May introduce a Distinct/HashAggregate TPC-DS 1TB performance results with subquery to join:
  • 17. Group By Push Down Through Join • In general, applied based on the cost/selectivity estimates • If the join is an RI join, heuristically push down Group By to the fact table – The input to the Group By remains the same before and after the join – The input to the join is reduced – Overall reduction of the execution time Aggregate functions on fact table columns Grouping columns is a superset of join columns PK – FK joins
  • 18. Group By Push Down Through Join Execution plan transformation: • Query execution drops from 70 secs to 30 secs (1TB TPC-DS setup), 2x improvement select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, sum(ss.ss_quantity) as store_sales_quantity from store_sales ss, date_dim, customer, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and s_store_sk = ss_store_sk and d_year between 2000and 2002 group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name order by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name limit 100; Star schema: customer date_dim store_sales store N : 1 1:NN:1
  • 19. More Optimizations • RI join elimination – Eliminates dimension tables if none of their columns, other than the PK columns, are referenced in the query • Redundant join elimination – Eliminates self-joins on unique keys; self-joins may be introduced after view expansion • Distinct elimination – Distinct can be removed if it is proved that the operation produces unique output • Proving maximal cardinality – Used by query rewrite
  • 20. Informational RI Implementation in Spark • DDL Support for constraint lifecycle • Metadata Storage • Constraint Maintenance
  • 21. DDL Support • Support Informational Primary Key and Foreign Key constraint • New DDLs for create, alter, drop constraint • Create Constraint: – Constraints can be added as part of CREATE TABLE and ALTER TABLE – Create table DDL supports both inline and out_of_line syntax • Syntax is similar to Hive 2.1.0 DDL CREATE TABLE customer (id int CONSTRAINT pk1 PRIMARY KEY RELY, name string, address string, demo int) CREATE TABLE customer_demographics(id int, gender string, credit_rating string, CONSTRAINT pk1 PRIMARY KEY (id) RELY); ALTER TABLE customer ADD CONSTRAINT fk1 FOREIGN KEY (demo) REFERENCES customer_demographics (id) VALIDATE RELY;
  • 22. Constraint States and Options • Two states: VALIDATE and NOVALIDATE – Specify if the existing data conforms to the constraint • Two options: RELY and NORELY – Specify whether a constraint is to be taken into account for query rewrite • Used by Catalyst as follows: – NORELY: constraint cannot be used by Catalyst – NOVALIDATE RELY: constraint used for join order optimizations – VALIDATE RELY: constraint used for query rewrite and join order optimizations • Constraints are not enforced, they are informational only. • Namespace for a constraint is at the table level
  • 23. Metadata Storage • Constraint definitions are stored in the table properties • Constraint information is stored in JSON format • Describe formatted includes constraint information # ConstraintInformation Primary Key: ConstraintName: pk1 Key Columns: [id] Foreign Keys: ConstraintName: fk1 Key Columns: [demo] Reference Table: customer_demographics Reference Columns:[id] spark.sql.constraint={"id":"fk1","type":"fk","keyCols”["demo"],"referenceTable":"customer_demographics", "referenceCols":["id"],"rely":true,"validate":true}
  • 24. Constraint Maintenance • Dropping a constraint • Validating Constraints – Internally Spark SQL queries are run to perform the validation. • Data cache invalidation – Entries in the cache that reference the altered table are invalidated ALTER TABLE customer_demographics DROP CONSTRAINTpk1 CASCADE DROP TABLE IF EXISTS customer_demographicsCASCADE CONSTRAINT ALTER TABLE customer VALIDATE CONSTRAINT pk1
  • 25. 1TB TPC-DS Performance Results Cluster: 4-node cluster, each node having: 12 2 TB disks, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 128 GB RAM, 10 Gigabit Ethernet Number of cores: 48 Apache Hadoop 2.7.3, Apache Spark 2.2 main ( Mar 31, 2017) Database info: Schema: TPCDS Scale factor: 1TB total space Storage format: Parquet with Snappy compression TPC-DS Query spark-2.2 (secs) spark-2.2-ri (secs) Speedup Q06 106 19 5x Q10 355 190 2x Q13 296 98 3x Q15 147 17 8x Q16 1394 706 2x Q17 398 146 2x Q24 485 249 2x Q25 421 147 2x Q29 380 126 3x Q35 462 285 1.5x Q45 93 17 5x Q69 327 173 1.8x Q74 237 119 2x Q85 104 42 2x Q94 603 307 2x
  • 26. Call to Action • Read the Spec in SPARK-19842 https://issues.apache.org/jira/browse/SPARK-19842 • Prototype code is under internal review • Watch out for the upcoming PRs • Meet us at the IBM demo booth #407
  • 27. Thank You. Ioana Delaney ursu@us.ibm.com Suresh Thalamati tsuresh@us.ibm.com Visit http://spark.tc