Joins in a distributed world
@LucianPrecup
@dmconf Barcelona 2015 #dmconf15
2015-11-21
whoami
• CTO of Adelean (http://adelean.com/)
• Integrate
– search, nosql and big data technologies
• to support
– ETL, BI, data mining, data processing and data
visualization
• use cases
2015-11-21 2@LucianPrecup #dmconf15 Barcelona
Poll - How many of you …
• Use a NoSQL database within your current
application / information system?
• Which NoSQL database?
• Use a relational database within your current
application / information system?
• Which RDBMS?
• Store the data into more than one system?
• Distribute data on several servers?
• Use an ORM framework (Hibernate, Doctrine, …)?
Distributed databases
• Issues
– CAP theorem
– Two phase commit
– Distributed Transactions
Objective
• Show what distributed databases can and cannot
do with joins, and why
• Give YOU the ability to implement joins in
your applications (your service layers)
Joins
• A join combines the output from two sources
• A join condition – the relationship between
the sources
• Join trees
The "Query Optimizer"
SELECT DISTINCT offer_status FROM offer;
≡
SELECT offer_status FROM offer GROUP BY offer_status;

SELECT A.ID AS ID1, A.X, B.ID AS ID2, B.Y
FROM A LEFT OUTER JOIN B
ON A.ID = B.ID
WHERE B.ID IS NOT NULL;
≡
SELECT A.ID AS ID1, A.X, B.ID AS ID2, B.Y
FROM A INNER JOIN B
ON A.ID = B.ID;
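A quick way to check the two equivalences is to run both forms of each query against a tiny in-memory database. The sketch below uses Python's built-in sqlite3 module; the table contents are invented for illustration, and the outer-join filter is written against B.ID rather than the ID2 alias, since not every engine accepts a select alias in WHERE.

```python
import sqlite3


def rows(conn, sql):
    """Run a query and return its rows sorted, so row order does not matter."""
    return sorted(conn.execute(sql).fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offer (id INTEGER, offer_status TEXT);
    INSERT INTO offer VALUES (1, 'open'), (2, 'closed'), (3, 'open');
    CREATE TABLE A (ID INTEGER, X TEXT);
    CREATE TABLE B (ID INTEGER, Y TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO B VALUES (1, 'b1'), (3, 'b3'), (4, 'b4');
""")

# First pair: DISTINCT versus GROUP BY.
distinct_rows = rows(conn, "SELECT DISTINCT offer_status FROM offer")
group_by_rows = rows(conn, "SELECT offer_status FROM offer GROUP BY offer_status")

# Second pair: filtered left outer join versus inner join.
outer_filtered = rows(conn, """
    SELECT A.ID AS ID1, A.X, B.ID AS ID2, B.Y
    FROM A LEFT OUTER JOIN B ON A.ID = B.ID
    WHERE B.ID IS NOT NULL""")
inner_rows = rows(conn, """
    SELECT A.ID AS ID1, A.X, B.ID AS ID2, B.Y
    FROM A INNER JOIN B ON A.ID = B.ID""")

print(distinct_rows == group_by_rows)
print(outer_filtered == inner_rows)
```

A query optimizer is free to pick whichever form of each pair is cheaper to execute, because both return the same rows.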
The query optimizer - decisions
• Access paths
– The optimizer must choose an access path to retrieve data from each table in
the join statement. For example, choose between a full table scan or an index
scan.
• Join methods
– To join each pair of row sources, the database must decide how to do it:
nested loop, sort merge, hash joins. Each join method has specific situations in
which it is more suitable than the others.
• Join types
– The join condition determines the join type: an inner join retrieves only rows
that match the join condition, while an outer join also retrieves rows that do
not match the join condition.
• Join order
– To execute a statement that joins more than two tables, the database joins
two tables and then joins the resulting row source to the next table and so on.
• Calculating the cost of a query plan
– Based on metrics : I/Os (single block I/O, multiblock I/Os), estimated CPU,
functions and expressions
Join methods
• Nested Loops Joins
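As a minimal sketch of the idea (not any particular database's implementation), a nested loops join scans the inner input once for every row of the outer input. The row layout and the names `persons`, `contracts`, `pid` and `owner` are made up for the example.

```python
def nested_loop_join(outer_rows, inner_rows, outer_key, inner_key):
    """For every outer row, scan all inner rows and keep the matching pairs."""
    result = []
    for outer in outer_rows:
        for inner in inner_rows:
            if outer[outer_key] == inner[inner_key]:
                result.append({**outer, **inner})
    return result


persons = [{"pid": 1, "name": "Ada"}, {"pid": 2, "name": "Bob"}]
contracts = [
    {"cid": 10, "owner": 1},
    {"cid": 11, "owner": 1},
    {"cid": 12, "owner": 3},
]
joined = nested_loop_join(persons, contracts, "pid", "owner")
print(joined)
```

The cost grows with the product of the two input sizes, which is why this method suits a small outer input or an inner input that can be reached through an index.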
Join methods
• Hash Joins
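A hash join can be sketched in two phases: build an in-memory hash table on one input, then probe it while reading the other input once. The code below is an illustrative sketch with invented sample rows; a real engine would also handle the case where the build side does not fit in memory.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Build a hash table on one input (ideally the smaller), then stream the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # The probe side can be any iterable, including a stream read only once.
    for probe in probe_rows:
        for match in table.get(probe[probe_key], []):
            yield {**match, **probe}


persons = [{"pid": 1, "name": "Ada"}, {"pid": 2, "name": "Bob"}]
contracts = [
    {"cid": 10, "owner": 1},
    {"cid": 11, "owner": 1},
    {"cid": 12, "owner": 3},
]
joined = list(hash_join(persons, contracts, "pid", "owner"))
print(joined)
```

Each input is read once, so the cost is roughly the sum of the input sizes rather than their product, at the price of keeping the build side in memory.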
Join methods
• Sort Merge Joins
[Figure: sort merge join of two sorted inputs, (1, 3, 7, 9) and (1, 2, 2, 7, 8, 9)]
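The two key sequences from this slide, 1 3 7 9 and 1 2 2 7 8 9, can be joined with a sort merge join as sketched below. This is an illustrative version that joins bare keys only; the function name is an assumption, and a real implementation would carry full rows along with the keys.

```python
def sort_merge_join(left, right):
    """Equijoin two lists of keys by sorting both and advancing two cursors."""
    left, right = sorted(left), sorted(right)
    i = j = 0
    pairs = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            key = left[i]
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end] == key:
                j_end += 1
            # Emit one pair per combination of the two runs (their cross product).
            for _ in range((i_end - i) * (j_end - j)):
                pairs.append((key, key))
            i, j = i_end, j_end
    return pairs


matches = sort_merge_join([1, 3, 7, 9], [1, 2, 2, 7, 8, 9])
print(matches)
```

Once both inputs are sorted, each is read only once, which makes this method attractive when the sources can already deliver their rows in key order.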
Join types
• Inner Joins
– Equijoins
– nonequijoins
• Outer Joins
– Nested Loop Outer Joins
– Hash Join Outer Joins
– Sort Merge Outer Joins
– Full Outer Joins
– Multiple Tables on the Left of an Outer Join
• Semijoins
– ~ IN or EXISTS clause
• Antijoins
– ~NOT IN or NOT EXISTS
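The difference between these join types is easiest to see on small inputs. The sketch below uses simple nested scans for clarity, not speed; the rows are invented, and the function names are illustrative rather than taken from any library.

```python
def inner_join(left, right, key):
    """Only the pairs of rows that match on the key."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def left_outer_join(left, right, key):
    """All matching pairs, plus every unmatched left row padded with None."""
    result = []
    for l in left:
        matches = [r for r in right if l[key] == r[key]]
        if matches:
            result.extend((l, r) for r in matches)
        else:
            result.append((l, None))
    return result


def semi_join(left, right, key):
    """Left rows with at least one match (~ IN / EXISTS); never duplicated."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


def anti_join(left, right, key):
    """Left rows with no match at all (~ NOT EXISTS)."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]


left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 1, "tag": "x"}, {"id": 1, "tag": "y"}, {"id": 3, "tag": "z"}]

print(len(inner_join(left, right, "id")))
print(len(left_outer_join(left, right, "id")))
print(semi_join(left, right, "id"))
print(anti_join(left, right, "id"))
```

Note that the semijoin returns id 1 once even though it matches two right rows, whereas the inner join returns it twice.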
Optimizing tricks
• Push filters as low as possible in the execution tree
• Cache small tables into memory
• Calculate statistics on each data source and use them
• Think use cases
• “Fortunately, many of the lessons learned over the
years for optimizing and executing join queries can be
directly translated to the distributed world.” (see Nicolas
Bruno et al. in the references below)
Push filters
• Before: Filter 1 applied on top of Join (Table 1, Table 2)
• After: Join' applied to Filter 1' (Table 1) and Filter 1'' (Table 2)
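The rewrite on this slide can be sketched as follows, assuming the filter is a conjunction of one predicate on each table so that it can be split and applied before the join. The tables, the predicates `is_spanish` and `is_large`, and the helper `join` are all invented for the example.

```python
def join(left, right, key):
    """Simple hash-based equijoin used by both plans."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


persons = [
    {"pid": 1, "country": "ES"},
    {"pid": 2, "country": "FR"},
    {"pid": 3, "country": "ES"},
]
contracts = [
    {"pid": 1, "amount": 50},
    {"pid": 1, "amount": 500},
    {"pid": 2, "amount": 900},
    {"pid": 3, "amount": 200},
]


def is_spanish(row):
    return row["country"] == "ES"


def is_large(row):
    return row["amount"] > 100


# Plan 1: join everything, then filter the joined rows.
late = [row for row in join(persons, contracts, "pid")
        if is_spanish(row) and is_large(row)]

# Plan 2: push each half of the filter below the join.
early = join([p for p in persons if is_spanish(p)],
             [c for c in contracts if is_large(c)],
             "pid")

print(late == early)
```

Both plans return the same rows, but the second one joins 2 persons with 3 contracts instead of 3 with 4; in a distributed setting the saving is mostly in the data that has to cross the network.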
Take advantage of small tables
• Before: Join (Table 1, Table 2)
• After: Table 1, filtered with data from Table 2
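When one table is small enough to be cached in memory, the join can be turned into a simple filter over the large table. The sketch below assumes only the keys of the small table are needed; `logs` and `vip_customers` are hypothetical data sets.

```python
def filter_with_small_table(big_rows, small_rows, key):
    """Cache the small table's keys in memory, then filter the big table in one pass."""
    small_keys = {row[key] for row in small_rows}
    return (row for row in big_rows if row[key] in small_keys)


logs = [
    {"customer_id": 1, "event": "login"},
    {"customer_id": 2, "event": "login"},
    {"customer_id": 1, "event": "logout"},
    {"customer_id": 9, "event": "login"},
]
vip_customers = [{"customer_id": 1}, {"customer_id": 9}]

vip_events = list(filter_with_small_table(logs, vip_customers, "customer_id"))
print(vip_events)
```

In a cluster, the same idea means shipping the small table to every node that holds a part of the large one, so the large table never has to move.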
Use statistics
• Plan A: Join 1 (Table 1, Table 2), then Join 2 with Table 3
• Plan B: Join 3 (Table 3, Table 2), then Join 4 with Table 1
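One way to read this slide: statistics let the optimizer compare join orders by the size of the intermediate result each one produces. The sketch below assumes per-key frequency statistics are available for the join columns; the column values are invented, and real optimizers usually work from coarser estimates such as histograms.

```python
from collections import Counter


def equijoin_size(left_keys, right_keys):
    """Number of rows an equijoin produces, computed from per-key frequencies."""
    left_counts, right_counts = Counter(left_keys), Counter(right_keys)
    return sum(count * right_counts[key] for key, count in left_counts.items())


# Join-key columns only. Table 2 joins Table 1 on x and Table 3 on y.
t1_x = [1, 1, 1, 2]
t2_x = [1, 1, 2]
t2_y = [10, 11, 10]
t3_y = [10]

size_t1_t2 = equijoin_size(t1_x, t2_x)  # intermediate result if we start with Table 1
size_t3_t2 = equijoin_size(t3_y, t2_y)  # intermediate result if we start with Table 3

first_join = ("Table 1 with Table 2" if size_t1_t2 <= size_t3_t2
              else "Table 3 with Table 2")
print(size_t1_t2, size_t3_t2, first_join)
```

Starting with the join that yields the smaller intermediate result keeps the second join cheaper, which matters even more when intermediate results must be redistributed between nodes.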
Other join optimisations (technical)
• Join algorithms
– Different ways to implement a logical join operator : nested-loop joins,
sort-based joins, hash-based joins.
– Auxiliary data structures: secondary indexes, join indexes, bitmap
indexes, bloom filters (e.g., indexed loop join, indexed sort-merge join,
and distributed semi-join).
• Bloom Filters
– especially useful when the amount of memory needed to store the
filter is small relative to the amount of data in the data set, and when
most data is expected to fail the membership test.
• Partition-Wise Joins
– A partition-wise join is a join optimization that divides a large join of
two tables, one of which must be partitioned on the join key, into
several smaller joins.
• Full partition-wise join : Both tables must be equipartitioned on their join keys.
We can then divide a large join into smaller joins between two partitions.
• Partial partition-wise joins : Only one table is partitioned on the join key. The
other table may or may not be partitioned.
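A Bloom filter built on the join keys of one side lets the other side discard most non-matching rows before the real join runs. The sketch below is a minimal, unoptimized filter; `size_bits` and `num_hashes` are arbitrary illustrative values, and the hashing scheme (seeded SHA-256) is just one convenient choice.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit slot, for readability

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


left_keys = [1, 3, 7, 9]
right_keys = [1, 2, 2, 7, 8, 9]

bloom = BloomFilter()
for key in left_keys:
    bloom.add(key)

# Cheap pre-filter: may let a few non-matching keys through, never drops a match.
candidates = [k for k in right_keys if bloom.might_contain(k)]
# Exact check on the survivors gives the true join result.
matches = [k for k in candidates if k in set(left_keys)]
print(matches)
```

The filter is much smaller than the key set it summarizes, so it is cheap to send to the node holding the other table; only the surviving candidates then need to travel for the exact join.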
The "Query Optimizer"
SQL/RDBMS: Power to the DBA
NoSQL
The "Query Optimizer"
SQL/RDBMS Power to the DBA NoSQL Power to the developer
The "Query Optimizer"
SQL/RDBMS NoSQL / Distributed / Multi-Model
Distributing data
• Clusters of nodes
• Partitioning the data (sharding)
• Partition keys (random or functional)
• Joining heterogeneous systems
– Persons (Oracle), Contracts (SQL Server), Logs (Elasticsearch)
Partitions (shards)
• "Random" partitioning
• "Functional" routing key → colocated joins
[Diagram: shards Data A, Data B, Data C; Customers by customer_id and Contracts by customer_id placed on the same shards]
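With a functional routing key, Customers and Contracts that share a customer_id land on the same shard, so each shard can join its own data without talking to the others. The sketch below simulates this with lists standing in for shards; the CRC32-based routing function and the sample rows are assumptions for illustration.

```python
import zlib


def shard_for(customer_id, shard_count):
    """Stable routing: the same customer_id always maps to the same shard."""
    return zlib.crc32(str(customer_id).encode()) % shard_count


def partition(rows, shard_count):
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[shard_for(row["customer_id"], shard_count)].append(row)
    return shards


def colocated_join(customer_shards, contract_shards):
    """Join shard by shard; no row ever needs to leave its shard."""
    result = []
    for customers, contracts in zip(customer_shards, contract_shards):
        by_id = {c["customer_id"]: c for c in customers}
        for contract in contracts:
            customer = by_id.get(contract["customer_id"])
            if customer is not None:
                result.append({**customer, **contract})
    return result


customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(1, 5)]
contracts = [
    {"contract_id": 100, "customer_id": 1},
    {"contract_id": 101, "customer_id": 1},
    {"contract_id": 102, "customer_id": 3},
    {"contract_id": 103, "customer_id": 5},  # no such customer
]

joined = colocated_join(partition(customers, 3), partition(contracts, 3))
print(sorted(row["contract_id"] for row in joined))
```

With random partitioning the same join would require moving one of the two data sets between nodes first, because matching rows could sit on any shard.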
The join graph topology
• Partitioning schemes: a partition function (e.g., hash
partitioning, range partitioning, random partitioning, and
custom partitioning), a partition key, and a partition count.
• Data exchange operators: initial partitioning,
repartitioning, full merge, partial repartitioning and partial
merge
• Merging schemes: random merge, sort merge, concat-
merge and sort concat-merge.
• Distribution policies: include distribution with duplication
and distribution without duplication
Joins from a functional perspective
• Think use cases : Model and distribute the data according
to the queries you are expecting
– Enrich a table with data from lookup tables, I need objects of
type A filtered by criteria on properties of type B
• Loosen the generality
– I only want star schemas, I only want to bulk load data at night
and query it all day, I only want to run a few really expensive
queries not millions of tiny ones
• Polyglot persistence : store your data in multiple ways
– Graph databases for relations, document store for business
objects, key value for lookup properties
• Services and micro-services : you access the data through
the services layer
– Then you can implement the joins in many ways
The service layer
• Seamless integration for applications
• Caches the data model and exposes a business
model (services fit for business needs)
[Diagram: Application 1, Application 2, … → Service Layer → Customers, Contracts, …]
Implementing your own joins
• According to use cases and capacities of each module, joins
can be implemented at different levels
• Components: Source 1, Source 2, Batch (init or delta), JMS Queue, NoSQL, Service Layer, Front End
• Real time data injection
• Options for where the join runs:
– The datasource executes the join
– "GET" calls to the second data source (~Nested Loops Join)
– Parallel reading of the two datasources and join done by the Batch (~Sort Merge Join)
– Read one source, hash it, then stream the other one (~Hash Join)
– Join executed by the NoSQL database (e.g. Parent/Child in Elasticsearch)
– Join implemented by the Service layer at query time (~"My Own Custom Join")
Joins with NoSQL databases
Normalized database Document
{"film" : {
"id" : "183070",
"title" : "The Artist",
"published" : "2011-10-12",
"genre" : ["Romance", "Drama", "Comedy"],
"language" : ["English", "French"],
"persons" : [
{"person" : { "id" : "5079", "name" : "Michel Hazanavicius", "role" : "director" }},
{"person" : { "id" : "84145", "name" : "Jean Dujardin", "role" : "actor" }},
{"person" : { "id" : "24485", "name" : "Bérénice Bejo", "role" : "actor" }},
{"person" : { "id" : "4204", "name" : "John Goodman", "role" : "actor" }}
]
}}
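Producing such a document amounts to executing the join once, at indexing time, instead of at every query. The sketch below assumes a normalized layout with a film table, a person table and an association table carrying the role; that layout and all names are hypothetical, since the slide shows only the resulting document.

```python
films = [{"id": "183070", "title": "The Artist", "published": "2011-10-12"}]
persons = [
    {"id": "5079", "name": "Michel Hazanavicius"},
    {"id": "84145", "name": "Jean Dujardin"},
]
# Association table: which person has which role in which film.
film_persons = [
    {"film_id": "183070", "person_id": "5079", "role": "director"},
    {"film_id": "183070", "person_id": "84145", "role": "actor"},
]


def build_film_documents(films, persons, film_persons):
    """Join the three tables once and nest the persons inside each film."""
    persons_by_id = {p["id"]: p for p in persons}
    documents = []
    for film in films:
        doc = dict(film)
        doc["persons"] = [
            {"person": {
                "id": link["person_id"],
                "name": persons_by_id[link["person_id"]]["name"],
                "role": link["role"],
            }}
            for link in film_persons
            if link["film_id"] == film["id"]
        ]
        documents.append({"film": doc})
    return documents


documents = build_film_documents(films, persons, film_persons)
print(documents[0]["film"]["persons"])
```

The price is duplication: a person who appears in several films is copied into each film document and must be updated in all of them.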
SQL vs. NoSQL : the issue with joins :-)
• Let’s say you have two relational entities: Persons
and Contracts
– A Person has zero, one or more Contracts
– A Contract is attached to one or more Persons (e.g. the
Subscriber, the Grantee, …)
• Need search services:
– S1: getPersonsDetailsByContractProperties
– S2: getContractsDetailsByPersonProperties
• Simple solution with SQL:
SELECT P.* FROM P, C WHERE P.id = C.pid AND C.a = 'A'
SELECT C.* FROM P, C WHERE P.id = C.pid AND P.a = 'A'
The issue with joins - solutions
• Solution 1
– Store Persons with Contracts together for S1
{"person" : { "details" : …, … , "contracts" : [{"contract" : {"id" : 1, …}}, …] }}
– Store Contracts with Persons together for S2
{"contract" : { "details" : …, …, "persons" : [{"person" : {"id" : 1, "role" : "S", …}}, …]}}
• Issues with solution 1:
– A lot of data duplication
– Have to get Contracts when indexing Persons and vice-versa
• Solution 2
– Use the joins provided by the NoSQL system (e.g. Elasticsearch’s Parent/Child)
• Issues with solution 2:
– Works in one direction but not the other (only one parent for n children, a 1 to n relationship)
• Solution 3
– Store Persons and Contracts separately
– Launch two NoSQL queries and join the results in your application to get the final response
– For S1 : First get all Contract ids by Contract properties, then get Persons by Contract ids (terms
query or mget)
– For S2 : First get all Person ids by Person properties, then get Contracts by Person ids (terms
query or mget)
– The response to the second query can be returned “as is” to the client (pagination, etc.)
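Solution 3 for S1 can be sketched as two lookups chained in the service layer. The code below uses in-memory lists as stand-ins for the two stores, and `search` is a hypothetical helper, not an Elasticsearch API; it also assumes each Person document carries the ids of its Contracts.

```python
CONTRACTS = [
    {"id": "c1", "a": "A"},
    {"id": "c2", "a": "B"},
    {"id": "c3", "a": "A"},
]
PERSONS = [
    {"id": "p1", "name": "Ada", "contract_ids": ["c1"]},
    {"id": "p2", "name": "Bob", "contract_ids": ["c2"]},
    {"id": "p3", "name": "Cy", "contract_ids": ["c2", "c3"]},
]


def search(store, predicate):
    """Hypothetical stand-in for a NoSQL query: the documents matching predicate."""
    return [doc for doc in store if predicate(doc)]


def get_persons_details_by_contract_properties(contract_property_a):
    # Query 1: ids of the Contracts that match the Contract-side criteria.
    contract_ids = {
        c["id"] for c in search(CONTRACTS, lambda c: c["a"] == contract_property_a)
    }
    # Query 2: Persons referencing any of those ids (the role of a terms query).
    return search(PERSONS, lambda p: bool(contract_ids & set(p["contract_ids"])))


result = get_persons_details_by_contract_properties("A")
print([person["id"] for person in result])
```

S2 is symmetric: collect Person ids first, then fetch the Contracts that reference them. The first query should return only ids, since its result size drives the cost of the second one.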
Optimizing tricks. Distributed.
• Model the data according to use cases
• Duplicate the data if different modeling
schemas are needed
• Loosen the generality
• Choose a good partitioning key
• Colocate joins as much as possible. The less
redistribution the better.
• Implement your own joins :-)
References
• Back to the future : SQL 92 for Elasticsearch? - Lucian Precup - NoSQL Matters Dublin 2014 (https://2014.nosql-matters.org/dub/wp-content/uploads/2014/09/lucian_precup_back_to_the_future_sql_92_for_elasticsearch.pdf)
• Oracle Database Online Documentation - Database SQL Tuning Guide (https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm#TGSQL242)
• Advanced Join Strategies for Large-Scale Distributed Computation - Nicolas Bruno, YongChul Kwon, Ming-Chuan Wu - VLDB (http://www.vldb.org/pvldb/vol7/p1484-bruno.pdf)
• “Distributed joins are hard to scale” - Interview with Dwight Merriman by Roberto V. Zicari (http://www.odbms.org/blog/2011/02/distributed-joins-are-hard-to-scale-interview-with-dwight-merriman/)
• Joins and aggregations in a distributed NoSQL DB - Max Neunhöffer - NoSQL Matters Dublin 2014 (https://2014.nosql-matters.org/dub/wp-content/uploads/2014/09/NeunhoefferDublin.pdf)
Joins in a distributed world Distributed Matters Barcelona 2015

  • 1. Joins in a distributed world @LucianPrecup @dmconf Barcelona 2015 #dmconf15 2015-11-21
  • 2. whoami • CTO of Adelean (http://adelean.com/) • Integrate – search, nosql and big data technologies • to support – ETL, BI, data mining, data processing and data visualization • use cases 2015-11-21 2@LucianPrecup #dmconf15 Barcelona
  • 3. Poll - How many of you … • Use a NoSQL database within your current application / information system? • Which NoSQL database? • Use a relational database within your current application / information system? • Which RDBMS? • Store the data into more than one system? • Distribute data on several servers? • Use an ORM framework (hibernate, doctrine, …) ? 2015-11-21 3@LucianPrecup #dmconf15 Barcelona
  • 4. Distributed databases • Issues – CAP theorem – Two phase commit – Distributed Transactions 2015-11-21 4@LucianPrecup #dmconf15 Barcelona
  • 5. Objective • Show what distributed databases are able or not able to do about joins and why • Give YOU the ability to implement joins in your applications (your service layers) 2015-11-21 5@LucianPrecup #dmconf15 Barcelona
  • 6. Joins • A join combines the output from two sources • A join condition – the relationship between the sources • Join trees 2015-11-21 6@LucianPrecup #dmconf15 Barcelona
  • 7. The "Query Optimizer" SELECT DISTINCT offer_status FROM offer; SELECT offer_status FROM offer GROUP by offer_status; ≡ SELECT A.ID as ID1, A.X, B.ID as ID2, B.Y FROM A LEFT OUTER JOIN B ON A.ID = B.ID WHERE ID2 IS NOT NULL SELECT A.ID as ID1, A.X, B.ID as ID2, B.Y FROM A INNER JOIN B ON A.ID = B.ID ≡ 2015-11-21 7@LucianPrecup #dmconf15 Barcelona
  • 8. The query optimizer - decisions • Access paths – The optimizer must choose an access path to retrieve data from each table in the join statement. For example, choose between a full table scan or an index scan. • Join methods – To join each pair of row sources, the database must decide how to do it: nested loop, sort merge, hash joins. Each join method has specific situations in which it is more suitable than the others. • Join types – The join condition determines the join type: an inner join retrieves only rows that match the join condition, an outer join retrieves rows that do not match the join condition. • Join order – To execute a statement that joins more than two tables, the database joins two tables and then joins the resulting row source to the next table and so on. • Calculating the cost of a query plan – Based on metrics : I/Os (single block I/O, multiblock I/Os), estimated CPU, functions and expressions 2015-11-21 8@LucianPrecup #dmconf15 Barcelona
  • 9. Join methods • Nested Loops Joins
  • 10. Join methods • Hash Joins
  • 11. Join methods • Sort Merge Joins [Figure: two sorted inputs, 1 3 7 9 and 1 2 2 7 8 9]
  • 12. Join types • Inner Joins – Equijoins – Nonequijoins • Outer Joins – Nested Loop Outer Joins – Hash Join Outer Joins – Sort Merge Outer Joins – Full Outer Joins – Multiple Tables on the Left of an Outer Join • Semijoins – ~ IN or EXISTS clause • Antijoins – ~ NOT IN or NOT EXISTS
  • 13. Optimizing tricks • Push filters as low as possible in the execution tree • Cache small tables in memory • Calculate statistics on each data source and use them • Think use cases • “Fortunately, many of the lessons learned over the years for optimizing and executing join queries can be directly translated to the distributed world.” (see Nicolas Bruno et al. in the references below)
  • 14. Push filters [Figure: before, Filter 1 is applied on top of the Join of Table 1 and Table 2; after, Filter 1' and Filter 1'' are applied to the tables below Join']
  • 15. Take advantage of small tables [Figure: the Join of Table 1 and Table 2 becomes Table 1 with a filter built from the data of Table 2]
  • 16. Use statistics [Figure: two join orders over Table 1, Table 2 and Table 3: Join 1 of Table 1 and Table 2 followed by Join 2 with Table 3, or Join 3 of Table 3 and Table 2 followed by Join 4 with Table 1]
  • 17. Other join optimisations (technical) • Join algorithms – Different ways to implement a logical join operator : nested-loop joins, sort-based joins, hash-based joins. – Auxiliary data structures: secondary indexes, join indexes, bitmap indexes, bloom filters (e.g., indexed loop join, indexed sort-merge join, and distributed semi-join). • Bloom Filters – especially useful when the amount of memory needed to store the filter is small relative to the amount of data in the data set, and when most data is expected to fail the membership test. • Partition-Wise Joins – A partition-wise join is a join optimization that divides a large join of two tables, one of which must be partitioned on the join key, into several smaller joins. • Full partition-wise join : Both tables must be equipartitioned on their join keys. We can then divide a large join into smaller joins between two partitions. • Partial partition-wise joins : Only one table is partitioned on the join key. The other table may or may not be partitioned.
  • 18. The "Query Optimizer" [Figure: SQL/RDBMS (power to the DBA) versus NoSQL]
  • 19. The "Query Optimizer" [Figure: SQL/RDBMS (power to the DBA) versus NoSQL (power to the developer)]
  • 20. The "Query Optimizer" [Figure: SQL/RDBMS versus NoSQL / Distributed / Multi-Model]
  • 21. Distributing data • Clusters of nodes • Partitioning the data (sharding) • Partition keys (random or functional) • Joining heterogeneous systems [Figure: Persons (Oracle), Contracts (SQL Server), Logs (Elasticsearch)]
  • 22. Partitions (shards) • "Random" partitioning • "Functional" routing key → colocated joins [Figure: nodes Data A, Data B and Data C holding Customers and Contracts, both partitioned by customer_id]
  • 23. The join graph topology • Partitioning schemes: a partition function (e.g., hash partitioning, range partitioning, random partitioning, and custom partitioning), a partition key, and a partition count. • Data exchange operators: initial partitioning, repartitioning, full merge, partial repartitioning and partial merge. • Merging schemes: random merge, sort merge, concat-merge and sort concat-merge. • Distribution policies: distribution with duplication and distribution without duplication.
  • 24. Joins from a functional perspective • Think use cases : Model and distribute the data according to the queries you are expecting – Enrich a table with data from lookup tables, I need objects of type A filtered by criteria on properties of type B • Loosen the generality – I only want star schemas, I only want to bulk load data at night and query it all day, I only want to run a few really expensive queries not millions of tiny ones • Polyglot persistence : store your data in multiple ways – Graph databases for relations, document store for business objects, key value for lookup properties • Services and micro-services : you access the data through the services layer – Then you can implement the joins in many ways
  • 25. The service layer • Seamless integration for applications • Caches the data model and exposes a business model (services fit for business needs) [Figure: Application 1, Application 2, … on top of the Service Layer, which sits on top of Customers, Contracts, …]
  • 26. Implementing your own joins • According to the use cases and the capabilities of each module, joins can be implemented at different levels [Figure: architecture with Source 1, Source 2, Batch (init or delta), JMS Queue (real time data injection), NoSQL, Service Layer and Front End] – The datasource executes the join – Join executed by the NoSQL database (~ e.g. Parent/Child in Elasticsearch) – Join implemented by the Service layer at query time (~ "My Own Custom Join") – "GET" calls to the second data source (~ Nested Loops Join) – Parallel reading of the two datasources and join done by the Batch (~ Sort Merge Join) – Read one source, hash it, then stream the other one (~ Hash Join)
  • 27. Joins with NoSQL databases [Figure: normalized database] Document: {"film" : { "id" : "183070", "title" : "The Artist", "published" : "2011-10-12", "genre" : ["Romance", "Drama", "Comedy"], "language" : ["English", "French"], "persons" : [ {"person" : { "id" : "5079", "name" : "Michel Hazanavicius", "role" : "director" }}, {"person" : { "id" : "84145", "name" : "Jean Dujardin", "role" : "actor" }}, {"person" : { "id" : "24485", "name" : "Bérénice Bejo", "role" : "actor" }}, {"person" : { "id" : "4204", "name" : "John Goodman", "role" : "actor" }} ] }}
  • 28. SQL vs. NoSQL : the issue with joins :-) • Let’s say you have two relational entities: Persons and Contracts – A Person has zero, one or more Contracts – A Contract is attached to one or more Persons (e.g. the Subscriber, the Grantee, …) • Two search services are needed: – S1: getPersonsDetailsByContractProperties – S2: getContractsDetailsByPersonProperties • Simple solution with SQL: SELECT P.* FROM P, C WHERE P.id = C.pid AND C.a = 'A' SELECT C.* FROM P, C WHERE P.id = C.pid AND P.a = 'A'
  • 29. The issue with joins - solutions • Solution 1 – Store Persons with Contracts together for S1 {"person" : { "details" : …, … , "contracts" : [{"contract" : {"id" : 1, …}}, …] }} – Store Contracts with Persons together for S2 {"contract" : { "details" : …, …, "persons" : [{"person" : {"id" : 1, "role" : "S", …}}, …] }} • Issues with solution 1: – A lot of data duplication – Have to get Contracts when indexing Persons and vice versa • Solution 2 – Use the joins provided by the NoSQL system (e.g. Elasticsearch’s Parent/Child) • Issues with solution 2: – Works in one direction but not the other (only one parent for n children, a 1 to n relationship) • Solution 3 – Store Persons and Contracts separately – Launch two NoSQL queries and join the results in your application to get the final response – For S1: first get all Contract ids by Contract properties, then get Persons by Contract ids (terms query or mget) – For S2: first get all Person ids by Person properties, then get Contracts by Person ids (terms query or mget) – The response to the second query can be returned “as is” to the client (pagination, etc.)
  • 30. Optimizing tricks. Distributed. • Model the data according to use cases • Duplicate the data if different modeling schemas are needed • Loosen the generality • Choose a good partitioning key • Colocate joins as much as possible. The less redistribution the better. • Implement your own joins :-)
  • 31. References • Back to the future : SQL 92 for Elasticsearch? - Lucian Precup - NoSQL Matters Dublin 2014 (https://2014.nosql-matters.org/dub/wp-content/uploads/2014/09/lucian_precup_back_to_the_future_sql_92_for_elasticsearch.pdf) • Oracle Database Online Documentation - Database SQL Tuning Guide (https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm#TGSQL242) • Advanced Join Strategies for Large-Scale Distributed Computation - Nicolas Bruno, YongChul Kwon, Ming-Chuan Wu - VLDB (http://www.vldb.org/pvldb/vol7/p1484-bruno.pdf) • “Distributed joins are hard to scale” - Interview with Dwight Merriman by Roberto V. Zicari (http://www.odbms.org/blog/2011/02/distributed-joins-are-hard-to-scale-interview-with-dwight-merriman/) • Joins and aggregations in a distributed NoSQL DB - Max Neunhöffer - NoSQL Matters Dublin 2014 (https://2014.nosql-matters.org/dub/wp-content/uploads/2014/09/NeunhoefferDublin.pdf)
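
Slides 9 to 11 name the three classic join methods without showing them. The following is a minimal Python sketch of each, working on in-memory lists of dicts. The function names and the row layout are assumptions made for illustration, not taken from the talk, and the sample data reads the numbers on slide 11 as two sorted inputs.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """For every left row, scan all right rows: len(left) * len(right) comparisons."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(build, probe, key):
    """Hash the (ideally smaller) build side, then stream the probe side once."""
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)
    return [(b, p) for p in probe for b in table.get(p[key], [])]


def sort_merge_join(left, right, key):
    """Sort both inputs on the key, then advance two cursors in step."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole run.
            while i < len(left) and left[i][key] == lk:
                out.extend((left[i], right[r]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out


# Sample inputs (assumed reading of the numbers on slide 11).
a = [{"id": k} for k in (1, 3, 7, 9)]
b = [{"id": k} for k in (1, 2, 2, 7, 8, 9)]
```

All three functions return the same pairs for the same inputs; they differ in cost. The nested loop suits a small outer side with cheap lookups, the hash join needs the build side to fit in memory, and the sort merge join pays for sorting unless the inputs already arrive ordered.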
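
Solution 3 on slide 29 joins two separately stored collections inside the application by chaining two queries. The sketch below illustrates service S1 under loud assumptions: the two lists stand in for the NoSQL collections, each person carries a hypothetical `contract_ids` field, and the helper names are invented for this example. In a real system the two lookups would be queries against the store (the slide mentions a terms query or mget).

```python
# Hypothetical in-memory stand-ins for two separately stored collections.
CONTRACTS = [
    {"id": 1, "a": "A"},
    {"id": 2, "a": "B"},
    {"id": 3, "a": "A"},
]
PERSONS = [
    {"id": 10, "name": "Ada", "contract_ids": [1]},
    {"id": 11, "name": "Bob", "contract_ids": [2]},
    {"id": 12, "name": "Cy", "contract_ids": [2, 3]},
    {"id": 13, "name": "Dee", "contract_ids": []},
]


def find_contract_ids(**properties):
    """First query: ids of the contracts whose fields equal the given properties."""
    return {
        c["id"]
        for c in CONTRACTS
        if all(c.get(k) == v for k, v in properties.items())
    }


def find_persons_by_contract_ids(contract_ids):
    """Second query: persons referencing any of the ids (stands in for a terms query)."""
    return [p for p in PERSONS if contract_ids.intersection(p["contract_ids"])]


def get_persons_details_by_contract_properties(**properties):
    """S1: chain the two queries; the second result is returned as is."""
    ids = find_contract_ids(**properties)
    if not ids:
        return []  # nothing matched, so the second query can be skipped
    return find_persons_by_contract_ids(ids)


matches = get_persons_details_by_contract_properties(a="A")
```

S2 would be the mirror image: collect Person ids by Person properties first, then fetch Contracts by those ids. The size of the intermediate id set is the practical limit of this approach, since it travels from the first query to the second.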

Editor's notes

  1. TODO: left outer join with is not null === join
  2. A Bloom filter, named after its creator Burton Bloom, is a low-memory data structure that tests membership in a set. A Bloom filter correctly indicates when an element is not in a set, but can incorrectly indicate when an element is in a set. Thus, false negatives are impossible but false positives are possible.
  3. The *famous* Query Optimizer versus *the* developer
  4. The *famous* Query Optimizer versus *the* developer
  5. The *famous* Query Optimizer versus *the* developer
  6. Figure 2: Different types of data exchange topologies (from left to right: initial partitioning, repartitioning, full merge, partial repartitioning, and partial merge). In a distributed environment, an additional dimension is introduced into the join taxonomy: the join graph topology. Graph topologies specify how different partitions of data are processed in a distributed way, and are affected by the following factors: • Partitioning schemes, which characterize how data is partitioned in the system. Partitioning schemes consist of a partition function (e.g., hash partitioning, range partitioning, random partitioning, and custom partitioning), a partition key, and a partition count. • Data exchange operators, which modify the partitioning scheme of a data set, and include initial partitioning, repartitioning, full merge, partial repartitioning and partial merge. • Merging schemes, which modify data exchange operators by ensuring certain additional intra-partition properties (e.g., order), and include random merge, sort merge, concat-merge and sort concat-merge. • Distribution policies, which dictate whether partitions can be duplicated to multiple execution nodes, and include distribution with duplication and distribution without duplication.
  7. Take away this slide !
  8. TODO: review this. Talk about the service layer.
  9. Joins are not (only) where you would expect them
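
Note 2 describes the Bloom filter, which slide 17 lists as an auxiliary structure for distributed semi-joins. The sketch below shows the idea in Python under stated assumptions: the class, its parameters (`size_bits`, `num_hashes`) and the `bloom_semi_join` helper are illustrative names, and the filter stores one byte per bit for readability instead of packing bits.

```python
import hashlib


class BloomFilter:
    """Set-membership test with possible false positives and no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def bloom_semi_join(small_keys, big_rows, key):
    """Keep the rows of big_rows whose key appears in small_keys."""
    bloom = BloomFilter()
    for k in small_keys:
        bloom.add(k)
    wanted = set(small_keys)
    # Cheap pre-filter (the part that could be shipped to a remote node) ...
    candidates = [row for row in big_rows if bloom.might_contain(row[key])]
    # ... followed by an exact check that discards any false positives.
    return [row for row in candidates if row[key] in wanted]


rows = [{"customer_id": i} for i in range(100)]
kept = bloom_semi_join([3, 42, 77], rows, "customer_id")
```

The filter pays off when it is much smaller than the key set it summarizes and most rows fail the test: only the surviving candidates have to cross the network before the exact join.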