Physical Design for Non-Relational Data Systems

Michael Mior • University of Waterloo

Proper design and configuration of
data systems is critical for achieving
good performance

Many tools exist for relational
database design optimization
Source: https://www.databasejournal.com/features/mssql/article.php/10894_3523616_2/Index-Tuning-Wizard.htm
https://dev.mysql.com/doc/mysql-monitor/4.0/en/mem-qanal-using-ui.html
Microsoft AutoAdmin (1998)
DB2 Design Advisor (2004)
Oracle SQL Tuning (2004)

We want applications to be up 24/7
We're frequently dealing with changing
data or with unstructured data
We require sub-second responses to queries
Source: Mike Loukides, VP Content Strategy, O’Reilly Media
Relational databases are not
always sufficient for these uses

“Over 30 years, we've learned how to
write business intelligence
applications on top of relational
databases -- there are patterns. With
NoSQL today, we have no cookie
cutters. We don't have any blueprints.”
--Ravi Krishnappa, NetApp solutions architect
Source: TechTarget, 2015

• NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications

Schema Design Best Practices
• Model column families around query patterns
    But start your design with entities and relationships, if you can
• De-normalize and duplicate for read performance
    But don’t de-normalize if you don’t need to
• Leverage wide rows for ordering, grouping, and filtering
    But don’t go too wide
Source: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
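As a hedged illustration of the first and third guidelines, the sketch below models one denormalized column family in plain Python for the query "posts a user has commented on, ordered by post date". The name `posts_by_commenter`, the helper functions, and the "Atlantis" sample row are assumptions made up for illustration; a dict of sorted lists only stands in for a wide row and is not Cassandra API.

```python
from bisect import insort

# Hypothetical column family: partition key user_id, rows ordered by post_date.
# Post data is duplicated per commenter so the query needs a single lookup.
posts_by_commenter = {}

def record_comment(user_id, post_date, post_id, title):
    """Insert a row, keeping each partition sorted by (post_date, post_id)."""
    partition = posts_by_commenter.setdefault(user_id, [])
    insort(partition, (post_date, post_id, title))

def posts_commented_on(user_id):
    """Answer the query with one partition read, already in date order."""
    return list(posts_by_commenter.get(user_id, []))

record_comment(1, "2017-04-05", 6, "Stargate")
record_comment(1, "2017-03-01", 4, "Atlantis")   # made-up sample row
record_comment(2, "2017-04-05", 6, "Stargate")

rows = posts_commented_on(1)
```

The price of the single lookup is visible in the data: the "Stargate" post is stored once per commenter, which is the duplication the "But" lines warn about.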

NoSQL Application Development
Requirements · Implementation · Data Model
App Logic · DB Access
NoSE [MSAL, ICDE ‘16] [MSAL, TKDE ‘17]

Database Design Example
Comment: com_id, com_date, text
User: user_id, nickname
Post: post_id, post_date, title

Database Design Example
Query: Find information on all posts a user has commented on, ordered by post date
SELECT p.post_id, p.title
FROM users u
JOIN comments c ON u.user_id = c.user_id
JOIN posts p ON p.post_id = c.post_id
WHERE u.user_id = ?
ORDER BY p.post_date

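Database Design Example
Column families (key → values):
user_id → nickname, comment_id
post_id → title, post_date
comment_id → post_id, post_date, nickname
nickname → title, post_date
[Figure: two alternative executions of the query over these column families, labeled Execution A and Execution B]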

NoSE
Input: Workload, Data Model
1. Candidate Enumeration
2. Query Planning
3. Design Optimization
4. Plan Recommendation
Output: Database Design, Query Plans

Database Design Optimization
NoSE considers all
possible query plans
and picks the one
with minimum
expected cost
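The selection step described here can be sketched for a single query as follows. This is a minimal sketch, not NoSE's implementation: the plan names, column family names, and cost numbers are all made up, and it ignores storage limits and interactions between queries that an optimizer working over a whole workload would have to account for.

```python
# Hypothetical candidate plans for one query. Each plan is a list of
# (column_family, estimated_cost) lookup steps; all numbers are invented.
candidate_plans = {
    "A": [("users_by_id", 1.0), ("comments_by_id", 3.0), ("posts_by_id", 2.0)],
    "B": [("posts_by_commenter", 4.0)],
}

def plan_cost(steps):
    """Expected cost of a plan is the sum of the costs of its steps."""
    return sum(cost for _, cost in steps)

def cheapest_plan(plans):
    """Return (name, steps) of the plan with minimum expected cost."""
    return min(plans.items(), key=lambda item: plan_cost(item[1]))

best_name, best_steps = cheapest_plan(candidate_plans)
```

With these invented costs the single denormalized lookup (plan B, cost 4.0) beats the three-step plan (plan A, cost 6.0).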

Evaluation
Overall workload performance
improves by 5x

Existing NoSQL designs are a black box
Physical (JSON):
{user_id: 1, post_date: "2017-04-05", com_id: 3, …}
{user_id: 2, post_date: "2017-04-05", com_id: 7, …}
{post_id: 6, com_date: "2017-04-03", com_id: 3, user_id: 1, …}
{post_id: 6, com_date: "2017-04-01", com_id: 7, user_id: 2, …}
Logical: ?
Recovering Logical Schemas
• Extract the structure of existing data
• Discover dependencies
• Produce a logical model of the database
Removes redundancy implied by both functional and inclusion dependencies
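The "discover dependencies" step can be illustrated with a naive check for single-attribute functional dependencies over a set of records. This is a sketch under stated assumptions: the function names are hypothetical, the records reuse the `user_comments` sample values that appear elsewhere in the deck, and nothing here is taken from the tool the slides describe.

```python
from itertools import permutations

def holds(records, lhs, rhs):
    """Return True if every value of lhs maps to exactly one value of rhs."""
    seen = {}
    for rec in records:
        if lhs not in rec or rhs not in rec:
            return False
        if seen.setdefault(rec[lhs], rec[rhs]) != rec[rhs]:
            return False
    return True

def single_attribute_fds(records):
    """List all (lhs, rhs) pairs where lhs -> rhs holds in the records."""
    attrs = sorted(set().union(*(rec.keys() for rec in records)))
    return [(a, b) for a, b in permutations(attrs, 2) if holds(records, a, b)]

user_comments = [
    {"user_id": 1, "post_date": "2017-04-05", "com_id": 3, "post_id": 6, "title": "Stargate"},
    {"user_id": 2, "post_date": "2017-04-05", "com_id": 7, "post_id": 6, "title": "Stargate"},
]
fds = single_attribute_fds(user_comments)
```

On two records the result also contains accidental dependencies such as `title -> post_id`; a dependency found this way is only a candidate until it is checked against far more data.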

user_comments
{░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, …}
{░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, …}
comments_by_date
{░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, ░░░░░░░: ░, …}
{░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, ░░░░░░░: ░, …}
We want to go from raw data
to a logical model
Comment
User
Post
[MS, ER ‘18] (to appear)

user_comments
user_id  post_date   com_id  post_id  title
1        2017-04-05  3       6        Stargate
2        2017-04-05  7       6        Stargate
Data on the same logical entity appears multiple times

user_comments
user_id  com_id  post_id
1        3       6
2        7       6
posts
post_date   post_id  title
2017-04-05  6        Stargate
Post data can be (logically) extracted to normalize
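The extraction shown in these two tables can be sketched as a decomposition on the dependency post_id → post_date, title. The function name `extract` is hypothetical and the dependency is assumed to be known already; this is a minimal sketch rather than the method from the cited paper.

```python
def extract(records, key, dependents):
    """Split records on the dependency key -> dependents.

    Returns (remaining, extracted). The extracted relation holds one row
    per key value; the remaining rows keep the key as a reference.
    """
    extracted = {}
    remaining = []
    for rec in records:
        extracted[rec[key]] = {key: rec[key], **{a: rec[a] for a in dependents}}
        remaining.append({a: v for a, v in rec.items() if a not in dependents})
    return remaining, list(extracted.values())

user_comments = [
    {"user_id": 1, "post_date": "2017-04-05", "com_id": 3, "post_id": 6, "title": "Stargate"},
    {"user_id": 2, "post_date": "2017-04-05", "com_id": 7, "post_id": 6, "title": "Stargate"},
]
remaining, posts = extract(user_comments, "post_id", ["post_date", "title"])
```

The two duplicated copies of the post collapse into a single `posts` row, while `remaining` keeps `post_id` so the original rows can be rebuilt with a join.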

Relations after normalization (key; other attributes):
user_comments_user(user_id)
user_comments_post(post_id; post_date, title)
comments_by_date_post(post_id)
comments_by_date_com(com_id; com_date, text)
comments_by_date_user(user_id; nickname)

Merged relations (key; other attributes):
posts(post_id; post_date, title)
comments(com_id; com_date, text)
users(user_id; nickname)
This is the original logical model!
Entities: Comment, User, Post
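One way to picture how relations such as user_comments_post and comments_by_date_post end up as a single posts relation is to group them by key and take the union of their attributes. The sketch below assumes that equal key names mean the same entity, which is a simplification made for illustration; the deck attributes this step to inclusion dependencies, which this code does not check. The function name `merge_by_key` is hypothetical.

```python
# Relations as (name, key, other attributes), taken from the slides.
relations = [
    ("user_comments_user", "user_id", []),
    ("user_comments_post", "post_id", ["post_date", "title"]),
    ("comments_by_date_post", "post_id", []),
    ("comments_by_date_com", "com_id", ["com_date", "text"]),
    ("comments_by_date_user", "user_id", ["nickname"]),
]

def merge_by_key(relations):
    """Group relations by key and union their non-key attributes."""
    merged = {}
    for _name, key, attrs in relations:
        bucket = merged.setdefault(key, [])
        for attr in attrs:
            if attr not in bucket:
                bucket.append(attr)
    return merged

logical_model = merge_by_key(relations)
```

The three resulting groups correspond to users, posts, and comments.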

Apache Spark Model
▸ A series of lazy transformations, followed by actions that force evaluation of the transformations they depend on
▸ Each step produces a resilient distributed dataset (RDD)
▸ Intermediate results can be cached in memory or on disk, optionally serialized
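The three points above can be mimicked in a few lines of plain Python. `LazyDataset` is a toy class invented for this sketch and is not the Spark API: `map` stands in for a transformation, `collect` for an action, and `persist` for caching; the `evaluations` counter exists only to make recomputation visible.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records a computation, runs it on demand."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._cached = None
        self._persist = False
        self.evaluations = 0      # how many times this dataset was computed

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def persist(self):
        # Mark the dataset so its first computed result is kept.
        self._persist = True
        return self

    def collect(self):
        # Action: forces evaluation of this dataset and everything upstream.
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

source = LazyDataset(lambda: [1, 2, 3])
doubled = source.map(lambda x: x * 2)
cached = source.map(lambda x: x * 2).persist()

before = doubled.evaluations      # still 0: map alone computes nothing
doubled.collect()
doubled.collect()                 # recomputed on every action
cached.collect()
cached.collect()                  # second call served from the cache
```

Without `persist`, each action recomputes the whole chain, which is why reuse of a dataset matters for the caching decision.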

Spark Caching Best Practices
Caching is very useful for applications that re-use an RDD multiple times.
Caching all of the generated RDDs is not a good strategy…
…deciding which ones to cache may be challenging.
Source: https://unraveldata.com/to-cache-or-not-to-cache/

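PageRank Example
var rankGraph = graph.outerJoinVertices(...).map(...)
var prevRankGraph = rankGraph
var iteration = 0
while (iteration < numIter) {
  rankGraph.persist()
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  // Join the new ranks in and cache the result for the next iteration
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    .persist()
  // Action: forces the new graph to be materialized
  rankGraph.edges.foreachPartition(...)
  // The previous iteration's graph is no longer needed
  prevRankGraph.unpersist()
  iteration += 1
}
rankGraph.vertices.values.sum()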

Transformations
The transformations in the PageRank example:
.outerJoinVertices(...).map(...)
.aggregateMessages(...)
.outerJoinVertices(rankUpdates)

Actions
The actions in the PageRank example, which force evaluation:
rankGraph.edges.foreachPartition(...)
rankGraph.vertices.values.sum()

PageRank RDDs
Some RDDs are used more than once

Spark Model Caching
The explicit caching calls in the PageRank example:
rankGraph.persist()
.persist()

ReSpark
var iteration = 0
whileLoop(sc, iteration < numIter, {
  rankGraph.persist()
  …
  .persist()
})

ReSpark
var iteration = 0
whileLoop(sc, iteration < numIter, {
  …
})
rankGraph: 0
rankGraph: 2 Persist!
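The counter on these slides (rankGraph: 0, then rankGraph: 2 with "Persist!") suggests a rule of the form "cache a dataset once it is used more than once". The sketch below shows that rule in isolation. It is an assumption about the idea, not ReSpark's actual mechanism; the function name, the threshold parameter, and the list of reads are invented for illustration.

```python
from collections import Counter

def datasets_to_persist(uses, threshold=2):
    """Return names of datasets read at least `threshold` times.

    `uses` is a sequence of dataset names, one entry per read.
    """
    counts = Counter(uses)
    return sorted(name for name, n in counts.items() if n >= threshold)

# Reads in one loop iteration of the PageRank example: rankGraph feeds both
# aggregateMessages and outerJoinVertices, rankUpdates is read once.
reads = ["rankGraph", "rankGraph", "rankUpdates"]
to_persist = datasets_to_persist(reads)
```

Here only `rankGraph` crosses the threshold, matching the "rankGraph: 2 Persist!" annotation.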

PageRank on ReSpark
Without any caching, many jobs take hours!
