This document discusses physical database design for non-relational data systems. It begins by stating that proper data system design and configuration is critical for performance. While tools exist to optimize relational databases, non-relational systems are increasingly important as applications require high availability, flexibility with changing data, and fast query response times. Relational databases are not always suitable for these uses. The document then outlines approaches for non-relational database design optimization, understanding existing non-relational designs, and optimizing big data applications.
2. Proper design and configuration of data systems is critical for achieving good performance
3. Many tools exist for relational database design optimization
• Microsoft AutoAdmin (1998)
• DB2 Design Advisor (2004)
• Oracle SQL Tuning (2004)
Source: https://www.databasejournal.com/features/mssql/article.php/10894_3523616_2/Index-Tuning-Wizard.htm
https://dev.mysql.com/doc/mysql-monitor/4.0/en/mem-qanal-using-ui.html
4. • We want applications to be up 24/7
• We're frequently dealing with changing data or with unstructured data
• We require sub-second responses to queries
Relational databases are not always sufficient for these uses
Source: Mike Loukides, VP Content Strategy, O’Reilly Media
5. “Over 30 years, we've learned how to write business intelligence applications on top of relational databases -- there are patterns. With NoSQL today, we have no cookie cutters. We don't have any blueprints.”
--Ravi Krishnappa, NetApp solutions architect
Source: TechTarget, 2015
6. • NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications
8. Schema Design Best Practices
• Model column families around query patterns
  But start your design with entities and relationships, if you can
• De-normalize and duplicate for read performance
  But don’t de-normalize if you don’t need to
• Leverage wide rows for ordering, grouping, and filtering
  But don’t go too wide
Source: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
11. Database Design Example
Query: Find information on all posts a user has commented on, ordered by post date
    SELECT p.post_id, p.post_title
    FROM users u
    JOIN comments c ON u.user_id = c.user_id
    JOIN posts p ON p.post_id = c.post_id
    ORDER BY p.post_date
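One way to read "model column families around query patterns" for this query is to precompute one wide row per user, ordered by post date, so the query becomes a single lookup with no joins. The sketch below uses plain Python dictionaries in place of a real column store. The sample rows, the `posts_by_commenter` name, and both helper functions are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical normalized data (illustrative values only).
posts = [
    {"post_id": 6, "post_title": "Stargate", "post_date": "2017-04-05"},
    {"post_id": 9, "post_title": "Farscape", "post_date": "2017-03-01"},
]
comments = [
    {"com_id": 3, "user_id": 1, "post_id": 6},
    {"com_id": 7, "user_id": 2, "post_id": 6},
    {"com_id": 8, "user_id": 1, "post_id": 9},
]


def build_posts_by_commenter(posts, comments):
    """Denormalize into one wide row per user, keyed by (post_date, post_id)."""
    post_index = {p["post_id"]: p for p in posts}
    rows = defaultdict(dict)
    for c in comments:
        p = post_index[c["post_id"]]
        # The post title is duplicated into every commenter's row.
        rows[c["user_id"]][(p["post_date"], p["post_id"])] = p["post_title"]
    return rows


def posts_commented_on(rows, user_id):
    """Answer the query from a single row, sorted by the clustering key."""
    row = rows.get(user_id, {})
    return [(post_id, title) for (_date, post_id), title in sorted(row.items())]


posts_by_commenter = build_posts_by_commenter(posts, comments)
print(posts_commented_on(posts_by_commenter, 1))
```

The trade-off is the one named on the best-practices slide: reads get cheaper, but every post title is now stored once per commenter and must be updated in each copy.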
18. Recovering Logical Schemas
• Extract the structure of existing data
• Discover dependencies
• Produce a logical model of the database
Removes redundancy implied by both functional and inclusion dependencies
19. We want to go from raw data to a logical model
Raw data:
user_comments
{░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …}
{░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …}
comments_by_date
{░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …}
{░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …}
Logical model: Comment, User, Post
[MS, ER ‘18] (to appear)
21. Post data can be (logically) extracted to normalize
user_comments
user_id  com_id  post_id
1        3       6
2        7       6
posts
post_date   post_id  title
2017-04-05  6        Stargate
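The extraction step can be sketched as follows: check that a functional dependency such as post_id → (post_date, title) holds in the raw rows, then split those attributes into their own table. This is a minimal illustration and not the algorithm from the cited work. The `raw_rows` input is a hypothetical denormalized form of the tables above, and `holds_fd` and `extract` are names invented for this example.

```python
# Hypothetical denormalized rows; values mirror the example tables.
raw_rows = [
    {"user_id": 1, "com_id": 3, "post_id": 6,
     "post_date": "2017-04-05", "title": "Stargate"},
    {"user_id": 2, "com_id": 7, "post_id": 6,
     "post_date": "2017-04-05", "title": "Stargate"},
]


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The first value seen for a key must match every later one.
        if seen.setdefault(key, value) != value:
            return False
    return True


def extract(rows, lhs, rhs):
    """Split rows into a table keyed by lhs and a remainder without rhs."""
    if not holds_fd(rows, lhs, rhs):
        raise ValueError("dependency does not hold; extraction would lose data")
    extracted = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        extracted[key] = {a: row[a] for a in list(lhs) + list(rhs)}
    remainder = [{a: v for a, v in row.items() if a not in rhs} for row in rows]
    return list(extracted.values()), remainder


posts, user_comments = extract(raw_rows, ["post_id"], ["post_date", "title"])
print(posts)
print(user_comments)
```

Because the dependency holds, the duplicated post attributes collapse into a single `posts` row while `user_comments` keeps only the `post_id` reference.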
24. • NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications
25. Apache Spark Model
▸ A series of lazy transformations, followed by actions that force evaluation of all pending transformations
▸ Each step produces a resilient distributed dataset (RDD)
▸ Intermediate results can be cached in memory or on disk, optionally serialized
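The lazy model can be imitated in a few lines of plain Python. The `LazyDataset` class below is a toy stand-in written for this example and is not Spark's API: transformations only record what to do, actions trigger the computation, and `persist()` keeps a result so that later actions do not recompute it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: computes only when an action is called."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._cached = None
        self._persist = False
        self.evaluations = 0      # how many times this step was computed

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):            # transformation: nothing runs yet
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):       # transformation: nothing runs yet
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def persist(self):            # mark this step's result to be kept
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

    def collect(self):            # action: forces evaluation
        return self._materialize()

    def sum(self):                # action: forces evaluation
        return sum(self._materialize())


squares = LazyDataset.from_list(range(5)).map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
lazy_before_action = squares.evaluations   # no action has run yet

evens.collect()
evens.sum()
uncached_evaluations = squares.evaluations  # recomputed once per action

cached = LazyDataset.from_list(range(5)).map(lambda x: x * x).persist()
cached_evens = cached.filter(lambda x: x % 2 == 0)
cached_evens.collect()
cached_evens.sum()
cached_evaluations = cached.evaluations     # computed once, then reused
```

Without `persist()`, each action walks the whole chain of transformations again, which is why re-used intermediate results are worth caching.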
26. Spark Caching Best Practices
Caching is very useful for applications that re-use an RDD multiple times.
Caching all of the generated RDDs is not a good strategy…
…deciding which ones to cache may be challenging.
Source: https://unraveldata.com/to-cache-or-not-to-cache/
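A naive starting point for that decision is to cache only the datasets that have more than one direct consumer, that is, the points where the lineage branches. The sketch below assumes a hypothetical lineage graph and action list. It ignores dataset size, recomputation cost, and memory limits, which are exactly what make the real decision challenging.

```python
from collections import Counter

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
lineage = {
    "raw": [],
    "parsed": ["raw"],
    "ranks": ["parsed"],
    "top10": ["ranks"],
    "total": ["ranks"],
}
actions = ["top10", "total"]  # datasets on which an action is invoked


def cache_candidates(lineage, actions):
    """Return datasets used directly more than once by actions or children."""
    needed = set()

    def mark(name):
        # Collect every dataset reachable from an action.
        if name not in needed:
            needed.add(name)
            for parent in lineage[name]:
                mark(parent)

    for name in actions:
        mark(name)

    uses = Counter(actions)  # each action counts as one direct use
    for name in needed:
        for parent in lineage[name]:
            uses[parent] += 1
    return sorted(name for name, count in uses.items() if count > 1)


print(cache_candidates(lineage, actions))
```

Here only `ranks` is suggested: once it is cached, `parsed` and `raw` are never recomputed, so caching them as well would only use memory.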
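27. PageRank Example
    // Simplified sketch; (...) marks arguments elided on the slide
    var rankGraph = graph.outerJoinVertices(...).map(...)
    var prevRankGraph = rankGraph
    var iteration = 0
    while (iteration < numIter) {
      rankGraph.persist()
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates).persist()
      // Action: forces evaluation so the new graph is materialized in the cache
      rankGraph.edges.foreachPartition(...)
      // The previous iteration's graph is no longer needed
      prevRankGraph.unpersist()
      iteration += 1
    }
    rankGraph.vertices.values.sum()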
28. Transformations
The transformations in the PageRank example:
    graph.outerJoinVertices(...).map(...)
    rankGraph.aggregateMessages(...)
    rankGraph.outerJoinVertices(rankUpdates)
29. Actions
The actions in the PageRank example:
    rankGraph.edges.foreachPartition(...)
    rankGraph.vertices.values.sum()