2. WHAT IS A DBA?
DEVELOPMENT
• Capacity Planning
• Database Design
• Database Implementation
• Migration
OPERATIONS
• Installation
• Configuration
• Monitoring
• Security and Access Management
• Troubleshooting
• Backup and Recovery
3. DATABASE THEORY
THE STUDY OF DATABASES AND DATA MANAGEMENT SYSTEMS
• Finite Model Theory
• Database Design Theory
• Dependency Theory
• Concurrency Control
• Deductive Databases
• Temporal and Spatial Databases
• Uncertain Data
• Query Languages and Logic
4. DATA MODELING
TURNING BUSINESS REQUIREMENTS INTO DATA ROADMAPS
5. REASONS FOR MODELING DATA
WHAT?
• Provide a definition of our data
• Provide a format for our data
WHY?
• Compatibility of data
• Lower cost to build, operate, and
maintain systems
6. THREE KINDS OF DATA MODEL
INSTANCES
• Conceptual Data Model
• Logical (External) Data Model
• Physical Data Model
7. CONCEPTUAL MODEL
• Entities that comprise your data
• Creating data objects
• Identifying any relationships between objects
• "Business Requirements"
8. PROJECT SCOPE
MY BUSINESS REQUIREMENTS
• I have a lot of video games
• I want a simple way to be able to find my video games by keywords
• And keep track of what system they are for
• And keep track of when I last played them and when someone else played them
• And keep track of if I beat them, and my kids too
10. LOGICAL MODEL – FLAT MODEL
Game Title     | System  | Liz Last Play | Pat Last Play | Liz Complete | Pat Complete | Keywords
FFX            | PS2     | 2016-05-01    | 2016-06-04    | Yes          | No           | fantasy, jrpg
Chrono Trigger | PS1     | 2014-07-05    |               | Yes          | No           | jrpg
Forza 4        | Xbox360 |               | 2017-03-02    | No           | No           | racing
12. RELATIONAL MODEL
• I have a system
• I have a game
• I have a player
• Each game has one system, each system can have many games
• Games can have many players, each player can have additional information
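A minimal sketch of the first of those relationships, using Python's built-in sqlite3 as a stand-in for whatever RDBMS you pick (table and column names here are illustrative, not prescribed by the talk):

```python
import sqlite3

# Each game references exactly one system; each system can have many games.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE system (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE game (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    system_id INTEGER NOT NULL REFERENCES system(id))""")
conn.execute("INSERT INTO system (id, name) VALUES (1, 'PS2'), (2, 'Xbox360')")
conn.execute("INSERT INTO game (title, system_id) VALUES ('FFX', 1), ('Forza 4', 2)")

# A join resolves the one-to-many relationship back into readable rows.
rows = conn.execute("""SELECT game.title, system.name
                       FROM game JOIN system ON game.system_id = system.id
                       ORDER BY game.title""").fetchall()
print(rows)  # [('FFX', 'PS2'), ('Forza 4', 'Xbox360')]
```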
16. DOCUMENT DATABASES
• Schemaless
• Good Performance
• Speedy and Distributed
• Consistency model is BASE
• Graph Databases are Document Databases with relationships added for traversal
18. DATA WAREHOUSES
• A place to aggregate and store data for reporting and analysis
• ETL
– Extract
– Transform
– Load
• Data Mart (single subject area)
• OLAP (Online Analytical Processing)
• OLTP (Online Transaction Processing)
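A toy ETL pass might look like this (the `orders` source table and `daily_sales` warehouse table are hypothetical, and sqlite3 stands in for both the real source and warehouse systems):

```python
import sqlite3

# Source OLTP system with detailed, current rows.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, placed_on TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1250, "2017-03-01"), (2, 900, "2017-03-01"), (3, 400, "2017-03-02")])

# Warehouse holding aggregated facts for reporting.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total_dollars REAL)")

# Extract: pull raw rows out of the source system.
rows = src.execute("SELECT placed_on, amount_cents FROM orders").fetchall()

# Transform: aggregate per day and convert cents to dollars.
totals = {}
for day, cents in rows:
    totals[day] = totals.get(day, 0) + cents
transformed = [(day, cents / 100.0) for day, cents in sorted(totals.items())]

# Load: write the summarized facts into the warehouse.
dw.executemany("INSERT INTO daily_sales VALUES (?, ?)", transformed)
print(dw.execute("SELECT * FROM daily_sales").fetchall())
# [('2017-03-01', 21.5), ('2017-03-02', 4.0)]
```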
21. CHOOSE… WISELY
• Politics will factor into this!
• You don't have to pick just one
• Choose the right solution for the right problem
• With so much available in cloud services and the ease of using containers, spinning up
a lightweight Redis instance to use in addition to your PostgreSQL server costs very
little extra!
23. NO MORE ANOMALIES
• Update Anomaly
• Insertion Anomaly
• Deletion Anomaly
• Fidelity Anomaly
24. NO DUPLICATED DATA
MINIMIZE REDESIGN ON EXTENSION
• Store all data in only one place
• What happens if I add an additional family member I want to track in my application
• The normalized version makes this simple
25. FIRST NORMAL FORM
1NF
• Has a Primary Key – can be a COMPOUND key
• Has only atomic values
• Has no repeated columns
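For example, the comma-separated keywords column in the flat model violates atomicity. A hedged sketch of the 1NF fix, using sqlite3 with illustrative names:

```python
import sqlite3

# The comma-separated "keywords" column is not atomic,
# so it moves into its own child table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("""CREATE TABLE game_keyword (
    game_id INTEGER REFERENCES game(id),
    keyword TEXT,
    PRIMARY KEY (game_id, keyword))""")  # compound key, one atomic value per row

flat_row = (1, "FFX", "fantasy, jrpg")   # the pre-1NF shape
game_id, title, keywords = flat_row
conn.execute("INSERT INTO game VALUES (?, ?)", (game_id, title))
conn.executemany("INSERT INTO game_keyword VALUES (?, ?)",
                 [(game_id, k.strip()) for k in keywords.split(",")])

kw = conn.execute(
    "SELECT keyword FROM game_keyword WHERE game_id = 1 ORDER BY keyword").fetchall()
print(kw)  # [('fantasy',), ('jrpg',)]
```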
28. BUT WAIT – THERE'S MORE!
• 7 more to be exact
• They're not really that useful in most situations
• You can learn about them from Wikipedia!
29. DENORMALIZATION
• Wait – didn't you just say to normalize things?
• Usually has one purpose – increased performance – and should be used sparingly
• Doesn't have to be "full" denormalization
– Storing count totals of many elements in a relationship
– star schema "fact-dimension" models
– prebuilt summarizations
30. RELATIONSHIPS
CARDINALITY BETWEEN ALL THE THINGS
35. PHYSICS MATTERS
• Make sure you have enough hardware
• Tune your I/O
– Block and Stripe size allocation for RAID configuration
– Transaction logs in the right spot
– Frequently joined tables on separate discs
• Tune your network protocols
• Adjust cache sizes
36. UPDATE ALL THE THINGS
• Update your operating system
• Update your db software
• Update your communications protocols
37. TUNE YOUR SYSTEMS
• Check your vendor for configuration tuning
• Perform your recommended maintenance tasks
38. PROFILE YOUR CODE
• Check for slow queries
• Check the execution plan on the queries
• Add Indexes to speed up joins
• Rewrite or alter queries to make them perform faster
• Create views for common queries that can be indexed separately
– This is best for common joins
• Move routines for data manipulation into stored procedures
• Create cached or denormalized versions of really slow queries
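Checking an execution plan is engine-specific; here is what it looks like in SQLite with EXPLAIN QUERY PLAN (other engines have their own EXPLAIN syntax, and the exact plan text varies by SQLite version):

```python
import sqlite3

# Inspect the plan for a query before and after adding an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE play_log (game_id INTEGER, played_on TEXT)")

query = "SELECT * FROM play_log WHERE game_id = 42"
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)  # detail column typically reads 'SCAN play_log' (a full table scan)

conn.execute("CREATE INDEX idx_play_log_game ON play_log(game_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)   # now a SEARCH ... USING INDEX idx_play_log_game (game_id=?)
```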
40. REFERENTIAL INTEGRITY
REFACTORING
• Add constraints
• Remove constraints
• Add Hard Delete
• Add Soft Delete
• Add Trigger for Calculated Column
• Add Trigger for History
• Add Indexes
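As one example from that list, a history trigger in SQLite might look like this (names are illustrative, and trigger syntax differs between vendors):

```python
import sqlite3

# A trigger that records the old value into a history table on every update.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE game_history (game_id INTEGER, old_title TEXT,
                           changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TRIGGER game_update_history AFTER UPDATE ON game
BEGIN
    INSERT INTO game_history (game_id, old_title) VALUES (OLD.id, OLD.title);
END;
""")
conn.execute("INSERT INTO game VALUES (1, 'FFX')")
conn.execute("UPDATE game SET title = 'Final Fantasy X' WHERE id = 1")

history = conn.execute("SELECT game_id, old_title FROM game_history").fetchall()
print(history)  # [(1, 'FFX')]
```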
41. DATA QUALITY REFACTORING
• Add lookup table
• Apply Standard codes
• Apply Standard Type
• Add a column constraint
• Introduce common format
42. STRUCTURAL REFACTORING
• Add a new element
• Delete an existing element
• Merge elements
• Change association types
• Split elements
43. ARCHITECTURE REFACTORING
• Replace a method with a view
• Add a calculation method
• Encapsulate a table with a view
• Add a mirror table
• Add a read only table
44. LEARNING MORE
• Free University Courses
– Databases are one thing colleges get RIGHT
– MIT, Stanford, and others have great database theory classes
– Warning, many use python – it won't kill you
• Books
– http://web.cecs.pdx.edu/~maier/TheoryBook/TRD.html - The Theory of Relational Databases
– https://www.amazon.com/Database-Design-Relational-Theory-Normal/dp/1449328016 -
Database Design and Relational Theory
– http://databaserefactoring.com/ - Database Refactoring
Wouldn't it be great if everyone had a DBA to design and manage data for you? Most places don't have this luxury; instead the burden falls on the developer. Your application is awesome, people are using it everywhere. But is your data storage designed to scale to millions of users in a way that's economical and efficient? Data modeling and theory is the process of taking your application and designing how to store and process your data in a way that won't melt down. This talk will walk through proper data modeling, choosing a data storage type, choosing database software, and architecting data relationships in your system. We'll also walk through "refactoring data" using normalization and optimization.
This talk is mainly designed for people (like me) who start off developing and realize that they are not only the dev but the dba and everything else
Tell a story about moving a website (in 1998) from storage in flat html files into a database and having no idea what I was doing
A DBA has a lot of hats they have to wear
Knowledge of database Queries
Knowledge of database theory
Knowledge of database design
Knowledge about the RDBMS itself, e.g. Microsoft SQL Server or MySQL
Knowledge of structured query language (SQL), e.g. SQL/PSM or Transact-SQL
General understanding of distributed computing architectures, e.g. Client–server model
General understanding of operating system, e.g. Windows or Linux
General understanding of storage technologies and networking
General understanding of routine maintenance, recovery, and handling failover of a database
Basically DBAs wear two hats – one that has to do with day to day maintenance and is more of an IT position – this includes tuning systems, troubleshooting, backups, etc.
And then there is the design and architecture portion of being a DBA – which is generally the part a programmer gets shoved into with little or no preparation.
This talk is designed to give you a crash course in the database theory and modeling portion of being a DBA, and how to make smart choices in your code
Database theory is all the ways that we store and manage data all these other things below it are parts of database theory
finite model theory deals with the relation between a formal language (syntax) and its interpretations (semantics)
Database design involves classifying data and identifying interrelationships. This theoretical representation of the data is called an ontology – which is the theory behind the database's design.
dependency theory studies implication and optimization problems related to logical constraints, commonly called dependencies, on databases
concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.
deductive database is a database system that can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database (datalog and prolog)
temporal and spatial database are special types storing time data and spatial data like polygons, points, and lines
uncertain data is data that contains noise that makes it deviate from the correct, intended or original values
how many does the audience understand or can name?
Wait – why are we modeling our database before we pick what database software technology to use?
We have a saying in my current position that answers those user questions of "would it be possible to?"
Anything is possible – how useful and how much effort is involved are the more important questions
Although you could make a database technology store ANY kind of data (and I've seen some pretty horrific shoehorning in my career) you and everyone else will be a lot happier
if your software choices help instead of hinder what you're trying to accomplish
But first, you must figure out your data
What are you trying to store and how are you trying to store it?
Or if this isn't a shiny greenfield project – what are you currently storing and how, then what would be the ideal way to store and access the data.
yes, you can (and should!) refactor your data models! Twisting the code into knots or doing things in code the database should be doing is a recipe for down-time
(story time – working on an unnamed project to protect the innocent and the guilty, I ended up writing a schema on top of a mongodb system instead of storing the data in a relational database and having the
program output appropriate json stored in a cached format)
The quality of your data model can severely help or hinder your future work
Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces
Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25-70% of the cost of current systems
Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardized. For example, engineering design data and drawings for process plant are still sometimes exchanged on paper
Another story about us currently dealing with this structure and meaning of data problem – the people running the machines on the floor expect different things from the cnc programmers who expect different things from the engineers. We're currently working on bundling all the data in electronic format needed for each step of the process in a data structure that is defined and standardized
Although this is not the ONLY way to do things, it is a very GOOD way to do things
This idea of 3 levels of architecture originated in the 1970s
American National Standards Institute. 1975. ANSI/X3/SPARC Study Group on Data Base Management Systems; Interim Report.
yes, sparc, you heard right
I'll talk about this later – but database theory hasn't really changed a lot – the basic mathematical and logical theories underlying databases and how they work haven't changed
Only our implementations of those theories have changed
Are your brains bleeding yet? Let's get a little more hands on
Creating a conceptual model of your data can be the most difficult part of any process
Often you're asked to do this when you're not the "domain owner"
This is not your data and you don’t quite know what people do with it
The BEST way to get this information is to ASK, and then to LISTEN (and write stuff down)
Drawing pictures works well too – simple diagrams help people understand
So this is a pretty basic place to start
In my "concept" I have a list of concrete things (a video game) and I want to be able to keep track of information about these games
So this is my basic concept,
So I have a conceptual model of my games – the game has information about it like a name and the system it's played on
The game also has some keywords I can use for searching – like a game category such as rpg or a play style type such as first person
Then I want to collect information about playing the game – the player name, the last date they played, if the game was completed or not
After the conceptual model for the data is found we need to turn this into a logical model
So the logical model is a method of mapping this stuff into what we expect
And anyone who has ever had to deal with any type of businesses knows their favorite method of storing data
Excel! Because a spreadsheet is the BEST way of storing data right?
In this case we're starting with just a flat model – a way of representing stuff in a straightforward way
But, this usually doesn't work really well
First of all, we have a spot where there is no information – I hate racing games and first person shooters – Patrick is not as gung ho about jrpgs
So any rows with those kinds of games will have "empty" columns
That's not very smart
Part of transitioning our conceptual model to our logical model involves dealing with relationships
But what kind of relationships are most important for our data? Well there's one I see right now…
So all the games do have the advantage of being grouped by system.
So I could do a hierarchical model of that
But that doesn't really work that fantastically does it? Although it does give me an idea of what kind of data I have
but remember, some types of data are not a hierarchy
Some types of data are not flat
Some types of data are not relational, but in this case my data IS
relational data means you have things that – well – have a relationship with each other
so we have an idea of the type of data we want to collect – how do we make a decision on what to use?
so relational databases are the oldies but goodies
originally proposed by E. F. Codd in 1970
almost all dbs use sql for querying and maintaining the db
intended to guarantee validity even in the event of errors, power failures, etc. In the context of databases, a sequence of database operations that satisfies the ACID properties, and thus can be perceived as a single logical operation on the data, is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.
Atomicity
Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors and crashes.
Consistency
Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This prevents database corruption by an illegal transaction, but does not guarantee that a transaction is correct.
Isolation
Transactions are often executed concurrently (e.g., reading and writing to multiple tables at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. Isolation is the main goal of concurrency control; depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions.
Durability
Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.
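The funds-transfer example above can be sketched with sqlite3; a CHECK constraint plays the part of the business rule that forces a rollback (table names and amounts are illustrative):

```python
import sqlite3

# Both updates commit together, or a failure rolls the whole thing back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(amount, src, dst):
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        pass  # overdraft violates the CHECK constraint; nothing is applied

transfer(30, 1, 2)    # succeeds: balances become 70 / 80
transfer(500, 1, 2)   # fails the CHECK; both updates are rolled back
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)  # [(70,), (80,)]
```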
designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown[1] with the use of the term NoSQL itself. XML databases are a subclass of document-oriented databases that are optimized to work with XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.
Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization.
For many domains and use cases, ACID transactions are far more pessimistic (i.e., they’re more worried about data safety) than the domain actually requires.
although some document databases are starting to bring in some of the features of RDBMSs (schemas and ACID compliance) – there's a tradeoff in speed for that ;)
Basic Availability
The database appears to work most of the time.
Soft-state
Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency
Stores exhibit consistency at some later point (e.g., lazily at read time).
Given BASE’s loose consistency, developers need to be more knowledgeable and rigorous about consistent data if they choose a BASE store for their application. It’s essential to be familiar with the BASE behavior of your chosen aggregate store and work within those constraints.On the other hand, planning around BASE limitations can sometimes be a major disadvantage when compared to the simplicity of ACID transactions. A fully ACID database is the perfect fit for use cases where data reliability and consistency are essential.
is a system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place[2] that are used for creating analytical reports for workers throughout the enterprise.[3]
The typical Extract, transform, load (ETL)-based data warehouse[4] uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.
OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are : Roll-up (Consolidation), Drill-down and Slicing & Dicing.
OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data.
Benefits
Integrate data from multiple sources into a single database and data model. Congregating data into a single database means a single query engine can be used to present data in an ODS.
Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long-running, analysis queries in transaction processing databases.
Maintain data history, even if the source transaction systems do not.
Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
Present the organization's information consistently.
Provide a single common data model for all data of interest regardless of the data's source.
Restructure the data so that it makes sense to the business users.
Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
Add value to operational business applications, notably customer relationship management (CRM) systems.
Make decision–support queries easier to write.
Organize and disambiguate repetitive data.[7]
Read replicas allow data to be available for reading across any number of servers, called “slaves”. One server remains the “master” and accepts any incoming write requests, along with read requests. This technique is common for relational databases, as most vendors support replication of data to multiple read-only servers. The more read replicas installed, the more read-based queries may be scaled.
While the read replica technique allows for scaling out reads, what happens if you need to scale out to a large number of writes as well? The multi-master technique may be used to allow any client to write data to any database server. This enables all read replicas to be a master rather than just slaves. This enables applications to scale out the number of reads and writes. However, this also requires that our applications generate universally unique identifiers, also known as "UUIDs", sometimes referred to as globally unique identifiers or "GUIDs". Otherwise, two rows in the same table on two different servers might result in the same ID, causing a data collision during the multi-master replication process.
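Generating UUID keys is a one-liner in most languages; for example, in Python (the row shape here is just an illustration of rows born on different masters):

```python
import uuid

# Rows created on different masters get UUID keys, so replication can
# merge them without the id collisions auto-increment integers would cause.
row_on_server_a = {"id": str(uuid.uuid4()), "title": "FFX"}
row_on_server_b = {"id": str(uuid.uuid4()), "title": "Forza 4"}

# The ids are generated independently yet (with overwhelming probability) unique.
assert row_on_server_a["id"] != row_on_server_b["id"]
print(row_on_server_a["id"])  # e.g. '9f1c0f1e-...' (random each run)
```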
Very large data sets often produce so much data that any one server cannot access or modify the data by itself without severely impacting scale and performance. This kind of problem cannot be solved through read replicas or multi-master designs. Instead, the data must be separated in some way to allow it to be easily accessible.
Horizontal partitioning, also called "sharding", distributes data across servers. Data may be partitioned to different server(s) based on a specific customer/tenant, date range, or other sharding scheme. Vertical partitioning separates the data associated with a single table and groups it into frequently accessed and rarely accessed. The pattern chosen allows the database and database cache to manage less information at once. In some cases, data patterns may be selected to move data across multiple filesystems for parallel reading and therefore increased performance.
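One hedged sketch of a hash-based sharding scheme (the shard count and key naming are assumptions for illustration, not a production design):

```python
import hashlib

# Route each customer's rows to one of N shards based on a
# stable hash of the shard key.
NUM_SHARDS = 4

def shard_for(customer_id: str) -> int:
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard, so reads find the data again.
assert shard_for("customer-42") == shard_for("customer-42")
shards = {shard_for(f"customer-{n}") for n in range(100)}
print(sorted(shards))  # keys spread across the shards, typically [0, 1, 2, 3]
```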
GDPR
Brewer's theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:[1][2][3]
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response – without guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Think of this as being a riff on "fast/cheap/good"
you get two!
Database systems designed with traditional ACID guarantees in mind such as RDBMS choose consistency over availability, whereas systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.[6]
There are lots of choices that have come into play that are beyond just the technical considerations. Price, availability, what your CEO read in the magazine last week will all contribute to this.
Can your IT department install this?
MySQL does a middling job of everything except being easy to install and administer
So – let's talk about normalizing data
normalizing data has a couple purposes but is not the be all end all of databases
generally however, normalization SOLVES more problems than it creates
Basically normalization exists to help get rid of anomalies in data
This means that the data is the same for all things in all places, and we aren't storing duplication AND POSSIBLY INCORRECT data
What if you spell a name with "ck" in one row and "que" in another?
What if Patrick moves out and I remove all his game data from my database, except for 50 rows I forgot?
This may seem to be a small thing, but small data can build up over time and take up lots more space than you'd expect!
It really is designed to decrease the amount of pain and suffering when iterating on the design of the database
So how would we structure my database application?
atomic values basically means you're storing only ONE value – so you can't do two telephone numbers in a telephone column
now the atomic thing is rather interesting, since one could argue that dates or strings can be "decomposed" – which is the definition of atomic. In current usage, "atomic" basically means "not xml or json or some other representation of complex data" … or it's simply ignored
This basically means that every table should be related to the primary key of the first table
Partial dependencies are removed, i.e., all non key attributes are fully functional dependent on the primary key. In other words, nonkey attributes cannot depend on a subset of the primary key.
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key." "so help me Codd".[8]
- That's Edgar F. Codd, who invented the relational model of database management while working for IBM
Requiring existence of "the key" ensures that the table is in 1NF;
requiring that non-key attributes be dependent on "the whole key" ensures 2NF;
further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF.
And…. no one cares
Now that I've preached on how to normalize databases, I'm going to tell you it's perfectly fine to denormalize
AFTER you've normalized and AS NEEDED
you may find that one or two queries or tables constitutes most of your speed problems and judicious use of denormalization can help
Often you'll see subsets of this as zero or one, only one, one to zero or many
you should be connecting tables that represent entity types
many to many relations are generally done using an association table – the relationship becomes an entity in a table linking them together
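For the games example, the many-to-many between games and players could become a `play` association table like this (sqlite3 again as a stand-in; names are illustrative):

```python
import sqlite3

# The relationship becomes its own table, which also carries the
# per-relationship facts (last played, completed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE play (
    game_id INTEGER REFERENCES game(id),
    player_id INTEGER REFERENCES player(id),
    last_played TEXT,
    completed INTEGER,
    PRIMARY KEY (game_id, player_id));
INSERT INTO game VALUES (1, 'FFX');
INSERT INTO player VALUES (1, 'Liz'), (2, 'Pat');
INSERT INTO play VALUES (1, 1, '2016-05-01', 1), (1, 2, '2016-06-04', 0);
""")
rows = conn.execute("""SELECT player.name, play.last_played, play.completed
                       FROM play JOIN player ON play.player_id = player.id
                       WHERE play.game_id = 1 ORDER BY player.name""").fetchall()
print(rows)  # [('Liz', '2016-05-01', 1), ('Pat', '2016-06-04', 0)]
```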
states two letter abbreviation
So on to the last part of being a dba – that usually comes after you have stuff written
You have to optimize it!
but what does it mean to "optimize" your database
What does "fast" mean for a database?
the answer is always – it depends
Are you focused on your data always being correct? or on fast load times? or on small storage space?
As in all things you're not always going to be able to optimize for all things
Usually faster is going to mean you are storing more on disk – via caching or a denormalized layout or something else
Usually correct data is going to come about by making things less concurrent and more robust – more checks (hence… slower)
Usually small size means you're storing as little as possible in a very optimized way, which generally means more work for your application
As long as you understand the tradeoffs you can "speed things up"
No matter what you do to optimize you are going to hit physical barriers
Sometimes that means "speeding up your database" means throwing more hardware at the problem
There is a finite amount of processing that any system will be able to do. So the solution may be two systems instead
Most of this section tends to go to a bit of "no brainer" land
You want your db to go faster?
keep your software up to date
those are both "easy" in theory but possibly "expensive" in practice
But building in a cadence of upgrading systems will keep you and your users happier
Tune your database management system – that sounds "easy" as well but is made more difficult by the fact that each vendor has its own requirements for tuning
But generally this is a process of checking your vendor for best practices and benchmarking for memory allocation, caches, concurrency settings (like reserving processors or memory) and fiddling with network protocols
maintenance tasks can involve things like vacuuming postgresql dbs or defragmentation, statistics updates, adjust the size of transaction logs and rotate and offload logging
I had an SQL Server system running like a dog
a 50GB transaction log from a migration will do that to you
This should be last on your list. And don't just guess, actually check which queries are slow. Almost every database has a way to log slow queries
And most frameworks and db abstraction layers have logging and timing functionality to catch exceptionally slow queries
The biggest issue with refactoring data is the possibility of data loss
so most people tend to shy away from large data refactors EVEN if a data refactor would cut their code in half
This is a fallacy – think about the word refactoring – it's a small change to the database schema that improves its design without changing its semantics
The #1 issue with database refactoring is COMMUNICATION BETWEEN THOSE RESPONSIBLE FOR THE CODE AND THOSE RESPONSIBLE FOR THE DATABASE
code refactorings only need to maintain behavioral semantics while database refactorings also must maintain informational semantics
Database refactoring does not change the way data is interpreted or used and does not fix bugs or add new functionality. Every refactoring to a database leaves the system in a working state, thus not causing maintenance lags, provided the meaningful data exists in the production environment.
These are generally some of the easiest and most effective refactors you can do on a database
Discuss briefly how each thing could help with making your application better
lookup table is easy
Standard code would be making sure the same country/state codes as those in a lookup table are used
standard type would be making sure all phone numbers are the same sized integer
make sure your column constraint gives you logical values – like age should be > 0 but less than 200
make sure all your phone numbers are stored as integers with no separator values
Most of these will require two steps
change the code to make sure the values are checked properly before coming in
Run a migration on the data to make sure the values are correct
Change the database if necessary
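A sketch of those steps for the phone-number case, using sqlite3 (SQLite can't add a CHECK constraint to an existing table, so the last step rebuilds the table; most other engines can ALTER in place; names and the digits-only format are illustrative):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany("INSERT INTO contact VALUES (?, ?)",
                 [(1, "555-123-4567"), (2, "(555) 987 6543")])

# Migration pass: strip separators so every existing phone is digits only.
for row_id, phone in conn.execute("SELECT id, phone FROM contact").fetchall():
    conn.execute("UPDATE contact SET phone = ? WHERE id = ?",
                 (re.sub(r"\D", "", phone), row_id))

# Change the database: enforce the format going forward with a CHECK.
conn.executescript("""
CREATE TABLE contact_new (id INTEGER PRIMARY KEY,
                          phone TEXT CHECK (phone NOT GLOB '*[^0-9]*'));
INSERT INTO contact_new SELECT * FROM contact;
DROP TABLE contact;
ALTER TABLE contact_new RENAME TO contact;
""")
phones = conn.execute("SELECT phone FROM contact ORDER BY id").fetchall()
print(phones)  # [('5551234567',), ('5559876543',)]
```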
These are also less "lossy" types of refactoring but tend to improve the quality of the data being stored
by element here I mean
Table
View
Column
this is the "hard" problems
The changes that might make your code much nicer, but require a good deal of work
And without tests!! and backups!! this can bite you
The best thing to do in this case is make SMALL changes a little at a time
AND TEST
These are generally large changes to the actual architecture of the application, not just to the relationships or the data or the structure
These are changes that can have the greatest impact on performance
There are a lot of places to learn more about databases. But the really BEST way to learn is to DO
play around with a new system. Think of how you'd redo your present storage mechanism if you could
It might lead to actually being able to do it for real
Aurora Eos Rose is the handle I've had forever – Greek and Roman goddesses of the dawn, and Aurora Rose from Sleeping Beauty