2. WHAT IS A DBA?
DEVELOPMENT
• Capacity Planning
• Database Design
• Database Implementation
• Migration
OPERATIONS
• Installation
• Configuration
• Monitoring
• Security and Access Management
• Troubleshooting
• Backup and Recovery
3. DATABASE THEORY
THE STUDY OF DATABASES AND DATA MANAGEMENT SYSTEMS
• Finite Model Theory
• Database Design Theory
• Dependency Theory
• Concurrency Control
• Deductive Databases
• Temporal and Spatial Databases
• Uncertain Data
• Query Languages and Logic
4. DATA MODELING
TURNING BUSINESS REQUIREMENTS INTO DATA ROADMAPS
5. REASONS FOR MODELING DATA
WHAT?
• Provide a definition of our data
• Provide a format for our data
WHY?
• Compatibility of data
• Lower cost to build, operate, and
maintain systems
6. THREE KINDS OF DATA MODEL
INSTANCES
• Conceptual Data Model
• Logical (External) Data Model
• Physical Data Model
7. CONCEPTUAL MODEL
• Entities that comprise your data
• Creating data objects
• Identifying any relationships between objects
• "Business Requirements"
8. PROJECT SCOPE
MY BUSINESS REQUIREMENTS
• I have a lot of video games
• I want a simple way to be able to find my video games by keywords
• And keep track of what system they are for
• And keep track of when I last played them and when someone else played them
• And keep track of if I beat them, and my kids too
10. LOGICAL MODEL – FLAT MODEL
Game Title     | System  | Liz Last Play | Pat Last Play | Liz Complete | Pat Complete | Keywords
FFX            | PS2     | 2016-05-01    | 2016-06-04    | Yes          | No           | fantasy, jrpg
Chrono Trigger | PS1     | 2014-07-05    |               | Yes          | No           | jrpg
Forza 4        | Xbox360 |               | 2017-03-02    | No           | No           | racing
12. RELATIONAL MODEL
• I have a system
• I have a game
• I have a player
• Each game has one system, each system can have many games
• Games can have many players, each player can have additional information
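A minimal sketch of the first of those relationships, using Python's built-in sqlite3 as a stand-in for whatever RDBMS you pick (table and column names here are illustrative, not prescribed by the talk):

```python
import sqlite3

# Each game references exactly one system; each system can have many games.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE system (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE game (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    system_id INTEGER NOT NULL REFERENCES system(id))""")
conn.execute("INSERT INTO system (id, name) VALUES (1, 'PS2'), (2, 'Xbox360')")
conn.execute("INSERT INTO game (title, system_id) VALUES ('FFX', 1), ('Forza 4', 2)")

# A join resolves the one-to-many relationship back into readable rows.
rows = conn.execute("""SELECT game.title, system.name
                       FROM game JOIN system ON game.system_id = system.id
                       ORDER BY game.title""").fetchall()
print(rows)  # [('FFX', 'PS2'), ('Forza 4', 'Xbox360')]
```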
16. DOCUMENT DATABASES
• Schemaless
• Good Performance
• Speedy and Distributed
• Consistency model is BASE
• Graph Databases are Document Databases with relationships added for traversal
18. DATA WAREHOUSES
• A place to aggregate and store data for reporting and analysis
• ETL
– Extract
– Transform
– Load
• Data Mart (single subject area)
• OLAP (Online Analytical Processing)
• OLTP (Online Transaction Processing)
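A toy ETL pass might look like this (the `orders` source table and `daily_sales` warehouse table are hypothetical, and sqlite3 stands in for both the real source and warehouse systems):

```python
import sqlite3

# Source OLTP system with detailed, current rows.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, placed_on TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1250, "2017-03-01"), (2, 900, "2017-03-01"), (3, 400, "2017-03-02")])

# Warehouse holding aggregated facts for reporting.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total_dollars REAL)")

# Extract: pull raw rows out of the source system.
rows = src.execute("SELECT placed_on, amount_cents FROM orders").fetchall()

# Transform: aggregate per day and convert cents to dollars.
totals = {}
for day, cents in rows:
    totals[day] = totals.get(day, 0) + cents
transformed = [(day, cents / 100.0) for day, cents in sorted(totals.items())]

# Load: write the summarized facts into the warehouse.
dw.executemany("INSERT INTO daily_sales VALUES (?, ?)", transformed)
print(dw.execute("SELECT * FROM daily_sales").fetchall())
# [('2017-03-01', 21.5), ('2017-03-02', 4.0)]
```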
21. CHOOSE… WISELY
• Politics will factor into this!
• You don't have to pick just one
• Choose the right solution for the right problem
• With so much available in cloud services and the ease of using containers, spinning up
a lightweight Redis instance to use in addition to your PostgreSQL server costs very
little extra!
23. NO MORE ANOMALIES
• Update Anomaly
• Insertion Anomaly
• Deletion Anomaly
• Fidelity Anomaly
24. NO DUPLICATED DATA
MINIMIZE REDESIGN ON EXTENSION
• Store all data in only one place
• What happens if I add an additional family member I want to track in my application
• The normalized version makes this simple
25. FIRST NORMAL FORM
1NF
• Has a Primary Key – can be a COMPOUND key
• Has only atomic values
• Has no repeated columns
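For example, the comma-separated keywords column in the flat model violates atomicity. A hedged sketch of the 1NF fix, using sqlite3 with illustrative names:

```python
import sqlite3

# The comma-separated "keywords" column is not atomic,
# so it moves into its own child table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("""CREATE TABLE game_keyword (
    game_id INTEGER REFERENCES game(id),
    keyword TEXT,
    PRIMARY KEY (game_id, keyword))""")  # compound key, one atomic value per row

flat_row = (1, "FFX", "fantasy, jrpg")   # the pre-1NF shape
game_id, title, keywords = flat_row
conn.execute("INSERT INTO game VALUES (?, ?)", (game_id, title))
conn.executemany("INSERT INTO game_keyword VALUES (?, ?)",
                 [(game_id, k.strip()) for k in keywords.split(",")])

kw = conn.execute(
    "SELECT keyword FROM game_keyword WHERE game_id = 1 ORDER BY keyword").fetchall()
print(kw)  # [('fantasy',), ('jrpg',)]
```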
28. BUT WAIT – THERE'S MORE!
• 7 more to be exact
• They're not really that useful in most situations
• You can learn about them from Wikipedia!
29. DENORMALIZATION
• Wait – didn't you just say to normalize things?
• Usually has one purpose – increased performance – and should be used sparingly
• Doesn't have to be "full" denormalization
– Storing count totals of many elements in a relationship
– star schema "fact-dimension" models
– prebuilt summarizations
30. RELATIONSHIPS
CARDINALITY BETWEEN ALL THE THINGS
35. PHYSICS MATTERS
• Make sure you have enough hardware
• Tune your I/O
– Block and Stripe size allocation for RAID configuration
– Transaction logs in the right spot
– Frequently joined tables on separate discs
• Tune your network protocols
• Adjust cache sizes
36. UPDATE ALL THE THINGS
• Update your operating system
• Update your db software
• Update your communications protocols
37. TUNE YOUR SYSTEMS
• Check your vendor for configuration tuning
• Perform your recommended maintenance tasks
38. PROFILE YOUR CODE
• Check for slow queries
• Check the execution plan on the queries
• Add Indexes to speed up joins
• Rewrite or alter queries to make them perform faster
• Create views for common queries that can be indexed separately
– This is best for common joins
• Move routines for data manipulation into stored procedures
• Create cached or denormalized versions of really slow queries
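Checking an execution plan is engine-specific; here is what it looks like in SQLite with EXPLAIN QUERY PLAN (other engines have their own EXPLAIN syntax, and the exact plan text varies by SQLite version):

```python
import sqlite3

# Inspect the plan for a query before and after adding an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE play_log (game_id INTEGER, played_on TEXT)")

query = "SELECT * FROM play_log WHERE game_id = 42"
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before)  # detail column typically reads 'SCAN play_log' (a full table scan)

conn.execute("CREATE INDEX idx_play_log_game ON play_log(game_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after)   # now a SEARCH ... USING INDEX idx_play_log_game (game_id=?)
```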
40. REFERENTIAL INTEGRITY
REFACTORING
• Add constraints
• Remove constraints
• Add Hard Delete
• Add Soft Delete
• Add Trigger for Calculated Column
• Add Trigger for History
• Add Indexes
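As one example from that list, a history trigger in SQLite might look like this (names are illustrative, and trigger syntax differs between vendors):

```python
import sqlite3

# A trigger that records the old value into a history table on every update.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE game_history (game_id INTEGER, old_title TEXT,
                           changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TRIGGER game_update_history AFTER UPDATE ON game
BEGIN
    INSERT INTO game_history (game_id, old_title) VALUES (OLD.id, OLD.title);
END;
""")
conn.execute("INSERT INTO game VALUES (1, 'FFX')")
conn.execute("UPDATE game SET title = 'Final Fantasy X' WHERE id = 1")

history = conn.execute("SELECT game_id, old_title FROM game_history").fetchall()
print(history)  # [(1, 'FFX')]
```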
41. DATA QUALITY REFACTORING
• Add lookup table
• Apply Standard codes
• Apply Standard Type
• Add a column constraint
• Introduce common format
42. STRUCTURAL REFACTORING
• Add a new element
• Delete an existing element
• Merge elements
• Change association types
• Split elements
43. ARCHITECTURE REFACTORING
• Replace a method with a view
• Add a calculation method
• Encapsulate a table with a view
• Add a mirror table
• Add a read only table
44. LEARNING MORE
• Free University Courses
– Databases are one thing colleges get RIGHT
– MIT, Stanford, and others have great database theory classes
– Warning, many use python – it won't kill you
• Books
– http://web.cecs.pdx.edu/~maier/TheoryBook/TRD.html - The Theory of Relational Databases
– https://www.amazon.com/Database-Design-Relational-Theory-Normal/dp/1449328016 -
Database Design and Relational Theory
– http://databaserefactoring.com/ - Database Refactoring
Wouldn't it be great if everyone had a DBA to design and manage data for you? Most places don't have this luxury; instead the burden falls on the developer. Your application is awesome, people are using it everywhere. But is your data storage designed to scale to millions of users in a way that's economical and efficient? Data modeling and theory is the process of taking your application and designing how to store and process your data in a way that won't melt down. This talk will walk through proper data modeling, choosing a data storage type, choosing database software, and architecting data relationships in your system. We'll also walk through "refactoring data" using normalization and optimization.
This talk is mainly designed for people (like me) who start off developing and realize that they are not only the dev but the dba and everything else
Tell a story about moving a website (in 1998) from storage in flat html files into a database and having no idea what I was doing
A DBA has a lot of hats they have to wear
Knowledge of database Queries
Knowledge of database theory
Knowledge of database design
Knowledge about the RDBMS itself, e.g. Microsoft SQL Server or MySQL
Knowledge of structured query language (SQL), e.g. SQL/PSM or Transact-SQL
General understanding of distributed computing architectures, e.g. Client–server model
General understanding of operating system, e.g. Windows or Linux
General understanding of storage technologies and networking
General understanding of routine maintenance, recovery, and handling failover of a database
Basically DBAs wear two hats – one that has to do with day to day maintenance and is more of an IT position – this includes tuning systems, troubleshooting, backups, etc.
And then there is the design and architecture portion of being a DBA – which is generally the part a programmer gets shoved into with little or no preparation.
This talk is designed to give you a crash course in the database theory and modeling portion of being a DBA, and how to make smart choices in your code
Database theory is all the ways that we store and manage data all these other things below it are parts of database theory
finite model theory deals with the relation between a formal language (syntax) and its interpretations (semantics)
Database design involves classifying data and identifying interrelationships. This theoretical representation of the data is called an ontology – which is the theory behind the database's design.
dependency theory studies implication and optimization problems related to logical constraints, commonly called dependencies, on databases
concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible.
deductive database is a database system that can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database (datalog and prolog)
temporal and spatial database are special types storing time data and spatial data like polygons, points, and lines
uncertain data is data that contains noise that makes it deviate from the correct, intended or original values
how many does the audience understand or can name?
Wait – why are we modeling our database before we pick what database software technology to use?
We have a saying in my current position that answers those user questions of "would it be possible to?"
Anything is possible – how useful and how much effort is involved are the more important questions
Although you could make a database technology store ANY kind of data (and I've seen some pretty horrific shoehorning in my career) you and everyone else will be a lot happier
if your software choices help instead of hinder what you're trying to accomplish
But first, you must figure out your data
What are you trying to store and how are you trying to store it?
Or if this isn't a shiny greenfield project – what are you currently storing and how, then what would be the ideal way to store and access the data.
yes, you can (and should!) refactor your data models! Twisting the code into knots or doing things in code the database should be doing is a recipe for down-time
(story time – working on an unnamed project to protect the innocent and the guilty, I ended up writing a schema on top of a mongodb system instead of storing the data in a relational database and having the
program output appropriate json stored in a cached format)
The quality of your data model can severely help or hinder your future work
Business rules, specific to how things are done in a particular place, are often fixed in the structure of a data model. This means that small changes in the way business is conducted lead to large changes in computer systems and interfaces
Data models for different systems are arbitrarily different. The result of this is that complex interfaces are required between systems that share data. These interfaces can account for between 25-70% of the cost of current systems
Data cannot be shared electronically with customers and suppliers, because the structure and meaning of data has not been standardized. For example, engineering design data and drawings for process plant are still sometimes exchanged on paper
Another story about us currently dealing with this structure and meaning of data problem – the people running the machines on the floor expect different things from the cnc programmers who expect different things from the engineers. We're currently working on bundling all the data in electronic format needed for each step of the process in a data structure that is defined and standardized
Although this is not the ONLY way to do things, it is a very GOOD way to do things
This idea of 3 levels of architecture originated in the 1970s
American National Standards Institute. 1975. ANSI/X3/SPARC Study Group on Data Base Management Systems; Interim Report.
yes, sparc, you heard right
I'll talk about this later – but database theory hasn't really changed a lot – the basic mathematical and logical theories underlying databases and how they work haven't changed
Only our implementations of those theories have changed
Are your brains bleeding yet? Let's get a little more hands on
Creating a conceptual model of your data can be the most difficult part of any process
Often you're asked to do this when you're not the "domain owner"
This is not your data and you don’t quite know what people do with it
The BEST way to get this information is to ASK, and then to LISTEN (and write stuff down)
Drawing pictures works well too – simple diagrams help people understand
So this is a pretty basic place to start
In my "concept" I have a list of concrete things (a video game) and I want to be able to keep track of information about these games
So this is my basic concept,
So I have a conceptual model of my games – the game has information about it like a name and the system it's played on
The game also has some keywords I can use for searching – like a game category such as rpg or a play style type such as first person
Then I want to collect information about playing the game – the player name, the last date they played, if the game was completed or not
After the conceptual model for the data is found we need to turn this into a logical model
So the logical model is a method of mapping this stuff into what we expect
And anyone who has ever had to deal with any type of businesses knows their favorite method of storing data
Excel! Because a spreadsheet is the BEST way of storing data right?
In this case we're starting with just a flat model – a way of representing stuff in a straightforward way
But, this usually doesn't work really well
First of all, we have a spot where there is no information – I hate racing games and first person shooters – Patrick is not as gung ho about jrpgs
So any rows with those kinds of games will have "empty" columns
That's not very smart
Part of transitioning our conceptual model to our logical model involves dealing with relationships
But what kind of relationships are most important for our data? Well there's one I see right now…
So all the games do have the advantage of being grouped by system.
So I could do a hierarchical model of that
But that doesn't really work that fantastically does it? Although it does give me an idea of what kind of data I have
but remember, some types of data are not a hierarchy
Some types of data are not flat
Some types of data are not relational, but in this case my data IS
relational data means you have things that – well – have a relationship with each other
so we have an idea of the type of data we want to collect – how do we make a decision on what to use?
so relational databases are the oldies but goodies
originally proposed by E. F. Codd in 1970
almost all dbs use sql for querying and maintaining the db
intended to guarantee validity even in the event of errors, power failures, etc. In the context of databases, a sequence of database operations that satisfies the ACID properties, and thus can be perceived as a single logical operation on the data, is called a transaction. For example, a transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction.
Atomicity
Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors and crashes.
Consistency
Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This prevents database corruption by an illegal transaction, but does not guarantee that a transaction is correct.
Isolation
Transactions are often executed concurrently (e.g., reading and writing to multiple tables at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. Isolation is the main goal of concurrency control; depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions.
Durability
Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.
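The funds-transfer example above can be sketched with sqlite3; a CHECK constraint plays the part of the business rule that forces a rollback (table names and amounts are illustrative):

```python
import sqlite3

# Both updates commit together, or a failure rolls the whole thing back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(amount, src, dst):
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except sqlite3.IntegrityError:
        pass  # overdraft violates the CHECK constraint; nothing is applied

transfer(30, 1, 2)    # succeeds: balances become 70 / 80
transfer(500, 1, 2)   # fails the CHECK; both updates are rolled back
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)  # [(70,), (80,)]
```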
designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" has grown[1] with the use of the term NoSQL itself. XML databases are a subclass of document-oriented databases that are optimized to work with XML documents. Graph databases are similar, but add another layer, the relationship, which allows them to link documents for rapid traversal.
Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept. The difference lies in the way the data is processed; in a key-value store, the data is considered to be inherently opaque to the database, whereas a document-oriented system relies on internal structure in the document in order to extract metadata that the database engine uses for further optimization.
For many domains and use cases, ACID transactions are far more pessimistic (i.e., they’re more worried about data safety) than the domain actually requires.
although some document databases are starting to bring in some of the features of RDBMSs (schemas and ACID compliance) – there's a tradeoff in speed for that ;)
Basic Availability
The database appears to work most of the time.
Soft-state
Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency
Stores exhibit consistency at some later point (e.g., lazily at read time).
Given BASE’s loose consistency, developers need to be more knowledgeable and rigorous about consistent data if they choose a BASE store for their application. It’s essential to be familiar with the BASE behavior of your chosen aggregate store and work within those constraints.On the other hand, planning around BASE limitations can sometimes be a major disadvantage when compared to the simplicity of ACID transactions. A fully ACID database is the perfect fit for use cases where data reliability and consistency are essential.
is a system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place[2] that are used for creating analytical reports for workers throughout the enterprise.[3]
The typical Extract, transform, load (ETL)-based data warehouse[4] uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.
OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are : Roll-up (Consolidation), Drill-down and Slicing & Dicing.
OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data.
Benefits
Integrate data from multiple sources into a single database and data model. Congregating data into a single database means a single query engine can be used to present data in an ODS.
Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long-running, analysis queries in transaction processing databases.
Maintain data history, even if the source transaction systems do not.
Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
Present the organization's information consistently.
Provide a single common data model for all data of interest regardless of the data's source.
Restructure the data so that it makes sense to the business users.
Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
Add value to operational business applications, notably customer relationship management (CRM) systems.
Make decision–support queries easier to write.
Organize and disambiguate repetitive data.[7]
Read replicas allow data to be available for reading across any number of servers, called “slaves”. One server remains the “master” and accepts any incoming write requests, along with read requests. This technique is common for relational databases, as most vendors support replication of data to multiple read-only servers. The more read replicas installed, the more read-based queries may be scaled.
While the read replica technique allows for scaling out reads, what happens if you need to scale out to a large number of writes as well? The multi-master technique may be used to allow any client to write data to any database server. This enables all read replicas to be a master rather than just slaves. This enables applications to scale out the number of reads and writes. However, this also requires that our applications generate universally unique identifiers, also known as "UUIDs", sometimes referred to as globally unique identifiers or "GUIDs". Otherwise, two rows in the same table on two different servers might result in the same ID, causing a data collision during the multi-master replication process.
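Generating UUID keys is a one-liner in most languages; for example, in Python (the row shape here is just an illustration of rows born on different masters):

```python
import uuid

# Rows created on different masters get UUID keys, so replication can
# merge them without the id collisions auto-increment integers would cause.
row_on_server_a = {"id": str(uuid.uuid4()), "title": "FFX"}
row_on_server_b = {"id": str(uuid.uuid4()), "title": "Forza 4"}

# The ids are generated independently yet (with overwhelming probability) unique.
assert row_on_server_a["id"] != row_on_server_b["id"]
print(row_on_server_a["id"])  # e.g. '9f1c0f1e-...' (random each run)
```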
Very large data sets often produce so much data that any one server cannot access or modify the data by itself without severely impacting scale and performance. This kind of problem cannot be solved through read replicas or multi-master designs. Instead, the data must be separated in some way to allow it to be easily accessible.
Horizontal partitioning, also called "sharding", distributes data across servers. Data may be partitioned to different server(s) based on a specific customer/tenant, date range, or other sharding scheme. Vertical partitioning separates the data associated with a single table and groups it into frequently accessed and rarely accessed. The pattern chosen allows the database and database cache to manage less information at once. In some cases, data patterns may be selected to move data across multiple filesystems for parallel reading and therefore increased performance.
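One hedged sketch of a hash-based sharding scheme (the shard count and key naming are assumptions for illustration, not a production design):

```python
import hashlib

# Route each customer's rows to one of N shards based on a
# stable hash of the shard key.
NUM_SHARDS = 4

def shard_for(customer_id: str) -> int:
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard, so reads find the data again.
assert shard_for("customer-42") == shard_for("customer-42")
shards = {shard_for(f"customer-{n}") for n in range(100)}
print(sorted(shards))  # keys spread across the shards, typically [0, 1, 2, 3]
```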
GDPR
Brewer's theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:[1][2][3]
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response – without guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Think of this as being a riff on "fast/cheap/good"
you get two!
Database systems designed with traditional ACID guarantees in mind such as RDBMS choose consistency over availability, whereas systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.[6]
There are lots of choices that have come into play that are beyond just the technical considerations. Price, availability, what your CEO read in the magazine last week will all contribute to this.
Can your IT department install this?
MySQL does a middling job of everything except being easy to install and administer
So – let's talk about normalizing data
normalizing data has a couple purposes but is not the be all end all of databases
generally however, normalization SOLVES more problems than it creates
Basically normalization exists to help get rid of anomalies in data
This means that the data is the same for all things in all places, and we aren't storing duplication AND POSSIBLY INCORRECT data
What if you spell a name with "ck" in one row and "que" in another?
What if Patrick moves out and I remove all his game data from my database, except for 50 rows I forgot?
This may seem to be a small thing, but small data can build up over time and take up lots more space than you'd expect!
It really is designed to decrease the amount of pain and suffering when iterating on the design of the database
So how would we structure my database application?
atomic values basically means you're storing only ONE value – so you can't do two telephone numbers in a telephone column
now the atomic thing is rather interesting, since one could argue that dates or strings can be "decomposed" – which is the definition of atomic. In current usage, "atomic" basically means "not xml or json or some other representation of complex data" … or it's simply ignored
This basically means that every table should be related to the primary key of the first table
Partial dependencies are removed, i.e., all non key attributes are fully functional dependent on the primary key. In other words, nonkey attributes cannot depend on a subset of the primary key.
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key." "so help me Codd".[8]
- That's Edgar F. Codd, who invented the relational model of database management while working for IBM
Requiring existence of "the key" ensures that the table is in 1NF;
requiring that non-key attributes be dependent on "the whole key" ensures 2NF;
further requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF.
And…. no one cares
Now that I've preached on how to normalize databases, I'm going to tell you it's perfectly fine to denormalize
AFTER you've normalized and AS NEEDED
you may find that one or two queries or tables constitutes most of your speed problems and judicious use of denormalization can help
Often you'll see subsets of this as zero or one, only one, one to zero or many
you should be connecting tables that represent entity types
many to many relations are generally done using an association table – the relationship becomes an entity in a table linking them together
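For the games example, the many-to-many between games and players could become a `play` association table like this (sqlite3 again as a stand-in; names are illustrative):

```python
import sqlite3

# The relationship becomes its own table, which also carries the
# per-relationship facts (last played, completed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE game (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE play (
    game_id INTEGER REFERENCES game(id),
    player_id INTEGER REFERENCES player(id),
    last_played TEXT,
    completed INTEGER,
    PRIMARY KEY (game_id, player_id));
INSERT INTO game VALUES (1, 'FFX');
INSERT INTO player VALUES (1, 'Liz'), (2, 'Pat');
INSERT INTO play VALUES (1, 1, '2016-05-01', 1), (1, 2, '2016-06-04', 0);
""")
rows = conn.execute("""SELECT player.name, play.last_played, play.completed
                       FROM play JOIN player ON play.player_id = player.id
                       WHERE play.game_id = 1 ORDER BY player.name""").fetchall()
print(rows)  # [('Liz', '2016-05-01', 1), ('Pat', '2016-06-04', 0)]
```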
states two letter abbreviation
So on to the last part of being a dba – that usually comes after you have stuff written
You have to optimize it!
but what does it mean to "optimize" your database
What does "fast" mean for a database?
the answer is always – it depends
Are you focused on your data always being correct? or on fast load times? or on small storage space?
As in all things you're not always going to be able to optimize for all things
Usually faster is going to mean you are storing more on disk – via caching or a denormalized layout or something else
Usually correct data is going to come about by making things less concurrent and more robust – more checks (hence… slower)
Usually small size means you're storing as little as possible in a very optimized way, which generally means more work for your application
As long as you understand the tradeoffs you can "speed things up"
No matter what you do to optimize you are going to hit physical barriers
Sometimes that means "speeding up your database" means throwing more hardware at the problem
There is a finite amount of processing that any system will be able to do. So the solution may be two systems instead
Most of this section tends to go to a bit of "no brainer" land
You want your db to go faster?
keep your software up to date
those are both "easy" in theory but possibly "expensive" in practice
But building in a cadence of upgrading systems will keep you and your users happier
Tune your database management system – that sounds "easy" as well but is made more difficult by the fact that each vendor has its own requirements for tuning
But generally this is a process of checking your vendor for best practices and benchmarking for memory allocation, caches, concurrency settings (like reserving processors or memory) and fiddling with network protocols
maintenance tasks can involve things like vacuuming postgresql dbs or defragmentation, statistics updates, adjust the size of transaction logs and rotate and offload logging
I had an SQL Server system running like a dog
a 50GB transaction log from a migration will do that to you
This should be last on your list. And don't just guess, actually check which queries are slow. Almost every database has a way to log slow queries
And most frameworks and db abstraction layers have logging and timing functionality to catch exceptionally slow queries
The biggest issue with refactoring data is the possibility of data loss
so most people tend to shy away from large data refactors EVEN if a data refactor would cut their code in half
This is a fallacy – think about the word refactoring – it's a small change to the database schema that improves its design without changing its semantics
The #1 issue with database refactoring is COMMUNICATION BETWEEN THOSE RESPONSIBLE FOR THE CODE AND THOSE RESPONSIBLE FOR THE DATABASE
code refactorings only need to maintain behavioral semantics while database refactorings also must maintain informational semantics
Database refactoring does not change the way data is interpreted or used and does not fix bugs or add new functionality. Every refactoring to a database leaves the system in a working state, thus not causing maintenance lags, provided the meaningful data exists in the production environment.
These are generally some of the easiest and most effective refactors you can do on a database
Discuss briefly how each thing could help with making your application better
lookup table is easy
Standard code would be making sure the same country/state codes as those in a lookup table are used
standard type would be making sure all phone numbers are the same sized integer
make sure your column constraint gives you logical values – like age should be > 0 but less than 200
make sure all your phone numbers are stored as integers with no separator values
Most of these will require two steps
change the code to make sure the values are checked properly before coming in
Run a migration on the data to make sure the values are correct
Change the database if necessary
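A sketch of those steps for the phone-number case, using sqlite3 (SQLite can't add a CHECK constraint to an existing table, so the last step rebuilds the table; most other engines can ALTER in place; names and the digits-only format are illustrative):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany("INSERT INTO contact VALUES (?, ?)",
                 [(1, "555-123-4567"), (2, "(555) 987 6543")])

# Migration pass: strip separators so every existing phone is digits only.
for row_id, phone in conn.execute("SELECT id, phone FROM contact").fetchall():
    conn.execute("UPDATE contact SET phone = ? WHERE id = ?",
                 (re.sub(r"\D", "", phone), row_id))

# Change the database: enforce the format going forward with a CHECK.
conn.executescript("""
CREATE TABLE contact_new (id INTEGER PRIMARY KEY,
                          phone TEXT CHECK (phone NOT GLOB '*[^0-9]*'));
INSERT INTO contact_new SELECT * FROM contact;
DROP TABLE contact;
ALTER TABLE contact_new RENAME TO contact;
""")
phones = conn.execute("SELECT phone FROM contact ORDER BY id").fetchall()
print(phones)  # [('5551234567',), ('5559876543',)]
```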
These are also less "lossy" types of refactoring but tend to improve the quality of the data being stored
by element here I mean
Table
View
Column
this is the "hard" problems
The changes that might make your code much nicer, but require a good deal of work
And without tests!! and backups!! this can bite you
The best thing to do in this case is make SMALL changes a little at a time
AND TEST
These are generally large changes to the actual architecture of the application, not just to the relationships or the data or the structure
These are changes that can have the greatest impact on performance
There are a lot of places to learn more about databases. But the really BEST way to learn is to DO
play around with a new system. Think of how you'd redo your present storage mechanism if you could
It might lead to actually being able to do it for real
Aurora Eos Rose is the handle I've had forever – Greek and Roman goddesses of the dawn, and Aurora Rose from Sleeping Beauty