Beyond Relational Databases

NoSQL, NewSQL, TimeSeries DB
Grégory Boissinot, January 2015
v3

Objectives
Understand the dominance of relational databases
Know that alternative technologies exist for differing needs
Get enough background on how NoSQL databases work
Be aware of other movements

Presentation Content
RDBMS
Stability
Some RDBMS problems
Unsuitable use cases with RDBMS
NoSQL
Why the emergence of this movement?
Transactions and scalability issues
NoSQL types

Relational databases: maturity already achieved
[Timeline: file-based DBs → hierarchical DBs → network DBs → relational DBs (from 1970)]

RDBMS (Relational Database Management System)
Classic way to store data in the world of enterprise applications
Often used for all database needs
A powerful tool that will be used for many more decades
Providing persistence, concurrency control
Accessible from many programming languages
Mostly standard
Widely understood
The degree of standardisation is enough to keep things familiar
SQL used as an integration mechanism between applications
ACID transactions to modify multiple rows and multiple tables
Atomic, Consistent, Isolated, Durable

RDBMS Schema & Normalization
Relational databases require an explicitly defined schema
A schema is a specification that describes the structure of an object
Data normalization is the process of organizing data into tables in such a way as to reduce the potential for data anomalies (inconsistencies in the data)

Joining process
Data often needs to be read from multiple tables: a join operation is performed on the data.
A join is easy to express in SQL syntax
As tables grow, the join operation takes longer because more data blocks need to be read
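
A minimal sketch of the join described above, using Python's built-in sqlite3 module with an in-memory database. The customer and orders tables, their columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; table names, columns and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        total REAL
    );
    INSERT INTO customer VALUES (1, 'Fabio'), (2, 'Mary');
    INSERT INTO orders VALUES (99, 1, 40.0), (100, 1, 10.0), (101, 2, 25.0);
""")


def orders_with_customer_name(connection):
    """Read from two tables at once: each order joined to its customer."""
    cursor = connection.execute(
        "SELECT o.id, c.name, o.total "
        "FROM orders AS o JOIN customer AS c ON c.id = o.customer_id "
        "ORDER BY o.id"
    )
    return cursor.fetchall()


rows = orders_with_customer_name(conn)
```

Each returned tuple combines columns from both tables, which is why the work grows with the size of the tables being joined.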

RDBMS - Stable for several decades
[Figure: RDBMS stability over time (from the 1980s onward) despite changes in languages, architectures, platforms and processes]

Some RDBMS Problems
SCALE OUT IS HARD (limited scale)
RIGID SCHEMA
IMPEDANCE MISMATCH
BAD COST CONTROL

Relational Model Example
Everything is normalized
No data is repeated in multiple tables.
We have referential integrity
RIGID SCHEMA

Changing a relational database schema is hard
The relational model is a set of structured data: tables with tuples and relations
A tuple is a limited data structure
We can’t use a List or a Map
We can’t nest one tuple within another to get nested records
Promotes data normalization
No data is duplicated
We have referential integrity
Data is modeled independently of its usage
Lets us think of data manipulation as operations that take tuples as input and return tuples
RIGID SCHEMA

A relational database used as an integration DB
Widely used in the 1980s
With a relational database, SQL is used as an integration mechanism between applications
● Simple
● Transactional
● Triggers are available (implementation specific)
Shared database integration style

Relational databases are not designed to run on clusters
With a relational database, scaling means buying a bigger machine
But it’s cheaper and more effective to scale horizontally by buying lots of machines. However, doing so with an RDBMS requires DBA expertise
SCALE OUT IS HARD (with RDBMS)

Difference between the relational model
and the in-memory data structures
A lot of application development effort is spent on mapping data between in-memory data structures and a relational database
IMPEDANCE MISMATCH

Attempts to help map data
OODBMS
ORM (JPA, Hibernate, etc.)
iBATIS
Spring Data
jOOQ
IMPEDANCE MISMATCH

Cost is often difficult to control with a relational database
BAD COST CONTROL
Multiple criteria
● Number of users accessing the database
● Number of servers
● Volume of data

Unsuitable use cases for RDBMS
Unpredictable data (accepts entries of any form and size)
User or session data, logs, sensor data from IoT
Connected data
Social data, recommendation systems
Real-time analytics
Always context-dependent
Performance
Responsiveness

Why NoSQL?
A new challenger for a new world!
There's a huge demand for things other than SQL

NoSQL favors new factors
Scalability
Flexibility
Cost control
Availability
Arrival of the Internet and new web application needs
● Large volume of read and write operations
● Low-latency response time
● High availability

Supporting large volumes of data: an old objective
Several actors have already addressed this in the past
Oracle RAC
SQL Server
New use cases with huge amounts of data
Influence of Google and Amazon (adopters of large clusters)
New NoSQL products
Google → BigTable
Amazon → Dynamo

NoSQL and the BigData Galaxy
A combination of V

NoSQL: a movement
Driven by a set of common characteristics
Open-source
Not using a relational database
Running well on clusters
Schemaless

NoSQL: very ill-defined
Not Only SQL
Polyglot Persistence
M. Fowler’s approach

NoSQL databases types
Key-Value database
Document database
Column Family database
Graph databases

Key-Value database
Based on distributed hash tables
● 3 operations: set, get, delete
Data is kept in RAM (cache) or persisted on SSD or disk (true DB)
Many examples: Ehcache, Memcached, Redis, Amazon DynamoDB, Riak (by Basho), Voldemort, ...
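
A minimal sketch of the three operations, assuming a toy store in which each "node" is just a local dictionary and the key's hash selects the node. The class name TinyKeyValueStore and the key format are hypothetical; this is not the API of any product listed above.

```python
import hashlib


class TinyKeyValueStore:
    """Toy key-value store: the hash of the key picks one of several nodes."""

    def __init__(self, node_count=3):
        # Each "node" is a local dict here; a real store would use remote servers.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def set(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

    def delete(self, key):
        # Deleting a missing key is silently ignored.
        self._node_for(key).pop(key, None)


store = TinyKeyValueStore()
store.set("session:42", b"opaque blob")
value = store.get("session:42")
store.delete("session:42")
missing = store.get("session:42")
```

The value is never inspected by the store, which is what keeps the interface down to set, get and delete.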

Document database
A document is a set of ordered key-value pairs
Any document can differ from all previously inserted documents
⇒ Document databases are designed to accommodate variations in documents within a collection
Collections are groups of similar documents

Document database
Similar to key-value DBs, except that the value is semi-structured, with arbitrary nested data and varying formats
Document DBs enable you to query and filter based on elements within the document
Sharding can be based on a field that is not the key
Secondary indexes on nested columns
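
A small sketch of querying and indexing on a nested field, written in plain Python rather than against a real document database. The helper names (get_path, find, build_index), the dotted-path convention and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def get_path(document, path):
    """Follow a dotted path such as 'address.city'; return None if absent."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, expected):
    """Filter documents on the value found at a (possibly nested) field."""
    return [doc for doc in collection if get_path(doc, path) == expected]


def build_index(collection, path):
    """Secondary index: value found at `path` -> positions of matching documents."""
    index = defaultdict(list)
    for position, doc in enumerate(collection):
        value = get_path(doc, path)
        if value is not None:
            index[value].append(position)
    return dict(index)


customers = [
    {"id": 1, "name": "Fabio", "address": {"city": "Paris"}},
    {"id": 2, "name": "Mary"},
    {"id": 3, "name": "Joe", "address": {"city": "Lyon"}},
]
paris_customers = find(customers, "address.city", "Paris")
city_index = build_index(customers, "address.city")
```

Documents without the field are simply skipped, which is how a collection with varying structures can still be queried and indexed.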

Column-oriented database
Row-based systems are designed to efficiently return data for an entire row
Column-oriented systems are more efficient when an aggregate needs to be
computed over many rows but only for a small subset of all columns of data
Examples: BigTable, HBase, Druid
Cassandra is a hybrid between a key-value and a column-oriented database
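Column-oriented layout (one line per column, each value followed by its row id):
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
Row-oriented layout (one line per row, row id followed by its values):
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;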
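
The two serializations shown on this slide can be reproduced with a short sketch. The meaning given to each column (employee id, last name, first name, salary) is an assumption, since the slide does not name the columns; the function names are hypothetical.

```python
# Sample table from the slide; the column meanings are assumed:
# (row id, employee id, last name, first name, salary).
rows = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones", "Bob", 55000),
]


def row_layout(table):
    """One record per line: all values of a row are stored together."""
    return [f"{rid}:" + ",".join(str(v) for v in values) + ";"
            for rid, *values in table]


def column_layout(table):
    """One column per line: each value is stored next to its row id."""
    column_count = len(table[0]) - 1
    return [",".join(f"{row[i + 1]}:{row[0]}" for row in table) + ";"
            for i in range(column_count)]


def average_salary(table):
    # An aggregate over one column only needs that column's values.
    salaries = [row[4] for row in table]
    return sum(salaries) / len(salaries)


by_row = row_layout(rows)
by_column = column_layout(rows)
```

In the column layout, computing the average salary touches a single line, whereas the row layout forces a read of every record.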

Graph DB
No need to create tables to model many-to-many relations
Instead, relations are modeled explicitly using edges
Several use cases: social graphs, maps, etc.
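
A minimal sketch of a many-to-many relation stored directly as edges, with no join table. The TinyGraph class, the "friends of friends" query and the sample names are hypothetical illustrations, not the API of any graph database.

```python
from collections import defaultdict


class TinyGraph:
    """Undirected graph kept as adjacency sets."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        # The relation is stored on both endpoints; no intermediate table.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def friends_of_friends(self, node):
        """Nodes two hops away that are not already direct neighbours."""
        direct = self.edges[node]
        result = set()
        for friend in direct:
            result |= self.edges[friend]
        return result - direct - {node}


graph = TinyGraph()
graph.add_edge("alice", "bob")
graph.add_edge("bob", "carol")
graph.add_edge("alice", "dave")
suggestions = graph.friends_of_friends("alice")
```

This kind of traversal is the basis of the social graph and recommendation use cases mentioned above.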

NoSQL advantages
Schemaless
Scalability
Rich content
Cost control

Favor Scale-out over Scale-up
Scale up
With an RDBMS, adding CPU or memory raises migration issues, and buying a new server may cause downtime
Scale out
With NoSQL, adding a server often has no impact
NoSQL databases are designed to utilize the servers available in a cluster with minimal intervention by a DBA
Scalability

Flexible schema
Schemaless
Denormalization keeps data that is frequently used together in the same document (embedded document)

NoSQL DBs promote denormalization, which eliminates, or at least reduces, the need for joins
This improves query performance over more normalized models (a join is a costly operation)
Denormalization
Schemaless / schema-free

Aggregate Data Model
A more complex structure than a set of tuples
An aggregate is a collection of related objects that we wish to treat as a unit for data manipulation and management of consistency
A term from Eric Evans’s DDD (Domain-Driven Design)
● We can think in terms of a complex record that allows lists, maps and other data structures to be nested inside it
● We like to update aggregates with atomic operations
RICH CONTENT

Aggregate Data Model Example
● The customer contains a list of billing addresses
● The order contains a list of order items, a shipping address, and payments
● The payment itself contains a billing address for that payment
A single address appears 3 times, but instead of using an id it is copied each time
We like to communicate with our data storage in terms of aggregates
RICH CONTENT

Aggregate Models
A different approach from the relational data model
● Relational databases have no concept of aggregate (they are aggregate-ignorant)
● With aggregates, there is often no need for joins
RICH CONTENT

Aggregate Boundaries
Two aggregates: Customer and Order
Links between aggregates are relationships
Instead of using an id, the same data can be stored several times (e.g. the address)
We can draw our aggregate boundaries differently
// Customer
{
  "id": 1,
  "name": "Fabio",
  "billingAddress": [
    { "city": "Paris" }
  ]
}
// Orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [ .. ],
  "shippingAddress": [ { "city": "Paris" } ],
  "orderPayment": [
    {
      "billingAddress": [ { "city": "Paris" } ],
      ….
    }
  ]
}
RICH CONTENT

Aggregates, the trade-off
Solve the impedance mismatch
Easier to work with on a cluster (unit for replication and sharding)
NoSQL doesn’t support atomicity that spans multiple aggregates
Not suited to every need (e.g. analyzing product sales over the last months)
RICH CONTENT

Aggregate with NoSQL types
Key-Value and Document databases are strongly aggregate-oriented
With key-value DBs
the aggregate is opaque (Blob)
the aggregate can be any type of object
the aggregate is only accessed by the key
With Document DBs, we can see a structure in the aggregate
we define structure on the data
can submit queries based on fields

Aggregates: not a systematic solution
Advanced data denormalization with Redis

NoSQL products are often free of charge
COST CONTROL
The major open-source products are free
No licence fees
No pricing policy based on the number of users
No pricing policy based on the number of servers
Most companies behind NoSQL products provide commercial support and advanced (frequently indispensable) monitoring tools, together with SaaS solutions

Sharding & Replication
Sharding (or partitioning, depending on the product)
● Data is divided into disjoint sets
● To scale out
Replication
● Data is duplicated (on different nodes)
● To ensure high-availability
Both: each shard is replicated

Sharding: benefits and costs
We shard data to allow scale-out
● Scale up means using a more powerful machine
● Scale out means using more machines
Scale out to increase
● The throughput, the total amount of data, or ...
The main cost of sharding comes from distributed locks and transactions
● Giving up transactions (TX) and relying on atomic operations on a single aggregate is one way to achieve linear horizontal scalability

Replication: the way to achieve HA
Replication can be
● Synchronous or asynchronous
○ A trade off between performance and consistency
● Master/slaves or peer-to-peer
○ master/slaves makes locks easier to implement (no distributed locks)
○ peer-to-peer is better for HA (no election when a failure occurs)
Main motivation
● Mostly to achieve “High Availability”

Example 1: sharding and primary/slaves replicas
Copy schema from old
commercial presentation
(page 40, CVAT)

Example 2: Sharding and p2p replicas

Cassandra is well suited for write intensive applications
Mainly because each node performs appends on the file system
Tunable consistency
Focus on Cassandra with P2P architecture

CAP Theorem
A distributed database cannot guarantee consistency (C), availability (A) and partition tolerance (P) all at the same time
Consistency: a read is guaranteed to return the most recent write for a given client
Availability: every request received by a non-failing node in the system must result in a response
Partition tolerance: the system continues to operate despite arbitrary partitioning due to network failures
Also known as Brewer’s theorem

CAP theorem gotchas
Consistent != global state
There are several definitions of consistency. Here it is more about linearizability: finding a point of view (an order of events that respects causality) from which the final state is correct
Availability != Liveness
A failing node does not remove the availability property. But a dead system is not very useful.
Because a read-only system is more convenient, we will prefer “CP” to “CA” for distributed systems.
Networks are not reliable

NoSQL Quorum to the rescue
A quorum is the number of servers that must respond to a read or write operation for the operation to be considered successful.
A big enough quorum is often required to ensure the desired consistency
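
What "big enough" can mean is sketched below, assuming the commonly cited rule that a read quorum R and a write quorum W over N replicas always share a replica when R + W > N. The rule is not stated on the slide; the brute-force check and its function name are illustrative.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """Check every possible read set against every possible write set.

    Returns True only if each read quorum of size r shares at least one
    replica with each write quorum of size w, among n replicas.
    """
    nodes = range(n)
    return all(
        set(read) & set(write)
        for read in combinations(nodes, r)
        for write in combinations(nodes, w)
    )


def majority(n):
    """Smallest quorum size that is more than half of n replicas."""
    return n // 2 + 1


strict = quorums_always_overlap(3, 2, 2)   # R + W > N
relaxed = quorums_always_overlap(3, 1, 1)  # R + W <= N
```

When the quorums overlap, at least one replica contacted by a read has seen the latest successful write; smaller quorums respond faster but may return stale data.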

Availability & Consistency in Distributed Databases
We often sacrifice consistency for scalability, availability or performance
However, many enterprise use cases need (strong) consistency
Eventual consistency
“There may be times when the data is inconsistent”
Eventually consistent means that some replicas might be inconsistent for some period of time but will become consistent at some point

Two Phase Commit (2PC)
A two-phase commit is a transaction that requires writing data to two separate locations
Helps ensure consistency
With 2PC, the DB favors consistency, at the risk of the most recent data not being available for a brief period of time
While the 2PC is executing, transactions take longer. The updated data is delayed until the 2PC finishes (locks are held longer)
Favors consistency over availability
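
A simplified, in-process sketch of the two phases, assuming a coordinator function and participants that vote during a prepare step. The Participant class and two_phase_commit function are hypothetical; timeouts, crashes and recovery logs, which a real protocol must handle, are left out.

```python
class Participant:
    """One of the locations that must accept the write."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.committed = {}
        self.pending = None

    def prepare(self, key, value):
        # Phase 1: vote, and hold the change until told what to do.
        if self.can_commit:
            self.pending = (key, value)
        return self.can_commit

    def commit(self):
        key, value = self.pending
        self.committed[key] = value
        self.pending = None

    def rollback(self):
        self.pending = None


def two_phase_commit(participants, key, value):
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        # Phase 2: everyone voted yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # At least one "no" vote: undo the pending change everywhere.
    for p in participants:
        p.rollback()
    return False


a, b = Participant("a"), Participant("b")
first = two_phase_commit([a, b], "stock", 10)
failing = Participant("c", can_commit=False)
second = two_phase_commit([a, failing], "stock", 5)
```

Between prepare and commit the new value is held but not yet visible, which is the delay described above.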

BASE Transactions for NoSQL
BA: Basically available
S: Soft state
E: Eventually consistent
BA: There can be partial failures in some parts of the distributed system while the rest of the system continues to function
S: Data may eventually be overwritten with more recent data (this property overlaps with eventual consistency)
E: There may be times when the database is in an inconsistent state

Schemaless in depth
Schemaless DBs do not require formal structure specification
It doesn’t make sense to require data modelers to specify all possible document
fields prior to building and populating the database
Caution: schemaless doesn’t mean no schema
Schema is often implicit in the code

Polymorphic Schema
“Polymorphic” is derived from Greek and literally means “many shapes”
Each document can have a different structure
The structure is created dynamically when the document is inserted
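
A short sketch of what an implicit schema looks like in practice: two documents of the same collection have different structures, and the reading code decides which fields it expects. The users collection, its fields and the helper functions are hypothetical.

```python
# Two documents of the same (hypothetical) "users" collection, with different fields.
users = [
    {"id": 1, "name": "Fabio", "email": "fabio@example.org"},
    {"id": 2, "name": "Mary", "phones": ["+33 1 00 00 00 00"],
     "address": {"city": "Paris"}},
]


def contact_of(user):
    """The schema lives here: the code decides which fields it expects."""
    if "email" in user:
        return user["email"]
    phones = user.get("phones", [])
    return phones[0] if phones else None


def city_of(user):
    # A missing nested field is treated as "unknown", not as an error.
    return user.get("address", {}).get("city")


contacts = [contact_of(user) for user in users]
cities = [city_of(user) for user in users]
```

Nothing was declared to the database beforehand, yet every reader of this collection has to agree on these field names.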

Which NoSQL database?
Multiple criteria
- Volume of reads and writes (throughput)
- Tolerance for inconsistent data in replicas
- The nature of relations between entities and how that
affects query patterns
- Availability and disaster recovery requirements
- The need for flexibility in data models
- Latency requirement
- Volume of data

Quiz - NoSQL DB use cases
Applications that use JSON data structures → ?
Frequent small reads and writes along with simple data models → ?
Caching data from relational DBs to improve performance → ?
Applications that are geographically distributed over multiple data centers → ?
Social networking → ?

Additional key-value DB use cases
Backend support for websites with high volumes of reads and writes → Key-Value DBs
Storing large objects such as images and audio files → Key-Value DBs
Tracking transient attributes in a web application, such as a shopping cart → Key-Value DBs

Additional document DB use cases
Applications that use JSON data structures → Document DBs
Tracking variable types of metadata → Document DBs
Storing configuration and user information for mobile applications → Document DBs

Additional column-family DB use cases
Applications with the potential for truly large volumes of data, such as hundreds of terabytes → Column-family DBs
Applications with dynamic fields → Column-family DBs

Additional graph DB use cases
Network and IT infrastructure management → Graph DBs
Recommending products and services → Graph DBs

Quiz - NoSQL DB use cases (answers)
Applications that use JSON data structures → Document DBs such as MongoDB
Frequent small reads and writes along with simple data models → Key-Value DBs such as Redis
Caching data from relational DBs to improve performance → Key-Value DBs such as Redis
Applications that are geographically distributed over multiple data centers → Column-family DBs such as Cassandra
Social networking → Graph DBs such as Neo4j

NewSQL movement
The coexistence of RDBMS and NoSQL features in the same product
NewSQL is a class of modern RDBMSs that seek to provide
The same scalable performance as NoSQL systems for read-write workloads
The ACID guarantees of a traditional relational database system

TimeSeries DB
● Consists of a sequence of values or events changing over time
○ Data is recorded at regular intervals
● Widely used within microservices architectures and with DDD approaches
● Applications
○ Financial: stock price, inflation
○ Biomedical: blood pressure
○ Meteorological: precipitation
● Already several technologies
○ DruidDB
○ InfluxDB
○ Redis

Treat the database as an application database
Responsibility for database integrity is placed in the service
With an application database, the database is accessed only by a single application codebase ⇒ a single team / a single application
Only that team needs to know the database structure
We favor communication between applications through web services
This gives more freedom in choosing a database

Polyglot Persistence
Several DB technologies for a single application
● We use the service wrapping pattern for each DB
● Developers want different APIs for different problems
● Most organizations already have a mix of data storage technologies for different circumstances

Suitable for Microservices Architecture
● Each service manages its own data
○ Data consistency is delegated to the service
● Each service is an independent functional unit

Conclusion
Four factors favor NoSQL usage: scalability, cost, flexibility and availability
RDBMS and SQL are going to continue to exist
The solution is likely to be a hybrid of multiple technologies
The choice always depends on your needs
RDBMS remains a good choice in many scenarios (strong legacy, critical data, etc.)
We are entering a world of Polyglot Persistence
