1. NoSQL databases
STATE OF THE ART
4/14/2017 BY MARKIYAN RIZUN, UNIVERSITÉ LILLE 1, SOFTEAM, EMAIL: MRIZUN@SOFTEAM.FR 1
2. I - Overview
3. What is NoSQL?
4. (typically) NoSQL is …
Non-relational
Distributed
Horizontally scalable
Big data
Performant
Open source
5. Relational VS NoSQL
Property | Relational | NoSQL
Performance for high data volume | Low | High
Horizontal scalability | Complex, error-prone | Simple
Flexibility | Low | High
Consistency | Strong (ACID) | Eventual (BASE)
Indexing | Multiple columns | Single column
Data duplication | Not possible | Allowed
Standard query language | Yes | No
Data model | Single | Multiple
6. II - Models
7. Main NoSQL database models
Key-value
Document
Column
Graph
8. Key-value store. Data model
KEYS | VALUES
Key 1 | Value 1
Key 2 | Value 2
Key 3 | Value 3
9. Key-value store. Characteristics
PROS
Frequent reads / writes
Simple data model
Rapid query execution
CONS
Small reads / writes
Simple data model
Poor query capabilities
14. Column store. Data model
Column Family
Row1 (Row Key1)
Column1: name1 : value1, timestamp1
Column2: name2 : value2, timestamp2
ColumnN: nameN : valueN, timestampN
Row2 (Row Key2)
Column1: name1 : value1, timestamp1
Column3: name3 : value3, timestamp3
ColumnM: nameM : valueM, timestampM
15. Column store. Data model
Super Column Family
Row1 (Row Key1)
SuperColumnX: name1 : value1, timestamp1; …; nameN : valueN, timestampN
SuperColumnY: name1 : value1, timestamp1; …; nameM : valueM, timestampM
16. Column store. Characteristics
PROS
Large volumes of data (in dynamic columns)
Fast queries on columns (usually reads)
CONS
Slow queries on rows (usually writes)
21. Other NoSQL database models
• Multimodel: based on several other models
• Object-oriented: follows OOP principles
• MultiValue: multi-valued attributes
• Time series: optimized to manage time series data
• … and many more
22. Comparison of NoSQL models *
Model | Performance | Scalability | Flexibility | Complexity | Functionality
Key-value | high | high | high | none | variable (none)
Document | high | variable (high) | high | low | variable (low)
Column | high | high | moderate | low | minimal
Graph | variable | variable | high | high | graph theory
Relational | variable | variable | low | moderate | relational algebra
* Summary of a presentation by Ben Scofield: https://www.slideshare.net/bscofield/nosql-codemash-2010
23. Comparison by data size / complexity
[Figure: key-value, column, document and graph models positioned by data size and data complexity]
24. III – Software
25. Criteria for evaluation
Popularity rank *
Data model
Consistency
Availability
Concurrency
Scalability
Querying
* According to the DB-Engines ranking https://db-engines.com/en/ranking (April 2017). Relational DBMSs were discarded.
26. TOP 4 Systems
Rank | System | Model
1 | MongoDB | Document
2 | Cassandra | Column + key-value
3 | Redis | In-memory key-value
4 | Elasticsearch | Document (search engine)
27. Consistency
MongoDB
• Configurable
• Strong by default
Cassandra
• Configurable
Redis
• Eventual
Elasticsearch
• Configurable
• Consistent, with options
28. Availability
MongoDB
• Replicated
Cassandra
• Distributed
Redis
• Replicated
Elasticsearch
• Replicated
High availability
29. Concurrency
MongoDB
• Multi-granularity locking (MGL)
Cassandra
• Multiversion concurrency control (MVCC)
Redis
• Optimistic concurrency control (OCC)
Elasticsearch
• Optimistic concurrency control (OCC)
30. Scalability
MongoDB
• High (automatic data sharding)
Cassandra
• High (automatic addition / removal of nodes in cluster)
Redis
• Poor
Elasticsearch
• High (dynamic sharding on live cluster)
31. Querying
MongoDB
• Internal API (MapReduce)
• Complex query support
Cassandra
• Internal API, CQL (SQL-like)
• Complex query support
Redis
• By key or value range
• Rapid
• No complex queries
Elasticsearch
• Own query language (Query DSL)
• Full text search, filters
32. IV – Geospatial
33. GIS (geographic information system)
35. Idea behind GIS « magic »
Geospatial data + Geohash + API → GIS support
37. Solutions
New document format GeoJSON (MongoDB)
GeoMesa + Apache Spark (Hadoop)
CQL extension (Cassandra)
GeoCouch extension (CouchDB)
Fast I/O in-memory geospatial operations (Redis)
Library Neo4j Spatial (Neo4j)
38. V - Conclusion
Editor's notes
A quick look at NoSQL
NoSQL = Not Only SQL … OK, but what kind of properties does it have?
Normally, with quite a few exceptions, NoSQL systems satisfy the following list.
All of them are non-relational, and one could argue that this is the main difference.
They are distributed, meaning that they run on clusters of machines.
Therefore, they should be horizontally scalable. This means that one can easily add a new node to the cluster without the time-consuming process of restructuring the database.
NoSQL systems are mostly designed for storing massive volumes of data while keeping a high level of performance.
Usually, they are open source.
NoSQL is designed to work with big data and still show high levels of performance. By contrast, relational DBs work well only until they have to deal with large amounts of data.
Unlike NoSQL systems, relational DBs require considerable work to scale horizontally.
Flexibility here means ease of INSERT / UPDATE operations. In the relational case, data must be in a predefined form; in NoSQL, it can be in an arbitrary form.
Relational databases typically guarantee ACID (atomicity, consistency, isolation, durability), whereas NoSQL systems propose the concept of eventual consistency, summarized as BASE (Basically Available, Soft state, Eventual consistency).
There are many differences, but two very important ones are the standard query language and the single data model.
As NoSQL may be presented with a number of varying models, we have to review them.
These are the 4 main models of NoSQL databases that we are going to study in detail. First, key-value (KV) is just a dictionary (a collection of key-value pairs). Next we have the document model, which is a collection of documents in formats such as JSON or XML. A column database consists of column families that contain collections of columns varying in name and size; we will see this later. Finally, the graph model focuses on connections between entities.
The data model is straightforward: a collection of key-value pairs, where each key has only one corresponding value. Keys are used as indexes and values may contain any data.
Query execution is rapid because of the simple model and the use of keys as indexes.
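To make the key-value model concrete, here is a minimal in-memory sketch. The class and method names (`KeyValueStore`, `put`, `get`, `delete`) are hypothetical and do not correspond to the API of any particular system; a plain dictionary stands in for the indexed keys.

```python
# Hypothetical in-memory key-value store; names are illustrative only.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # the key acts as the index

    def put(self, key, value):
        # Each key has exactly one value: a second put overwrites the first.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup by key is the only query this model offers.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"name": "Alice"})
store.put("user:1", {"name": "Bob"})   # replaces the previous value
```

The value is opaque to the store, which is why lookups are fast but queries on the content of values are not available.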
The main elements of these databases are documents, which are hierarchical tree data structures. Each document is identified by an indexed key (a unique identifier that may be a string, URI or path). Information about a given object is stored in a single document, unlike in relational databases, where it is scattered over different tables. Documents may be of different types (JSON, XML, etc.).
Document stores may offer an API that enables users to query documents based on their internal structure and content.
No joins: instead, one has to collect connected data manually.
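The following sketch illustrates the two points above: querying documents by their internal content, and collecting connected data by hand in the absence of joins. The collections, field names and helper functions are hypothetical; nested dictionaries stand in for JSON documents.

```python
# Hypothetical collections: each document is a nested dict stored under a unique key.
customers = {
    "c1": {"name": "Alice", "address": {"city": "Lille"}},
}
orders = {
    "o1": {"customer_id": "c1", "items": ["book", "pen"]},
    "o2": {"customer_id": "c1", "items": ["lamp"]},
}


def find(collection, predicate):
    """Query documents based on their internal structure and content."""
    return [doc for doc in collection.values() if predicate(doc)]


def orders_with_customer(customer_id):
    """Manual 'join': fetch the customer, then attach it to each matching order."""
    customer = customers[customer_id]
    matching = find(orders, lambda d: d["customer_id"] == customer_id)
    return [{**order, "customer": customer} for order in matching]


joined = orders_with_customer("c1")
```

In practice this extra lookup is often avoided by embedding the related data directly in the document, at the cost of duplication.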
The central elements of the database are columns. A column contains a name (unique identifier), a value (the data itself) and a timestamp (which makes it possible to determine whether the content is valid, i.e. up to date). A row consists of a row key and an associated set of columns. A collection of rows forms a column family. Each row of the column family may contain a different number of columns and, additionally, the column names may vary from row to row.
It is also possible to have super columns: columns whose value is a map of columns.
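A rough sketch of this data model with nested dictionaries is given below. The helper names `column` and `put` are hypothetical, and the rule that the newest timestamp wins is an assumption made here to illustrate how a timestamp can decide which content is up to date.

```python
import time


def column(name, value, timestamp=None):
    """A column is a (name, value, timestamp) triple."""
    return {"name": name, "value": value,
            "timestamp": time.time() if timestamp is None else timestamp}


def put(column_family, row_key, name, value, timestamp=None):
    """Write a column into a row; assumption: the newest timestamp wins."""
    row = column_family.setdefault(row_key, {})
    new = column(name, value, timestamp)
    old = row.get(name)
    if old is None or new["timestamp"] >= old["timestamp"]:
        row[name] = new


# Column family: row key -> {column name -> column}; rows may hold different columns.
users = {}
put(users, "key1", "name", "Alice", timestamp=1)
put(users, "key1", "email", "alice@example.org", timestamp=1)
put(users, "key2", "name", "Bob", timestamp=1)
put(users, "key1", "name", "Alicia", timestamp=2)  # newer write replaces the old one
put(users, "key1", "name", "stale", timestamp=0)   # older write is ignored

# Super column family: the value of a super column is itself a map of columns.
super_users = {"key1": {"address": {"city": column("city", "Lille", timestamp=1)}}}
```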
Fast queries on columns: for example, if I'm looking at a database of Sales and I want to see how Price has changed over time, I need to look at the Price field for a lot of records, so it's nice to have those stored together in one column.
Slow queries on rows: by contrast, the queries that a column store doesn't like are something like "show me all the information about a particular Sale" or "add a Sale to the database". Here you want lots of fields, but for a small number of rows (one).
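The Sales example can be sketched by storing the same records in two layouts. The field names and sample values are made up, and Python lists stand in for on-disk storage; the point is only how many structures each kind of query has to touch.

```python
# Row-oriented layout: each sale is stored as one record.
rows = [
    {"id": 1, "product": "pen", "price": 1.5, "date": "2017-01-01"},
    {"id": 2, "product": "pen", "price": 1.7, "date": "2017-02-01"},
    {"id": 3, "product": "pen", "price": 1.6, "date": "2017-03-01"},
]

# Column-oriented layout: each field is stored together in one list.
columns = {field: [r[field] for r in rows] for field in rows[0]}


def price_history():
    """Column query: reads a single list, however many fields a sale has."""
    return columns["price"]


def get_sale(index):
    """Row query: has to visit every column to rebuild one sale."""
    return {field: values[index] for field, values in columns.items()}


def add_sale(sale):
    """Row write: has to append to every column."""
    for field, values in columns.items():
        values.append(sale[field])
```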
This type of database uses graph structures to represent, store and manage data. A graph database has the concepts of edges, nodes and properties. The relationships (edges) link entities (nodes) directly using pointers (unlike in relational databases). Properties can be applied to both nodes and edges, and they help to query data.
Well-suited for network modelling, such as social networks
Graph-like queries such as search for the shortest path between nodes
The use of pointers makes it possible to retrieve connected data in one operation (instead of searching through the data and using join operations, as in the relational approach). This enables rapid and deep traversal of the graph structure
Unlike most other NoSQL models, graph databases typically support ACID properties
Typically no support for data sharding, meaning that all data must be stored on a single server
Hence, poor horizontal scalability
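As an illustration of the graph model and of a typical graph query, the sketch below stores nodes with properties, keeps edges as direct references in adjacency lists, and finds a shortest path with breadth-first search. The sample graph and the function name are hypothetical; real graph databases expose this through their own query languages.

```python
from collections import deque

# Hypothetical social graph: nodes carry properties, edges are direct references.
nodes = {"alice": {"age": 30}, "bob": {"age": 25}, "carol": {"age": 41}, "dave": {"age": 35}}
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def shortest_path(start, goal):
    """Breadth-first search: returns a path with the fewest edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Following an edge is a direct lookup, not a join over tables.
        for neighbour in edges[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None
```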
As we have seen before, there are lots of different systems on the market. Now we will take a look at only a few of them and try to evaluate them. For that we need some criteria…
We selected the top 4 systems, which are … They use the corresponding models …
All systems have configurable consistency, except Redis.
Replicated means that data is divided into several replica sets (shards); usually a master-slave model is used in this case. Distributed means that each node in the cluster is responsible for a given data set.
All of them are highly available.
Concurrency control ensures that concurrent operations produce correct results, while getting those results as quickly as possible. The systems use different methods to achieve this.
MGL locks objects that contain other objects. It exploits the hierarchical nature of MongoDB documents.
MVCC takes a different approach: each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other users of the database until the transaction has been committed.
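The snapshot behaviour of MVCC can be sketched with a store that keeps every committed version of a value. This is a simplified illustration, not the mechanism of any specific database: the class `MVCCStore`, its methods and the integer commit clock are all hypothetical.

```python
class MVCCStore:
    """Toy multiversion store: every commit adds a new version instead of overwriting."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # timestamp of the latest commit

    def begin(self):
        """Start a transaction: its snapshot is everything committed so far."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before the snapshot."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def commit(self, writes):
        """Publish a transaction's writes together as new versions."""
        self._clock += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock


db = MVCCStore()
db.commit({"x": 1})
reader = db.begin()               # snapshot taken before the next write
db.commit({"x": 2})               # a writer commits a newer version
old = db.read("x", reader)        # the earlier reader still sees its snapshot
new = db.read("x", db.begin())    # a new transaction sees the committed change
```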
OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. If the check reveals conflicting modifications, the committing transaction rolls back and can be restarted.
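A minimal sketch of the OCC check described above uses a version number per key: a commit succeeds only if the version is still the one the transaction read. The names `VersionedStore`, `ConflictError` and `increment` are hypothetical and chosen for illustration.

```python
class ConflictError(Exception):
    """Raised when the data was modified by another transaction since it was read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # No lock is taken; the caller just remembers the version it saw.
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise ConflictError(key)  # validation failed: roll back
        self._data[key] = (current_version + 1, new_value)


def increment(store, key, retries=3):
    """Read, compute, validate on commit; restart the transaction on conflict."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return
        except ConflictError:
            continue
    raise ConflictError(key)


store = VersionedStore()
store.commit("counter", 0, 10)
seen_version, _ = store.read("counter")
store.commit("counter", seen_version, 11)      # another transaction commits first
try:
    store.commit("counter", seen_version, 12)  # stale version: rejected
    conflict = False
except ConflictError:
    conflict = True
```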
In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user).
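Taken literally, the description above amounts to scanning the words of every document, as in the sketch below. The function names and sample documents are hypothetical. Note that search engines typically avoid this full scan at query time by building an inverted index in advance; the naive scan is shown only because it mirrors the definition.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def full_text_search(documents, query):
    """Return the ids of documents that contain every word of the query."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in documents.items()
            if terms <= set(tokenize(text))]


docs = {
    "d1": "NoSQL databases scale horizontally",
    "d2": "Relational databases use SQL",
    "d3": "Graph databases store nodes and edges",
}
```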
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
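The classic word-count example shows the shape of the programming model: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. This sketch runs in a single process; the function names are hypothetical, and on a real cluster the map and reduce calls would be distributed over nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(["big data", "big clusters process big data"])
```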
This is of interest to us in the context of the DataBio project, in which Softeam participates.
GIS allows you to record a map with a geospatial referencing system such as longitude and latitude, and then to add layers of other information.
Layers can be linked. Analysis of the information can then be undertaken using the statistical and analytical tools that are provided as part of the GIS. It is possible to provide visual representations of data. These representations can often reveal patterns and trends that might otherwise have gone unnoticed without the use of GIS techniques.
Use cases:
Mapping of data (visual representation of data on map)
Proximity analysis (distance between objects, points, polygons etc.)
Finding clusters
Find nearest
What’s in area?
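Two of the use cases listed above, "find nearest" and "what's in area?", can be sketched with the haversine great-circle distance. The point names and coordinates below are made up for illustration, and the linear scan over all points is an assumption for brevity; a real system would use a spatial index instead.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def find_nearest(origin, points):
    """Find nearest: name of the point closest to the origin."""
    return min(points, key=lambda name: haversine_km(origin, points[name]))


def within_radius(origin, points, radius_km):
    """What's in area: names of all points within radius_km of the origin."""
    return sorted(name for name, p in points.items()
                  if haversine_km(origin, p) <= radius_km)


# Hypothetical positions as (latitude, longitude).
taxis = {
    "taxi_a": (50.6292, 3.0573),
    "taxi_b": (50.6927, 3.1778),
    "taxi_c": (48.8566, 2.3522),
}
customer = (50.6366, 3.0635)
```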
A taxi manager example implemented with GeoMesa, which is used in Hadoop-based NoSQL systems
A NoSQL database must:
Support geospatial data
Store a geohash as an integer index (e.g. a quadtree, R-tree or Hilbert curve index) converted from 2D, 3D or 4D coordinates and time
Provide an API / query language to work with the data
As a result, the NoSQL DB can be used in a GIS
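To show how 2D coordinates can be turned into a single integer index, here is a sketch of the standard geohash encoding, which interleaves longitude and latitude bits (a Z-order curve). This is one possible scheme and an assumption on our part: the notes above also mention quadtree, R-tree and Hilbert curve indexes, and this sketch does not cover 3D, 4D or time. The function names are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_bits(lat, lon, nbits):
    """Interleave longitude and latitude bisection bits into one integer."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    code = 0
    for i in range(nbits):
        # Even bits refine longitude, odd bits refine latitude.
        rng, value = (lon_range, lon) if i % 2 == 0 else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            code = (code << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            code = code << 1
            rng[1] = mid  # keep the lower half
    return code


def geohash(lat, lon, length=6):
    """Text form of the same index: 5 bits per base-32 character."""
    bits = geohash_bits(lat, lon, 5 * length)
    return "".join(BASE32[(bits >> (5 * (length - 1 - i))) & 31]
                   for i in range(length))
```

Because nearby points usually share a prefix, a range scan over the integer (or string) index can serve as a first filter for proximity queries, although points on opposite sides of a cell boundary may be close without sharing a prefix.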