"NoSQL Databases: An Overview", Lena Wiese, Research Group Knowledge Engineer at Göttingen University
YouTube Link: https://www.youtube.com/watch?v=RDXgBzAaF4g
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
About the author:
Dr. Lena Wiese is head of the research group Knowledge Engineering and lecturer at the Georg August University Göttingen. She has been teaching advanced courses on data management and database technology for several years at both graduate and undergraduate level. She is the author of the book "Advanced Data Management for SQL, NoSQL, Cloud and Distributed Databases" (DeGruyter, 2015).
1. { }
Knowledge
Engineering
K9
Georg-August-Universität Göttingen
Institut für Informatik
NOSQL Databases – An Overview
Dr. Lena Wiese
Institut für Informatik
Research Group Knowledge Engineering
Fakultät für Mathematik und Informatik
Georg-August Universität Göttingen
Data Natives 2015
Dr. Lena Wiese NOSQL DBs 1 / 49
Short CV Dr. Lena Wiese
University of Göttingen (Research Group Leader Knowledge Engineering)
University of Hildesheim (Visiting Professor for Databases)
National Institute of Informatics, Tokyo, Japan
Robert Bosch India Ltd., Bangalore, India
Master/PhD: TU Dortmund
Teaching and Research
NoSQL databases (lecture, seminars, projects)
Database security (encryption for Cassandra and HBase)
Web: http://wiese.free.fr/
Book / Workshop
Master’s level textbook (in English):
Lena Wiese: Advanced Data Management for SQL, NoSQL, Cloud and Distributed Databases
© 2015 DeGruyter/Oldenbourg
http://wiese.free.fr/adm.html
(several pictures in this talk taken from the book)
Workshop NoSQL-Net
organized jointly with Irena Holubova (Charles University, Prague) and Valentina Janev (Instituto Mihailo Pupin, Belgrade)
3rd edition to be organized at DEXA conference 2016
We welcome industrial application papers
Overview
1 Introduction
Content
New Requirements
2 Graph Databases
3 XML Databases
4 Key-Value Stores
5 Document Stores
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
Content
SQL
Tabular row-wise storage: Relational Databases (RDBs)
Query Language: SQL
versus
NOSQL (Not Only SQL)
Graph Databases
XML Databases
Key-value Stores
Column Stores
Bigtable Databases
Object Databases and Object-Relational Databases
...
What is a Database System?
A database system is required to manage huge amounts of data in an efficient, persistent, reliable, consistent, non-redundant way for multiple users
New requirements
Data are organized in complex structures (example: social networks)
Data are constantly changing (frequent updates)
Data are distributed on a huge number of interconnected servers (example: cloud storage)
Revival of non-relational data models for novel applications
Overview
1 Introduction
2 Graph Databases
Background
Graph Management
Systems
3 XML Databases
4 Key-Value Stores
5 Document Stores
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
Why Graph Databases?
Links between data items are important
Application areas: Social Networks, Recommender Systems, Semantic Web, Geographic Information Systems, Bioinformatics, ...
Example (social network): vertices (Name: Alice, Age: 34), (Name: Bob, Age: 27), (Name: Charlene, Age: 29); edges labeled knows, knows, dislikes
Example (geographic information system): vertices (City: Hannover, Population: 522T), (City: Hildesheim, Population: 102T), (City: Braunschweig, Population: 248T); edges weighted 35km, 45km, 65km
Property Graph Model
A Property Graph is a directed multigraph
Stores information (properties) in vertices and on edges
A Property is a key-value pair like “Name: Alice”
Sometimes multi-value properties: one key, list of values
For vertices and edges: predefined property key called Id with unique identifier value
Example: vertices (Id: 1, Name: Alice, Age: 34), (Id: 2, Name: Bob, Age: 27), (Id: 3, Name: Charlene, Age: 29); edges Id: 4, Id: 5, Id: 6
Property Graph Model: Types and Labels
In general, not typed: vertices and edges store arbitrary key-value pairs
However, typing gives semantics to the stored data
For vertices: predefined property key called Type; like “Type: Person”
For edges: predefined property key called Label; optionally: specify
which edge label can be used between which vertex types
Example: vertices (Id: 1, Type: Person, Name: Alice, Age: 34), (Id: 2, Type: Person, Name: Bob, Age: 27), (Id: 3, Type: Person, Name: Charlene, Age: 29); edges (Id: 4, Label: knows), (Id: 5, Label: knows), (Id: 6, Label: dislikes)
Property Graph Model: Multiedges
Multiedges are only allowed when edge labels are different
Example: vertices (Id: 1, Type: Person, Name: Alice, Age: 34), (Id: 2, Type: Person, Name: Bob, Age: 27), (Id: 3, Type: Person, Name: Charlene, Age: 29); edges (Id: 4, Label: knows), (Id: 5, Label: knows), (Id: 6, Label: dislikes); an additional edge (Id: 7, Label: dislikes) parallel to the existing dislikes edge is not allowed
Property Graph Model: Paths
Paths are serial concatenations of edges
End vertex of one edge is start vertex of next edge on the path
Path “friends-of-friends” concatenates two edges with “Label: knows”
Paths can be used as normal edges
Example: vertices (Id: 1, Type: Person, Name: Alice, Age: 34), (Id: 2, Type: Person, Name: Bob, Age: 27), (Id: 3, Type: Person, Name: Charlene, Age: 29); edges (Id: 4, Label: knows), (Id: 5, Label: knows); the two knows edges together form the path friends-of-friends
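The property graph model from the preceding slides can be sketched in a few lines of Python. This is only an illustration, not the API of any graph database: the class and method names are hypothetical, and the edge directions (Alice knows Bob, Bob knows Charlene, Charlene dislikes Alice) are an assumption, since the slides do not state them.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    id: int
    label: str
    source: int  # Id of the start vertex
    target: int  # Id of the end vertex
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Directed multigraph with key-value properties on vertices and edges."""

    def __init__(self):
        self.vertices = {}  # vertex Id -> property dict
        self.edges = {}     # edge Id -> Edge

    def add_vertex(self, vid, **properties):
        self.vertices[vid] = {"Id": vid, **properties}

    def add_edge(self, eid, label, source, target, **properties):
        # Multiedges are only allowed when the edge labels are different.
        for e in self.edges.values():
            if (e.source, e.target, e.label) == (source, target, label):
                raise ValueError("parallel edge with the same label")
        self.edges[eid] = Edge(eid, label, source, target, properties)

    def neighbors(self, vid, label):
        return [e.target for e in self.edges.values()
                if e.source == vid and e.label == label]

    def friends_of_friends(self, vid):
        # A path: the end vertex of one "knows" edge is the start of the next.
        result = set()
        for middle in self.neighbors(vid, "knows"):
            result.update(self.neighbors(middle, "knows"))
        return result


g = PropertyGraph()
g.add_vertex(1, Type="Person", Name="Alice", Age=34)
g.add_vertex(2, Type="Person", Name="Bob", Age=27)
g.add_vertex(3, Type="Person", Name="Charlene", Age=29)
g.add_edge(4, "knows", 1, 2)     # assumed direction
g.add_edge(5, "knows", 2, 3)     # assumed direction
g.add_edge(6, "dislikes", 3, 1)  # assumed direction
```

With these assumed directions, `g.friends_of_friends(1)` follows edges 4 and 5 and reaches Charlene (Id 3).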
Open Source Systems
TinkerPop: http://tinkerpop.incubator.apache.org/
graph processing stack: a set of open source graph management modules
Neo4J graph database http://neo4j.com/
Cypher query language
START alice = (people_idx, name, "Alice")
MATCH (alice)-[:knows]->(aperson)
RETURN (aperson)
HyperGraphDB: http://www.hypergraphdb.org/
Graph may contain hyperedges that combine more than two nodes
Overview
1 Introduction
2 Graph Databases
3 XML Databases
Background
Numbering Schemes
Systems
4 Key-Value Stores
5 Document Stores
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
XML
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Intended as a document markup language (not a database language)
Tags divide documents into sections
Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with
matching </tagname>
Inside an element:
arbitrary text
other elements (“nesting”)
Nothing (“empty element”): abbreviate to <tagname />
Standardized query languages: XPath and XQuery
Tree Model of XML Data
<reservationsystem>
<hotel hotelID="h1">
<name>
Buergermeisterkapelle
</name>
<location>
Hildesheim
</location>
<pricesgl>
65 Euro
</pricesgl>
</hotel>
</reservationsystem>
Tree nodes (numbered 0 to 8): 0 reservationsystem; 1 hotel; 2 hotelID: h1 (attribute); 3 name; 4 Buergermeisterkapelle; 5 location; 6 Hildesheim; 7 pricesgl; 8 65 Euro
Numbering Scheme
assigns each node of an XML tree a unique identifier (a label or node
ID which is usually a number)
Important for database application with frequent updates:
How many nodes have to be renumbered in an update?
simplest scheme: preorder traversal of tree
increasing a counter for each node:
root node is numbered as the first node before numbering any other
node
this is done recursively for all child nodes
Renumbering: all nodes in the worst case
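The preorder scheme can be tried out with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative only: treating attributes and text content as numbered nodes is an assumption, and the inserted `stars` element is hypothetical. It shows how a single insertion near the front of the document forces most existing nodes to be renumbered.

```python
import xml.etree.ElementTree as ET

XML = """<reservationsystem>
  <hotel hotelID="h1">
    <name>Buergermeisterkapelle</name>
    <location>Hildesheim</location>
    <pricesgl>65 Euro</pricesgl>
  </hotel>
</reservationsystem>"""


def preorder_number(root):
    """Return (number, label) pairs; a counter is increased for each node."""
    numbered = []

    def visit(label):
        numbered.append((len(numbered), label))

    def walk(elem):
        visit(elem.tag)                      # the node itself comes first
        for name, value in elem.attrib.items():
            visit(f"@{name}={value}")        # assumption: attributes are nodes
        text = (elem.text or "").strip()
        if text:
            visit(text)                      # assumption: text is a node
        for child in elem:                   # then recurse into the children
            walk(child)

    walk(root)
    return numbered


root = ET.fromstring(XML)
before = preorder_number(root)

# Insert a hypothetical new element as the first child of <hotel>.
stars = ET.Element("stars")
stars.text = "4"
root.find("hotel").insert(0, stars)
after = preorder_number(root)

before_map = {label: n for n, label in before}
after_map = {label: n for n, label in after}
changed = [label for label, n in before_map.items() if after_map[label] != n]
```

Here `changed` lists every pre-existing node whose number shifted because of the insertion; only the nodes visited before the insertion point keep their numbers.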
Open Source Systems
eXistDB: http://exist-db.org/
numbering scheme that virtually expands the tree into a complete tree
such that not all node IDs correspond to existing nodes
eXistDB offers several user APIs: RESTful API, XML:DB API,
XML-RPC API, SOAP API
BaseX: http://basex.org/
Numbering scheme: Pre/Dist/Size
Several language bindings as well as a REST API, an XQJ API and an
XML:DB API
Overview
1 Introduction
2 Graph Databases
3 XML Databases
4 Key-Value Stores
Background
Systems
MapReduce
Systems
5 Document Stores
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
Key-Value Stores
A key-value pair is a tuple of two strings ⟨key, value⟩
You can get (or delete) a value from the store by key
Schema-less: you can put arbitrary key-value pairs into the store
value = store.get(key)
store.put(key, value)
store.delete(key)
Values can have other data types than just strings
Values can even be a list or array of atomic values
Simple but quick
Simple data structure
No advanced query language
Good for “data-intensive” applications
Application is responsible for combining key-value pairs into more
complex objects
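A minimal in-memory sketch of the get/put/delete interface, assuming a plain Python dictionary as the storage backend. The key naming scheme (`user:1:name`) and the helper `load_user` are hypothetical; they only illustrate how the application, not the store, assembles a complex object from several key-value pairs.

```python
class KeyValueStore:
    """Schema-less store: arbitrary keys map to arbitrary values."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)  # None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1:name", "Alice")
store.put("user:1:age", 34)
store.put("user:1:phones", [935279, 908077])  # a value may be a list


def load_user(store, user_id):
    # The application combines several pairs into one object.
    prefix = f"user:{user_id}:"
    return {f: store.get(prefix + f) for f in ("name", "age", "phones")}


user = load_user(store, 1)
```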
Open Source Systems
Redis: http://redis.io/
in-memory key-value store
data types: string, linked lists, unsorted set, sorted set, hash, bit array,
hyperloglog
Riak-KV: http://basho.com/products/riak-kv/
key-value pairs called Riak objects grouped into buckets
convergent replicated data types (CRDTs)
Riak’s search functionality based on Apache Solr (Yokozuna)
MapReduce
Applied at Google
Jeffrey Dean / Sanjay Ghemawat, “MapReduce: Simplified Data
Processing on Large Clusters”, OSDI’04: Sixth Symposium on
Operating System Design and Implementation, 2004.
“The computation takes a set of input key/value pairs, and produces
a set of output key/value pairs. The user of the MapReduce library
expresses the computation as two functions: Map and Reduce.”
Four basic steps
1 split input key-value pairs into disjoint subsets
2 compute map function on each input subset
3 group all intermediate values by key (shuffle)
4 reduce values of each group
MapReduce: Example
split and map:
  server1 (sentence1, sentence2, sentence3): (word3,1) (word4,1) (word3,1) (word1,1)
  server2 (sentence4, sentence5): (word1,1) (word2,1) (word4,1)
  server3 (sentence6, sentence7): (word3,1) (word1,1)
shuffle and reduce:
  server4: (word1,(1,1,1)) → (word1,3); (word2,(1)) → (word2,1)
  server5: (word3,(1,1,1)) → (word3,3); (word4,(1,1)) → (word4,2)
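The four steps of the word-count example can be imitated in a single Python process. This is a sketch only: a real MapReduce framework distributes the subsets over servers, whereas the function names and the round-robin split used here are assumptions made for illustration.

```python
from collections import defaultdict


def split(records, n_subsets):
    # Step 1: split the input into disjoint subsets (round-robin here).
    return [records[i::n_subsets] for i in range(n_subsets)]


def map_fn(sentence):
    # Step 2: emit one (word, 1) pair per word occurrence.
    return [(word, 1) for word in sentence.split()]


def shuffle(pairs):
    # Step 3: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_fn(key, values):
    # Step 4: reduce the values of each group to a single count.
    return key, sum(values)


def word_count(sentences, n_subsets=3):
    intermediate = []
    for subset in split(sentences, n_subsets):
        for sentence in subset:
            intermediate.extend(map_fn(sentence))
    return dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())


counts = word_count(["word3 word4", "word3 word1", "word1 word2 word4"])
```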
Open Source Systems
Apache Hadoop: http://hadoop.apache.org/
Hadoop Distributed File System (HDFS)
Apache Spark: http://spark.apache.org/
data flow programming model on top of Hadoop
Apache Pig: http://pig.apache.org/
express parallel execution of data analytics tasks.
input={('alice',{'charlene','emily'}),
('bob',{'david','emily'})};
output = FOREACH input GENERATE $0, FLATTEN($1);
Apache Hive: http://hive.apache.org/
querying and data management layer
can serialize tables as files in HDFS
HiveQL queries are compiled into Hadoop MapReduce tasks
Overview
1 Introduction
2 Graph Databases
3 XML Databases
4 Key-Value Stores
5 Document Stores
Background
Systems
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
JSON: JavaScript Object Notation
human-readable text format
more compact than XML
nesting of key-value pairs
{
"firstName":"Alice",
"lastName" :"Smith",
"age":31,
"address" :{
"street":"Main Street",
"number":12,
"city":"Newtown",
"zip":31141
} ,
"telephone":[935279,908077,278784]
}
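As a small illustration of how such a nested document is consumed, the example above can be parsed with Python's standard `json` module. The variable names are arbitrary.

```python
import json

text = """
{
  "firstName": "Alice",
  "lastName": "Smith",
  "age": 31,
  "address": {
    "street": "Main Street",
    "number": 12,
    "city": "Newtown",
    "zip": 31141
  },
  "telephone": [935279, 908077, 278784]
}
"""

person = json.loads(text)            # JSON objects become Python dicts
city = person["address"]["city"]     # nested key-value pairs
first_phone = person["telephone"][0] # JSON arrays become Python lists
```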
Open Source Systems
MongoDB: https://www.mongodb.org/
BSON storage format (binary JSON representation)
db.persons.find({age: {$lt: 34}})
CouchDB: http://couchdb.apache.org/
retrieval process with views defined as map function
function(doc) {
if(doc.lastname && doc.age) {
emit(doc.lastname, doc.age);
}
}
Couchbase: http://www.couchbase.com
SQL-like query language
Overview
1 Introduction
2 Graph Databases
3 XML Databases
4 Key-Value Stores
5 Document Stores
6 Column Stores
Background
Column Compression
Systems
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
Why Column Stores?
A row store is a row-oriented relational database
Data are stored in tables
On disk, data in a row are stored consecutively
Currently used in most commercially successful RDBMSs
A column store is a column-oriented relational database
Data are stored in tables
On disk, data in a column are stored consecutively
In use since the 1970s but less successful than row stores
Example
Table BookLending:
BookID  ReaderID  ReturnDate
123     225       25-10-2011
234     347       31-10-2011
Storage order in row store:
123,225,25-10-2011,234,347,31-10-2011
Storage order in column store:
123,234,225,347,25-10-2011,31-10-2011
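The two storage orders can be reproduced with a short sketch. It only models the order in which values are laid out, not an actual on-disk format; the function names are hypothetical.

```python
rows = [
    (123, 225, "25-10-2011"),
    (234, 347, "31-10-2011"),
]


def row_store_order(rows):
    # Values of one row are stored consecutively.
    return [value for row in rows for value in row]


def column_store_order(rows):
    # zip(*rows) transposes the table, so values of one column are consecutive.
    return [value for column in zip(*rows) for value in column]


row_layout = row_store_order(rows)
column_layout = column_store_order(rows)
```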
Advantages of Column Stores
Only columns (attributes) that are needed are read from disk into
main memory, because a memory page contains only values of a
column
Values in a column (that is, values of the same attribute domain) can
be compressed better when stored consecutively (“locality”)
Iterating or aggregating over values in a column can be done quickly,
because they are stored consecutively
For example, summing up all values in a column, finding the average,
maximum...
Adding new columns to a table is easy
Column Compression
Columns may contain lots of repetitions of values
Compression can be more effective on columns
Option 1: run-length encoding
run-length: how many repetitions of a value are stored consecutively?
Option 2: bit-vector encoding
create a bit vector for each value in the column
Option 3: dictionary encoding
create a dictionary for single values or sequences of values
Option 4: frame of reference encoding
store offset from a reference point
Option 5: differential encoding
store offset from previous value
Stavros Harizopoulos / Daniel Abadi / Peter Boncz,
“Column-Oriented Database Systems”, VLDB Tutorial, 2009
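Options 2 to 5 can each be written as a one-function sketch. These are simplified illustrations under assumptions: the dictionary variant encodes single values only, the reference point is passed in by the caller, and the sample column is made up for the demonstration. Real column stores use more compact physical layouts.

```python
def bit_vector_encode(column):
    # One bit vector per distinct value: 1 where the value occurs.
    return {v: [1 if x == v else 0 for x in column]
            for v in dict.fromkeys(column)}


def dictionary_encode(column):
    # Replace each value by a small integer code (single values only).
    dictionary = {v: code for code, v in enumerate(dict.fromkeys(column))}
    return dictionary, [dictionary[v] for v in column]


def frame_of_reference_encode(column, reference):
    # Store the offset of each value from a fixed reference point.
    return [v - reference for v in column]


def differential_encode(column):
    # Keep the first value, then store the offset from the previous value.
    return column[:1] + [b - a for a, b in zip(column, column[1:])]


column = [225, 225, 225, 347, 347]  # sample numeric column
bit_vectors = bit_vector_encode(column)
dictionary, codes = dictionary_encode(column)
offsets = frame_of_reference_encode(column, reference=225)
deltas = differential_encode(column)
```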
Example: Run-Length Encoding
Table BookLending (BID = BookID, RID = ReaderID, RD = ReturnDate):
BID  RID  RD
123  225  25-10-2012
386  225  20-10-2012
938  225  27-10-2012
123  347  25-11-2012
234  347  31-10-2012
Store ReaderID (RID) in run-length encoding
count number of consecutive repetitions
format: (value, start row, run-length)
RID: ( (225, 1, 3), (347, 4, 2) )
Answer queries on compressed format
How many books does each reader have?
SELECT RID, COUNT(*) FROM BookLending GROUP BY RID
Just return (the sum of) the run-lengths for each ReaderID value
Result: (225, 3), (347, 2)
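The encoding and the query on the compressed format can be sketched as follows. The function names are hypothetical; the triple format (value, start row, run-length) with rows counted from 1 follows the slide.

```python
def rle_encode(column):
    """Encode a column as (value, start row, run-length) triples."""
    runs = []
    for row, value in enumerate(column, start=1):
        if runs and runs[-1][0] == value:
            v, start, length = runs[-1]
            runs[-1] = (v, start, length + 1)   # extend the current run
        else:
            runs.append((value, row, 1))        # start a new run
    return runs


def count_per_value(runs):
    # GROUP BY with COUNT(*): sum the run-lengths without decompressing.
    counts = {}
    for value, _start, length in runs:
        counts[value] = counts.get(value, 0) + length
    return counts


rid_runs = rle_encode([225, 225, 225, 347, 347])
books_per_reader = count_per_value(rid_runs)
```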
Systems
MonetDB: https://www.monetdb.org/
open source “column store pioneers”
Apache Parquet: http://parquet.apache.org/
implements column striping: transform nested data to columns
Commercial systems
SAP HANA
HP Vertica
Overview
1 Introduction
2 Graph Databases
3 XML Databases
4 Key-Value Stores
5 Document Stores
6 Column Stores
7 BigTable Databases
Background
Storage Organization
Systems
8 Polyglot Data Base Architectures
9 Conclusion
Google BigTable
Fay Chang / Jeffrey Dean / Sanjay Ghemawat / Wilson C. Hsieh /
Deborah A. Wallach / Mike Burrows / Tushar Chandra / Andrew
Fikes / Robert E. Gruber, “Bigtable: A Distributed Storage System
for Structured Data”, OSDI, 2006
“A Bigtable is a sparse, distributed, persistent, multi-dimensional
sorted map”
Google BigTable is indexed by a row key, column key, and a
timestamp
Map: (row:string, column:string, time:int64) → string
A Bigtable may have an unbounded number of columns.
Columns are grouped into sets called column families.
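The map signature can be mimicked with a Python dictionary keyed by (row, column, timestamp) triples. This is a toy model, not the Bigtable or HBase API: the class name, its methods, and the `family:qualifier` column names in the usage lines are assumptions for illustration.

```python
class TinyBigtable:
    """Sparse, sorted, multi-dimensional map: (row, column, time) -> value."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the value with the most recent timestamp, if any.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda tv: tv[0])[1]

    def scan(self):
        # Cells in sorted order of row key, column key, timestamp.
        return sorted(self._cells.items(), key=lambda item: item[0])


table = TinyBigtable()
table.put("123", "BookInfo:Title", 1, "Databases")
table.put("123", "BookInfo:Author", 1, "Miller")
table.put("123", "BookInfo:Title", 2, "Databases, 2nd edition")
```

Only cells that were actually written occupy space, which is what makes the map sparse; rows are free to use different columns.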
BigTable & HBase Data Structure
Store data that is accessed together in a column family
Columns in a single column family can vary arbitrarily for each row.
Only fetch column families of columns that are required by query
Data locality: Store data in a column family together on disk
Table Library (row key: BID; column families: BookInfo, LendingInfo)
BID  BookInfo                            LendingInfo
123  Title: Databases, Author: Miller    25-10-2012: Mayer, 25-11-2012: Green
386  Title: Algorithms, Author: Jacobs   20-10-2012: Mayer
938  Title: Programming, Author: Brown   27-10-2012: Mayer
234  Title: SQL, Author: Smith           31-10-2012: Green
Writing to memory tables and data files
The most recent writes are collected in a main memory table
(memtable) of fixed size.
All data records written to the on-disk store will only be appended to
the existing records.
Once written, these records are read-only and cannot be modified:
they are immutable data files.
Any modification of a record must hence also be simulated by
appending a new record in the store.
Deletions are treated by writing a new record (tombstone) for a key.
Figure: writes go into the memtable in main memory; the memtable is flushed to disk as sorted files 1 to n
Reading from memory tables and data files
The downside of immutable data files is that they complicate the read
process:
retrieving all the relevant data that match a user query requires
combining records from several on-disk data files and the memtable.
This combination may affect records for different search keys that are
spread out across several data files; but it may also apply to records
for the same key of which different versions exist in different data files.
In other words, all sorted data files have to be searched for records
matching the read request.
Figure: a read combines the memtable and the block buffer in main memory with the sorted files 1 to n on disk
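The write and read paths described on the last two slides can be sketched with a toy store. Everything here is an assumption made for illustration (class name, the tiny memtable limit, tuples standing in for on-disk files); real systems add commit logs, block buffers, and compaction, none of which are modeled.

```python
class TinyLSMStore:
    TOMBSTONE = object()  # marker record written on deletion

    def __init__(self, memtable_limit=2):
        self.memtable = {}       # most recent writes, in main memory
        self.sorted_files = []   # immutable sorted files, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A deletion is just another appended record.
        self.put(key, self.TOMBSTONE)

    def flush(self):
        if self.memtable:
            records = sorted(self.memtable.items(), key=lambda kv: kv[0])
            self.sorted_files.append(tuple(records))  # never modified again
            self.memtable = {}

    def get(self, key):
        # Search newest data first: memtable, then files from new to old.
        sources = [self.memtable]
        sources += [dict(f) for f in reversed(self.sorted_files)]
        for source in sources:
            if key in source:
                value = source[key]
                return None if value is self.TOMBSTONE else value
        return None


lsm = TinyLSMStore(memtable_limit=2)
lsm.put("a", 1)
lsm.put("b", 2)    # memtable is full: flushed to sorted file 1
lsm.put("a", 10)   # newer version of "a" lives in the memtable
lsm.delete("b")    # tombstone; memtable is flushed to sorted file 2
```

After these calls, two versions of `a` exist in different files, and a read has to consult the newest one first.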
Open Source Systems
Apache Cassandra: http://cassandra.apache.org/
column families in a keyspace
CQL: SQL-like query language
INSERT INTO bookinfo (bookid, title, author)
VALUES (1002,'Databases','Miller');
Apache HBase: http://hbase.apache.org/
stores tables in namespaces
tables contain column families
Overview
1 Introduction
2 Graph Databases
3 XML Databases
4 Key-Value Stores
5 Document Stores
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
Polyglot Persistence
Lambda Architecture
Multi-Model Databases
9 Conclusion
Polyglot Persistence
Choose as many databases as needed
Fowler, M.J., Sadalage, P.J.: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence.
Prentice Hall (2012)
Example: Apache Drill http://drill.apache.org/
Apache Drill is inspired by the ideas developed in Google’s Dremel
system
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T.: Dremel: interactive
analysis of web-scale datasets. Proceedings of the VLDB Endowment 3(1-2), 330–339 (2010)
Better introduce an integration layer (instead of pushing the burden of
query handling and database synchronization to the application level)
decomposing queries into several subqueries
redirecting queries to the appropriate databases
recombining the results obtained from the accessed databases
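The three tasks of an integration layer can be sketched with two toy backends. This is a deliberately simplified illustration and not how Apache Drill works: the backends are plain dictionaries, and the request format (one subquery per backend name) is an assumption.

```python
# Two toy "databases" standing in for a key-value store and a graph database.
kv_store = {"user:1": {"name": "Alice"}}
graph = {"Alice": ["Bob"]}


def kv_backend(key):
    return kv_store.get(key)


def graph_backend(name):
    return graph.get(name, [])


class IntegrationLayer:
    def __init__(self, backends):
        self.backends = backends  # backend name -> callable

    def decompose(self, request):
        # Decomposition: one subquery per addressed backend.
        return list(request.items())

    def execute(self, request):
        results = {}
        for backend_name, subquery in self.decompose(request):
            # Redirection: send each subquery to the appropriate database.
            results[backend_name] = self.backends[backend_name](subquery)
        # Recombination: return all partial results as one answer.
        return results


layer = IntegrationLayer({"kv": kv_backend, "graph": graph_backend})
profile = layer.execute({"kv": "user:1", "graph": "Alice"})
```

The application issues one request and never talks to the individual stores directly, which is the point of moving this logic out of the application level.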
Polyglot Persistence
Figure: an integration layer (query decomposition, query redirection, result recombination, synchronization) sits between the access patterns (graph traversal, analytical query, write-heavy transaction, SQL query, REST-based access) and the underlying stores (key-value store, graph database, SQL database, in-memory store)
Lambda Architecture
For real-time / streaming data
Combination of a slower batch processing layer and a speedier stream
processing layer
Speed layer collects only the most recent data and delivers these results
in several real-time views
Batch layer stores all data in an append-only and immutable fashion
in a so-called master dataset and results are delivered in so-called batch
views
Serving layer makes batch views accessible to user queries by
maintaining indexes
User queries answered by merging data from batch views and
real-time views
Open source implementation following the ideas of a lambda
architecture is Apache Druid http://druid.io/ (streaming data
in real-time nodes and batch data in historical nodes)
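The interplay of the layers can be illustrated with a counting example. This is a toy model under assumptions, not Apache Druid's design: events are plain strings, the batch view is a full recount of the master dataset, and the class and method names are hypothetical.

```python
from collections import Counter


class LambdaPipeline:
    def __init__(self):
        self.master_dataset = []     # batch layer: append-only, immutable records
        self.recent = []             # speed layer: events since the last batch run
        self.batch_view = Counter()  # precomputed by the slower batch layer

    def ingest(self, event):
        # The data stream is appended to both layers.
        self.master_dataset.append(event)
        self.recent.append(event)

    def run_batch(self):
        # Recompute the batch view from all data, then reset the speed layer.
        self.batch_view = Counter(self.master_dataset)
        self.recent = []

    def realtime_view(self):
        return Counter(self.recent)

    def query(self, key):
        # Answer by merging the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view()[key]


pipeline = LambdaPipeline()
pipeline.ingest("click")
pipeline.ingest("view")
pipeline.run_batch()      # batch view now covers both events
pipeline.ingest("click")  # only visible through the speed layer
```

A query for `click` sees one occurrence from the batch view and one from the real-time view.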
Lambda Architecture
Figure: the data stream is appended to the master data set (batch layer, batch views 1 to 4) and to the recent data set (speed layer, speed views 1 to 4); the serving layer maintains indexes 1 to 4; batch and speed views are merged to answer queries
Multi-Model Databases
Use a database system that stores data in a single store but provides
access to the data with different APIs according to different data
models
Either support different data models directly inside the database
engine or offer layers for additional data models on top of a
single-model engine
OrientDB http://orientdb.com/
a document API, an object API, and a graph API (Java Graph API is
compliant with Tinkerpop)
extensions of the SQL standard to interact with all three APIs
ArangoDB https://www.arangodb.com/
a graph API, a key-value API and a document API
Query language AQL (ArangoDB query language) resembles
SQL but adds several database-specific extensions to it
Overview
1 Introduction
2 Graph Databases
3 XML Databases
4 Key-Value Stores
5 Document Stores
6 Column Stores
7 BigTable Databases
8 Polyglot Data Base Architectures
9 Conclusion
Conclusion
Many, many other data models than just relational tables
Lots of different query languages (no standards)
Problems with reliability (no long-term experience, open source
development teams)
Which database you choose depends on your needs