a) A general introduction of graph databases and OrientDB,
b) Why connected data has more value than just data,
c) How to "have fun" with OrientDB, combining documents with graphs via SQL,
d) A use case on how OrientDB has helped to raise standards in Irish Public Office.
On OrientDB: NOSQL document databases provide an elegant way to deal with data in different shapes, enabling developers to build better, faster products quickly. The main goal of these systems is to find the most efficient way to manage the data itself. With the Big Data Explosion we need to deal with a myriad of highly interconnected information. The challenge now is not only how to store data but how to manage, analyse, traverse and use it within the context of relationships. Graph databases shine at maintaining highly connected data and are the fastest-growing category of database management systems: 2014 registered a 250% increase in adoption, and Forrester Research predicts that more than a quarter of enterprises will be using graphs by 2017. OrientDB combines more than one NOSQL model, offering the unique flexibility of modelling data as either documents or graphs, while incorporating object-oriented programming as a way of encapsulating relationships.
OrientDB: Unlock the Value of Document Data Relationships
1. OrientDB: Unlock the Value of Document Data Relationships
Fabrizio Fortino
@fabriziofortino
11th April 2016
#HUGIreland
@boistartups
2. The world is changing
Unstructured Data
Big Data Explosion
Connected Data
Mobile, IoT
http://destinhaus.com/internet-of-things-the-rise-of-smart-manufacturing/
3. “… starting a new strategic enterprise
application you should no longer be assuming
that your persistence should be relational. The
relational option might be the right one - but
you should seriously look at other alternatives.”
Polyglot Persistence [2011]
Martin Fowler
Rethink how we store data
4. A Polyglot Persistence example
E-commerce Application
Primary Store
+
Financial Data
(RDBMS)
Recommendations
(Graph)
Products Catalog
(Document)
User Sessions
(Key-Value)
ETL Jobs / Data Synchronisation
5. • Hire experts for each database type
• No standards between NOSQL products
• Increased overall complexity
• High TCO
• Write and maintain ETL and data synchronisation
• Hard to refactor
• Testing can be tough
More flexibility, at what price?
8. • First Multi-Model DBMS with a Graph Engine
• Community Edition FREE (Apache v2 License)
• Enterprise Edition (profiler, live monitor, telereporter, etc)
• Vibrant community (≈ 100 contributors, ≈ 15K commits)
• Easy to install and use
• Zero configuration Multi-Master Architecture
• ACID
• Reactive (Live Queries)
OrientDB at a Glance
9. Quite a long journey
[Timeline figure, 1998 to 2016. Milestones shown: Orient ODBMS, first ever ODBMS with index-free adjacency; R&D; OrientDB, first ever multi-model DBMS released as Open Source; OrientDB Enterprise launch. Vertical axis: Downloads / month (0, 200, 1K, 3K, 12K, 70K).]
10. Under the hood
Storage
• Memory: works in memory only (ideal for integration testing)
• PLocal: writes/reads to/from the file system
• Remote: delegates all operations to a remote server
Document API: handles records as documents
Graph API: TinkerPop Blueprints implementation
Object API: POJO to Document mapping
User Application
12. Document API
• Lowest level API
• A document (record) is the unit of storage
• An immutable id (ORID) is automatically assigned to each document
• Documents can contain key-value pairs or nested/embedded documents (no ORID)
• Transaction support (optimistic mode with MVCC)
• Classes are logical sets of documents
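The optimistic mode mentioned above can be illustrated with a small in-memory model. This is only a sketch of the general MVCC idea (every record carries a version, and a write is rejected if the version has moved on); the `RecordStore` class and its method names are hypothetical and are not the OrientDB Document API.

```python
# Illustrative in-memory model only; NOT the OrientDB API.
class ConcurrentModificationError(Exception):
    """Raised when a writer holds a stale version of a record."""


class RecordStore:
    def __init__(self):
        self._records = {}        # rid -> (version, fields)
        self._next_position = 0

    def create(self, cluster_id, fields):
        # The record id is assigned once and never changes afterwards.
        rid = f"#{cluster_id}:{self._next_position}"
        self._next_position += 1
        self._records[rid] = (1, dict(fields))
        return rid

    def load(self, rid):
        version, fields = self._records[rid]
        return version, dict(fields)

    def save(self, rid, expected_version, fields):
        current_version, _ = self._records[rid]
        # Optimistic check: reject the write if someone else saved first.
        if current_version != expected_version:
            raise ConcurrentModificationError(rid)
        self._records[rid] = (current_version + 1, dict(fields))
        return current_version + 1


store = RecordStore()
rid = store.create(12, {"name": "Fabrizio"})

# Two clients load the same version of the document.
version_a, doc_a = store.load(rid)
version_b, doc_b = store.load(rid)

doc_a["city"] = "Dublin"
store.save(rid, version_a, doc_a)          # first writer wins

try:
    store.save(rid, version_b, doc_b)      # second writer holds a stale version
    conflict = False
except ConcurrentModificationError:
    conflict = True
```

No locks are held between load and save; the losing writer is expected to reload the record and retry.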
13. Schema-less, Schema-full or Hybrid?
Schema-less: relaxed model, the type of each field is inferred for each document
Schema-full: strict model, schema with constraints on fields and validation rules
Hybrid: mixed model, schema with mandatory and optional fields with constraints and validation rules
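A rough sketch of how the three modes differ, using a toy validator. The `USER_SCHEMA` layout, the `validate` function and the `strict` flag are all hypothetical and only model the idea; OrientDB defines schemas through its own class and property commands. The sketch assumes that schema-full means fields outside the schema are rejected, while hybrid checks declared fields and lets the rest through.

```python
# Hypothetical schema description; NOT OrientDB's schema API.
USER_SCHEMA = {
    "name": {"type": str, "mandatory": True},
    "age": {"type": int, "mandatory": False, "min": 0},
}


def validate(document, schema, strict=False):
    """Return a list of violations (empty list means the document is valid).

    strict=False models the hybrid mode: declared fields are checked,
    undeclared fields are accepted as they are.
    strict=True models the schema-full mode: undeclared fields are rejected.
    An empty schema with strict=False models the schema-less mode.
    """
    errors = []
    for field, rule in schema.items():
        if field not in document:
            if rule.get("mandatory"):
                errors.append(f"missing mandatory field: {field}")
            continue
        value = document[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"wrong type for field: {field}")
        elif "min" in rule and value < rule["min"]:
            errors.append(f"value below minimum for field: {field}")
    if strict:
        for field in document:
            if field not in schema:
                errors.append(f"undeclared field: {field}")
    return errors


doc = {"name": "Fabrizio", "nickname": "fab"}
schema_less = validate(doc, {})                       # nothing to check
hybrid = validate(doc, USER_SCHEMA)                   # extra field tolerated
schema_full = validate(doc, USER_SCHEMA, strict=True) # extra field rejected
```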
14. • A class can inherit from other classes, creating a tree (similar to RDF Schema)
• A sub-class inherits all the schema fields from its parents
• An abstract class is used as the foundation for other classes (it cannot have records)
• Class hierarchies allow native polymorphic queries
• 1 to 1 mapping with domain objects
Class concept is taken from OOP
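The class rules above can be modelled in a few lines. This is a sketch only: `DocumentClass` and the `Person` / `User` / `Organiser` hierarchy are made-up names for illustration, not OrientDB types.

```python
# Toy model of document classes; names are hypothetical.
class DocumentClass:
    def __init__(self, name, parent=None, abstract=False, fields=()):
        self.name = name
        self.parent = parent
        self.abstract = abstract
        self.own_fields = tuple(fields)
        self.records = []
        self.subclasses = []
        if parent is not None:
            parent.subclasses.append(self)

    def all_fields(self):
        # A sub-class sees every field declared by its ancestors.
        inherited = self.parent.all_fields() if self.parent else ()
        return inherited + self.own_fields

    def insert(self, record):
        if self.abstract:
            raise ValueError(f"{self.name} is abstract and cannot hold records")
        self.records.append(record)

    def select(self):
        # Polymorphic query: records of this class and of all its sub-classes.
        found = list(self.records)
        for sub in self.subclasses:
            found.extend(sub.select())
        return found


person = DocumentClass("Person", abstract=True, fields=("name",))
user = DocumentClass("User", parent=person, fields=("email",))
organiser = DocumentClass("Organiser", parent=user, fields=("meetup",))

user.insert({"name": "Uli"})
organiser.insert({"name": "Fabrizio"})

everyone = person.select()      # includes User and Organiser records
organisers = organiser.select() # only Organiser records
```

Querying the abstract root returns the records of every concrete sub-class, which is the behaviour the slide calls a polymorphic query.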
15. Let’s create a Document
{
  "@rid": "#12:216",
  "@class": "user",
  "name": "Fabrizio",
  "meetups": [
    {
      "name": "HUG Ireland",
      "city": "Dublin",
      "since": "14-03-2014"
    }
  ],
  "details": {
    "@type": "d",
    "@class": "user_details",
    "city": "Dublin",
    "nationality": "IT"
  }
}
@rid: immutable Record ID
@class: logical set
name: property
meetups: array of objects
details: embedded document
16. Let’s create a Document
With a traditional Document DB you have to duplicate your data to some degree. The degree depends on how complex the interdependencies of the application domain are.
OrientDB combines the unique flexibility of documents with the power of graphs to unlock the business value of Document Data Relationships.
17. Graphs: everything old is new again
https://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg
18. What is a Graph Database?
“A Graph Database is any storage system
that provides index-free adjacency”
The Graph Traversal Pattern [2010]
Marco A. Rodriguez
G = (V, E)
Graph
Vertex
Edge
A
19. • Given a User (Fabrizio)
• Find Fabrizio (id=10) in the member table: O(log n)
• Find 18 and 24 (HUG Ireland & Microservices) in the Meetup table: O(log n)
What’s wrong with joins?
User
name      id
Fabrizio  10
Uli       12
John      13
Eddie     88
member
user_id  meetup_id
10       18
10       24
13       18
88       66
Meetup
id  name
18  HUG Ireland
57  AWS Users
24  Microservices
66  Scala
• Joins are computed every time you cross relationships
• Time complexity grows with data: O(log n)
• Joining 3-4 tables with millions of records could create billions of combinations
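To make the per-hop cost concrete, here is a small sketch of the same lookup done with sorted "indexes" and binary search, using the rows from the slide. The list layout and the helper names `index_lookup` and `meetups_of` are illustrative assumptions, not how any particular RDBMS stores its indexes.

```python
from bisect import bisect_left

# Rows kept sorted by their first column, standing in for B-tree style indexes.
member = [(10, 18), (10, 24), (13, 18), (88, 66)]            # (user_id, meetup_id)
meetups = [(18, "HUG Ireland"), (24, "Microservices"),
           (57, "AWS Users"), (66, "Scala")]                 # (id, name)


def index_lookup(rows, key):
    """Binary search, O(log n), for all rows whose first column equals key."""
    i = bisect_left(rows, (key,))
    matches = []
    while i < len(rows) and rows[i][0] == key:
        matches.append(rows[i])
        i += 1
    return matches


def meetups_of(user_id):
    names = []
    # One index search in the member table ...
    for _, meetup_id in index_lookup(member, user_id):
        # ... plus one more index search per relationship crossed.
        for _, name in index_lookup(meetups, meetup_id):
            names.append(name)
    return names


fabrizio_meetups = meetups_of(10)
```

Every relationship crossed costs another search whose time depends on the size of the table, and the work is repeated on each query.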
20. • Given a User (Fabrizio)
• Traverse the member edges to reach HUG Ireland O(1) & Microservices O(1)
• Fabrizio is the index to reach the linked Meetups!
The Graph as an Index
• Every vertex and edge is “hard wired” to its adjacent vertex or edge
• Traversing an edge does not require complex computation, near O(1)
• The traversal time is not affected by the database size
Fabrizio --member--> HUG Ireland
Fabrizio --member--> Microservices
Easier to sketch!
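A minimal sketch of index-free adjacency, where each vertex holds direct references to its own edges. The `Vertex` and `Edge` classes and the `out` helper are hypothetical names for illustration, not the OrientDB or TinkerPop API.

```python
# Toy graph with index-free adjacency; names are hypothetical.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []     # direct references, no global index involved


class Edge:
    def __init__(self, label, source, target):
        self.label = label
        self.source = source
        self.target = target
        source.out_edges.append(self)   # "hard wire" the edge to its vertex


def out(vertex, label):
    """Follow outgoing edges with the given label from a single vertex."""
    return [edge.target for edge in vertex.out_edges if edge.label == label]


fabrizio = Vertex("Fabrizio")
hug = Vertex("HUG Ireland")
microservices = Vertex("Microservices")
scala = Vertex("Scala")         # unrelated vertex, never visited below

Edge("member", fabrizio, hug)
Edge("member", fabrizio, microservices)

names = [vertex.name for vertex in out(fabrizio, "member")]
```

The traversal only touches the edges attached to the starting vertex, so its cost depends on that vertex's degree rather than on how many vertices the graph holds.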
23. Would you believe me if I said you can query documents/graphs with SQL-like syntax?
Show me something now! OK, time for a quick demo.
http://www.sharegoodstuffs.com/2011_12_12_archive.html
25. • Aggressive deadline
• Large amount of data from different sources with different formats
• Messy, dirty data
• Connect records from different sources representing the same thing without a common identifier
• Multi-step traversal of fixed and inferred links to identify disparate entities connected by a path
The challenges
27. • Main Language: Groovy
• Database Type: OrientDB Embedded
• Fuzzy Inference Engine: Duke
• minHash proximity index based on Lucene to avoid Cartesian product
• probabilistic model with configurable statistical algorithms (Levenshtein, NGram, Soundex, Custom, etc) to identify the same entities despite differences
• End-To-End Process Time < 10 min
• Deliverable: Database
• Preset of queries to answer the main questions (analysts are completely independent to add / modify where conditions)
• GraphView to visually search and visualise data
Technical Details
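As an illustration of one of the comparators listed above, here is a plain Levenshtein distance turned into a similarity score. This is a generic sketch and not Duke's implementation or API; the `similarity` normalisation, the `same_entity` helper and the 0.8 threshold are assumptions chosen for the example.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise the distance to a score between 0.0 and 1.0."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_entity(a, b, threshold=0.8):
    # Hypothetical cut-off; a real pipeline would tune this per field.
    return similarity(a, b) >= threshold


likely_match = same_entity("John Smith", "Jon Smith")
unlikely_match = same_entity("John Smith", "Mary Jones")
```

Comparing every record with every other record this way is quadratic, which is why a proximity index is used first to narrow down the candidate pairs.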
28. What people from home perceived
≈ 20K tweets
Top hashtag in Ireland for 24 hours: #rteinvestigates
29. “While we’ve long understood the value of Big Data to better
understand how people interact with us, we’ve noticed an
alarming trend of Big Data envy: organizations using complex
tools to handle “not-really-that-big” Data. Distributed map-
reduce algorithms are a handy technique for large data sets,
but many data sets we see could easily fit in a single node
relational or graph database. Even if you do have
more data than that, usually the best thing to do is
to first pick out the data you need, which can often
then be processed on such a single node”
OK but what about Big Data?
ThoughtWorks Technology Radar, 5 April 2016