Virtuoso, The Prometheus of RDF -- SEMANTiCS 2014 Conference Keynote
1. Virtuoso:
The Prometheus of
RDF-based Relational Data
Management
By Orri Erling
Virtuoso Program Manager
OpenLink Software
2. Linked Data at Dawn
The Promise and the Practice
The Science of Speed
The Structure which Is
Ongoing Research
License CC-BY-SA 4.0 (International).
3. Linked Data Promises
RDF is a generic, minimalistic
model for describing things
RDF has global identifiers and
data is self-describing
URIs may be dereferenceable
RDF is flexible to query,
does not force a single hierarchical view like XML
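As a minimal sketch of the points above, triples can be modeled as plain (subject, predicate, object) tuples and queried from any direction with a wildcard pattern. The example.org URIs, the data, and the `match` helper are illustrative assumptions, not part of the talk or of Virtuoso.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary and data, made up for illustration only.
EX = "http://example.org/"
TRIPLES = [
    (EX + "alice", EX + "name", "Alice"),
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "name", "Bob"),
]


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t


# The same data answers "all names" and "everything about alice"
# without a fixed hierarchy deciding which access path is primary.
names = sorted(t[2] for t in match(TRIPLES, p=EX + "name"))
about_alice = list(match(TRIPLES, s=EX + "alice"))
```

Because subjects and predicates are global identifiers, triples from another source can be appended to the same list without renaming anything.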
4. Linked Data Scenarios
RDF is used because of
schema flexibility
global identifiers
Inference, if present, is usually trivial
Subclass
Sub-property
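Subclass inference of the trivial kind mentioned above amounts to a transitive closure over subclass edges. The following is a sketch under assumed data: the `ex:` class names and the `instances_of` helper are hypothetical, and real stores may implement this quite differently.

```python
from typing import Dict, List, Set

# Hypothetical class hierarchy: Keynote is a Talk, Talk is an Event.
SUB_CLASS_OF: Dict[str, List[str]] = {
    "ex:Keynote": ["ex:Talk"],
    "ex:Talk": ["ex:Event"],
}

# Hypothetical asserted types, one per subject.
TYPES: Dict[str, str] = {"ex:k1": "ex:Keynote", "ex:t1": "ex:Talk"}


def superclasses(cls: str, sub_class_of: Dict[str, List[str]]) -> Set[str]:
    """Return cls plus every class reachable through subclass edges."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(sub_class_of.get(current, ()))
    return seen


def instances_of(cls: str) -> List[str]:
    """Subjects whose asserted type is cls or any subclass of cls."""
    return sorted(s for s, c in TYPES.items()
                  if cls in superclasses(c, SUB_CLASS_OF))
```

Sub-property inference follows the same pattern, with predicates in place of classes.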
5. Where Triples Come From
Relational extracts or web content is converted to
and stored as triples
NLP extraction
New applications with RDF as primary data model
Doing SPARQL against data in RDBs is possible
but is rare and does not deliver the flexibility
6. Linked Data Verticals and Patterns
Publishing: tagging & annotations, evolving vocabularies
Archives: self description, long term identifiers, many
versions of schema
Semantic search: structured, semi-structured, and
full text, all in one
Business intelligence: many sources, ease of adding
sources, no 6 month DW schema change cycle
E-science, often in life sciences: common interchange
format, nano-publications, NLP extracts, different users
cook their data differently, provenance
7. The Hopes and Perceptions
The age of ad hoc
Find insight in any data, when you need it,
from any source, any format
No data warehouse planning cycles; make your own
from the pieces you need, when you need it
Still, data integration remains hard work; quality and
coverage of sources vary
Flexibility may be there, but are performance and
scalability on the level?
8. Yes, But ...
Web and Big Data: Everybody reinvents the triple.
Self-description, long term identifiers, key-value pairs
in many non-RDF use cases
SPARQL and RDF would be the natural, standards-compliant
choice if they beat SQL, information retrieval,
custom big data, key-value, and map-reduce solutions
Is the performance gap intrinsic to linked data, or is it a lack of engineering?
Linked data has unique advantages in breadth of
coverage and expressivity but performance must not lag
behind.
9. What is the RDF Tax?
90% of bad performance comes from
non-optimal query plans
Some comes from indexing too much
(e.g., SQL bulk load with no indices is 50x faster
than the equivalent in RDF with all indexed)
Some comes from string ops on URIs, literals
Some comes from having a join for every attribute.
Vectoring and right plans help, though
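The "join for every attribute" cost can be illustrated with a toy comparison: a row keyed by primary key returns all attributes in one lookup, while a triple index keyed by (subject, predicate) needs one lookup per attribute. This is a sketch only; the dictionaries, the record, and the lookup counter are assumptions standing in for index probes, not a model of Virtuoso internals.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical record stored two ways.
ROW_TABLE: Dict[int, Dict[str, Any]] = {
    1: {"name": "Alice", "city": "Leipzig", "age": 30},
}
TRIPLE_INDEX: Dict[Tuple[int, str], List[Any]] = {
    (1, "name"): ["Alice"],
    (1, "city"): ["Leipzig"],
    (1, "age"): [30],
}


def fetch_from_row(key: int, attrs: List[str]) -> Tuple[Dict[str, Any], int]:
    """One index lookup finds the row; attributes are then read in place."""
    lookups = 1
    row = ROW_TABLE[key]
    return {a: row[a] for a in attrs}, lookups


def fetch_from_triples(subject: int, attrs: List[str]) -> Tuple[Dict[str, Any], int]:
    """Each attribute is its own index lookup, i.e. one self-join per attribute."""
    lookups = 0
    result: Dict[str, Any] = {}
    for a in attrs:
        lookups += 1
        result[a] = TRIPLE_INDEX[(subject, a)][0]
    return result, lookups


row_result, row_lookups = fetch_from_row(1, ["name", "city", "age"])
triple_result, triple_lookups = fetch_from_triples(1, ["name", "city", "age"])
```

Both paths return the same record; the triple path simply pays one probe per attribute, and each probe is also one more join for the optimizer to order.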
10. The Bane of the Triple
When data is stored as triples:
There is structure still but it is harder to exploit.
Schema re-emerges as correlations
More joins make more possible query plans,
bigger errors in plan cost estimation
More joining reduces locality
Lack of schema causes needless indexing;
data takes more space
A URI for everything takes space and time
For the same workload, Virtuoso SQL can also be 2–20x
faster than Virtuoso SPARQL
11. The Question is Raised
LOD2 FP7, now ending: RDF Performance parity with
relational?
SQL is the senior science. Who ignores history is bound
to repeat it
Integral mastery of RDB science is a prerequisite, but
do not forget the subtle twists of schema-less-ness
12. Virtuoso RDF Relational DBMS Leadership
2000–2006, v1.x–4.x: SQL row store with SQL
federation and XML
2007–2008, v5.x–6.x: SPARQL, adapted for RDF quads
with more compression, bitmap indices, special data
types, RDF awareness in query optimization
2009, v6.x: Scale-out cluster-capable
2010–2013, v7.x: Column store, vectored execution, 3x
more space efficient, 10+x more speed
2013: Star Schema benchmark with SPARQL, 100x
MySQL SQL, 0.8x MonetDB SQL
2014: Top of the line SQL analytics, 500 Gtriples,
Structure Awareness
13. Triples Done Right, so?
Column-store techniques are a good fit; index-based
triple storage does not get much better
RAM-only pointer-based techniques can be faster but
cost 10–100x more to scale up
To take RDF to SQL parity, Virtuoso must first be on the
level with the best in SQL
TPC-H is the checklist for mastery of DW and query
optimization; who survives shall not fear
Parity is achieved when running with triples, just like
with tables
14. Structure is Everywhere
CWI in LOD2:
90% of triples in Common Crawl fall into 20 tables
All relational extractions are 100% tables
Even DBpedia is 90% covered by 500 tables, but is
unusually heterogeneous, albeit not very large
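One simplified way to measure this kind of coverage is to group subjects by the exact set of properties they carry and ask what share of all triples falls into the most common groups. The sketch below is an assumption-laden illustration, not the CWI method: the data is invented, and grouping by exact property set is stricter than what a real schema-discovery algorithm would likely use.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: three subjects share {name, email}; the rest are odd ones out.
TRIPLES: List[Triple] = [
    ("a", "name", "A"), ("a", "email", "a@example.org"),
    ("b", "name", "B"), ("b", "email", "b@example.org"),
    ("c", "name", "C"), ("c", "email", "c@example.org"),
    ("d", "title", "T"), ("d", "year", "2014"),
    ("e", "color", "red"),
    ("f", "size", "big"),
]


def coverage(triples: List[Triple], top_k: int) -> float:
    """Share of triples whose subject has one of the top_k property sets."""
    if not triples:
        return 0.0
    props: Dict[str, Set[str]] = defaultdict(set)
    triples_per_subject: Counter = Counter()
    for s, p, _ in triples:
        props[s].add(p)
        triples_per_subject[s] += 1
    # Weight each property set by the number of triples it accounts for.
    triples_per_set: Counter = Counter()
    for s, ps in props.items():
        key: FrozenSet[str] = frozenset(ps)
        triples_per_set[key] += triples_per_subject[s]
    covered = sum(n for _, n in triples_per_set.most_common(top_k))
    return covered / len(triples)
```

On this toy data, `coverage(TRIPLES, 1)` is 6 of 10 triples and `coverage(TRIPLES, 2)` is 8 of 10; each property set plays the role of one candidate table.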
15. The Glorious Dawn:
Structure is the Servant, not the Tyrant
A set of subjects with all the same single-valued
properties is in fact a table.
So, store it as a table
Allow exceptions, e.g., sometimes multiple values,
different values in different graphs, extra properties, etc.
If it is big, it has repeating structure
All RDF semantics are preserved; any triple is possible,
but the common ones are SQL compact and SQL fast
With tables, query optimization returns to SQL
complexity and is much more reliable
So, more tricks from the SQL analytics bag become
safe and applicable
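The idea above can be sketched as a split store: subjects with exactly one value for each declared property become table rows, and everything else stays as overflow triples, so no triple is lost. This is a minimal illustration under stated assumptions, not Virtuoso's design: the `split_store` and `all_triples` names, the data, and the policy of sending an irregular subject entirely to overflow are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: bob has an extra property, carol has two emails.
TRIPLES: List[Triple] = [
    ("alice", "name", "Alice"), ("alice", "email", "alice@example.org"),
    ("bob", "name", "Bob"), ("bob", "email", "bob@example.org"),
    ("bob", "nick", "bobby"),
    ("carol", "name", "Carol"),
    ("carol", "email", "carol@example.org"), ("carol", "email", "c@example.org"),
]


def split_store(triples: List[Triple], table_props: Set[str]):
    """Return (rows, overflow): regular subjects as rows, exceptions as triples."""
    values: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        values[s][p].append(o)
    rows: Dict[str, Dict[str, str]] = {}
    overflow: List[Triple] = []
    for s, by_prop in values.items():
        # A subject is regular if every declared property has exactly one value.
        regular = all(len(by_prop.get(p, [])) == 1 for p in table_props)
        for p, objs in by_prop.items():
            if regular and p in table_props:
                rows.setdefault(s, {})[p] = objs[0]
            else:
                overflow.extend((s, p, o) for o in objs)
    return rows, overflow


def all_triples(rows: Dict[str, Dict[str, str]], overflow: List[Triple]) -> List[Triple]:
    """Reconstruct the full triple set from both parts."""
    out = [(s, p, o) for s, row in rows.items() for p, o in row.items()]
    out.extend(overflow)
    return sorted(out)


rows, overflow = split_store(TRIPLES, {"name", "email"})
```

Here alice and bob land in the table, bob's extra `nick` and all of carol's triples go to overflow, and `all_triples(rows, overflow)` returns the original data. A query over `name` and `email` for regular subjects then reads one row instead of joining two triple lookups.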
16. Gains from Structure Awareness
3+x Load Speed
2x more space efficiency
SPARQL queries against regular data within 10–20% of
SQL speeds
Just declare which properties tend to occur together; no
strict schema-first like with SQL
Later, self configuration
17. The Cycle of Adventure
Rebels: SQL not cool, too rigid,
drop ACID, go key-value, map-reduce,
the triple is all there is,
semantic web
Pioneers: Life on the frontier is
hard, infrastructure missing or bad
Same everyday problems also in
Utopia
Recognizing the objective values,
e.g., schema freedom and
identifiers, no AI. Do the job,
forget dogma
Reconciliation: schema-first and
schema-last converge in structure
awareness
18. Present FP7 Research
LDBC — Transparency and Relevance for
Graph DB, RDF performance
GeoKnow — GeoData is everywhere,
how to carry the planet in your pocket
LOD2 — Where no triple has gone before
(and come back)
Open PHACTS — A Data Platform for
Drug Discovery
19. LDBC - Linked Data Benchmark Council
Rebels: SQL not cool, too rigid, drop ACID,
go key-value, map-reduce, the triple is all
there is, semantic web
Pioneers: Life on the frontier is hard,
infrastructure missing or bad
Same everyday problems also in Utopia
Recognizing the objective values, e.g.,
schema freedom and identifiers, no AI.
Do the job, forget dogma
Reconciliation: Some of the rebel thinking
becomes mainstream, e.g., schema-first and
schema-last converge in structure awareness
20. LDBC, Independent Industry Forum for
Benchmarking
The TPC for the frontiers of database
Bootstrapped in the LDBC FP7, continues
as independent industry association
OpenLink, Ontotext, Neo Technology,
Sparsity as founding members
IBM, Oracle Labs, Systap, SPARQL City
already joined
DB superstars Peter Boncz and Thomas
Neumann as founders and scientific leads
21. LDBC Benchmarks
Social Network
Online — Lookups, updates, analysis of
social environment
Business Intelligence — Spotting trends,
key players, big query
Graph analytics — Community detection,
PageRank, graph metrics
Semantic Publishing
Modeled after the BBC linked data portal,
online lookups, drill downs and updates
22. GeoKnow - The Planet in your Pocket
Ms. Globe and Mr. Cube have a thing
going on:
Mr. Cube: Desiloization ... integrated
metadata ... explicit semantics ...
Ms. Globe: I can feel it ... but are you
man enough? ... you need to show me.
23. Planet Scale Roadmap
Jan 2014:
Virtuoso SPARQL outperforms PostGIS in map lookups
with planet-wide Open Street Map
Virtuoso SQL adds 5x more power
24. Next: Jan 2015
Parity between SPARQL and SQL via structure
awareness
Geospatial data clustering
Graph analytics close to the data — Pregel, Giraph,
etc., in the DB itself
Adding fine-grained geo dimension to LDBC social
network benchmark
25. The LOD2 scaling adventures
Experiments at CWI’s SciLens cluster
Jan 2013: 150 Gtriples (8 x 256GB
RAM)
Aug 2014: 500 Gtriples (12 x 256GB
RAM)
Some trillion-triple claims exist, but
do not detail any query workload
BSBM explore and BI workloads
10x speed gains for BI queries
between 2013 and 2014
Bulk load at 6M triples/s
All done in triples, structure
awareness will go further still
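The figures above invite some back-of-envelope arithmetic. The sketch below assumes decimal gigabytes, and it assumes the 6M triples/s rate would hold across an entire 500 Gtriple load; the slide states neither, and it does not say whether the data fit entirely in RAM. The function names are illustrative.

```python
def ram_bytes_per_triple(nodes: int, gb_per_node: float, gtriples: float) -> float:
    """Cluster RAM divided by triple count (assumes 1 GB = 1e9 bytes)."""
    total_bytes = nodes * gb_per_node * 1e9
    return total_bytes / (gtriples * 1e9)


def load_hours(gtriples: float, triples_per_second: float) -> float:
    """Hours to bulk load at a constant rate (an assumption, not a measurement)."""
    return gtriples * 1e9 / triples_per_second / 3600


jan_2013 = ram_bytes_per_triple(nodes=8, gb_per_node=256, gtriples=150)
aug_2014 = ram_bytes_per_triple(nodes=12, gb_per_node=256, gtriples=500)
full_load = load_hours(gtriples=500, triples_per_second=6e6)
```

Under these assumptions the RAM budget drops from about 13.7 bytes per triple in the 2013 run to about 6.1 in the 2014 run, and a 500 Gtriple load at a sustained 6M triples/s would take roughly 23 hours.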
27. Virtuoso Now
Snapshot of RDF Linked Data customers in the Enterprise:
Data.Gov (U.S. Govt. Open
Linked Data initiative)
Bank of America
Booz Allen Hamilton
Northrop Grumman
Elsevier
French National Library
Samsung
Globo
Daimler Benz
Johnson & Johnson
Bayer
St. Jude Medical
Fujitsu
Syngenta
and many more
28. Virtuoso Availability
Most capabilities as open source
Commercial adds
Cluster scale-out
SQL Federation
Replication (SQL & RDF)
Advanced RDF security; ABAC & RBAC (ACLs)
Wide tables
and more
Up to the minute tech previews via v7fasttrack on
github, e.g., superfast TPC-H implementation
29. Virtuoso Future
Preview of structure-aware RDF store
in fall 2014 via v7fasttrack
Integrated graph analytics framework
Embed complex graph algorithms, e.g., community
detection, shortest path inside SPARQL/SQL
Comparison of SQL and SPARQL for big data analytics
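To make concrete what a graph algorithm such as shortest path computes over triples, here is a plain breadth-first search following one predicate. It is a client-side sketch with invented data and a hypothetical `shortest_path` helper; it does not show how such an algorithm would be embedded in SPARQL or SQL, which the slide leaves open.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical social graph; edges are directed.
TRIPLES: List[Triple] = [
    ("a", "knows", "b"), ("b", "knows", "c"),
    ("a", "knows", "d"), ("d", "knows", "e"), ("e", "knows", "c"),
    ("c", "knows", "f"),
    ("a", "name", "Ann"),  # ignored: different predicate
]


def shortest_path(triples: List[Triple], predicate: str,
                  start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search along one predicate; returns a fewest-hops path or None."""
    adjacency: Dict[str, List[str]] = defaultdict(list)
    for s, p, o in triples:
        if p == predicate:
            adjacency[s].append(o)
    parent: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path: List[str] = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in adjacency.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

Run outside the database like this, every hop means pulling edges across a client boundary; running such algorithms next to the data is the motivation given for an integrated framework.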
30. Linked Data Now
Adoption across major industries
Superior flexibility and time to solution
Dramatic performance gains in the last 5 years
Benchmarking will continue to drive progress, to the
benefit of users and vendors alike
Run circles around most open source SQL in SPARQL:
Virtuoso SPARQL beats MySQL in SSB by 100x
With structure awareness, SPARQL to match the best in
SQL for data warehousing, OLTP
Linked Data no longer a long shot but a technology that
makes sense
31. About OpenLink Software
OpenLink Software is a privately-held company founded in 1992 by its
President & CEO, Kingsley Idehen. The company is an industry acclaimed
technology innovator in the following areas:
ODBC, JDBC, ADO.NET, and
OLE DB compliant Data Access
Drivers for Oracle, Microsoft SQL
Server, Informix, Ingres, Sybase,
Progress, MySQL, and PostgreSQL
High-Performance & Scalable Multi-Model (Relational & Graph)
Database Technology
Data Integration Middleware (Data
Virtualization Technology across a
wide variety of Protocols & Formats)
Socially-enhanced Distributed
Collaborative Applications Platforms
(Weblogs, Wikis, Feed Aggregation
and Syndication, Web File Systems,
Discussion Forums, etc.)
Web Application Server Technology
Linked Data Deployment &
Management
Identity Management
32. Office Locations
USA
OpenLink Software, Inc
10 Burlington Mall Road
Suite 265
Burlington, MA 01803
Tel.: +1 781 273 0900
Fax: +1 781 229 8030
UK
OpenLink Software Ltd.
Airport House
Purley Way
Croydon, Surrey CR0 0XZ
Tel.: +44 (0)20 8681 7701
Fax: +44 (0)20 8681 7702
33. Additional Information
Web Sites
OpenLink Software
YouID – Digital Identity Card (Certificate) Generator
OpenLink Data Spaces – Semantically enhanced Personal & Enterprise Data Spaces &
Collaboration Platform
OpenLink Virtuoso - Hybrid Data Management, Integration, Application, and Identity Server
Universal Data Access Drivers - High-Performance ODBC, JDBC, ADO.NET, and OLE-DB Drivers
LDAP and NetID-TLS – How to use LDAP scheme URIs with NetID-TLS Authentication
Social Media Data spaces
http://www.openlinksw.com/weblog/oerling/ (Orri Erling weblog)
http://kidehen.blogspot.com (Kingsley Idehen weblog)
http://www.openlinksw.com/blog/~kidehen/ (Kingsley Idehen weblog)
https://twitter.com/OpenLink (Twitter)
Hashtags: #LinkedData #SemanticWeb #BigData #RDF (Anywhere).