1. Real Time Semantic
Warehousing: Sindice.com
technology for the enterprise
Giovanni Tummarello, Ph.D
Data Intensive Infrastructure UNIT -
DERI.ie
CEO SindiceTech
2. How we started: Sindice.com
80 billion triples, 500,000,000 RDF graphs, 5 TB of data.
The Sindice Suite powers Sindice.com. Online with 99.9%+ uptime.
3. Semantic Sandboxes on: Sindice.com
Data Sandboxes in Sindice.com – Powered by CloudSpaces
4. And then we met people asking:
“Can you do it for us?”
5. Example story (Pharmaceutical company)
To stay competitive, pharmaceutical companies need to leverage all the data available from
internal sources as well as from the growing number of public HCLS data sources. Because this
data is diverse in nature, format, and quality, integration is complex.
Traditional data warehousing technology requires extensive upfront design and is handled
within a company through the “go via the IT department” approach. This does not meet the needs of
data scientists, who are the only ones able to do the complex cross-use-case thinking required.
Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:
• The ability to speed up “in silico” scientific workflows (interrelation of diverse large
datasets) by orders of magnitude by relying on a data warehousing approach.
• The ability to create large-scale “data maps” or “aggregated views” that allow
researchers to see “trends” and gather high-level insights that would not be possible with
data accessed via single lookups.
• The ability to receive recommendations and suggestions for new data connections based on
an ever-evolving ecosystem of available experimental datasets.
• The ability to provide their R&D departments with superior tools for investigating their internal
knowledge: search engines and data browsing tools that provide unified views of multiple,
evolving, live datasets without leaking specific “queries” to the outside world, which would
reveal internal research trends.
• The ability to leverage the ever-increasing body of public, crowd-curated open data.
6. Linked Data clouds for the Enterprise
– Strategic knowledge spaces, where new
databases can be added and “leveraged” with
unprecedented ease
– Integration “pay as you go”: explore now, fine-tune
later.
– It's Big Data (cluster + clouds) meets RDF and
Semantic Technologies
9. A Dataspace Template
A typical implementation template.
[Diagram label: Semantic Web Data]
Dataspaces own:
• Resources
• Services
• Datasets for others to reuse
10. Dataspace Composition
Scalable cascading semantic “Dataspaces”
• Resources allocated in public/private clouds
• Allows getting Sindice data and mixing/processing it for private purposes
12. Scale is only one dimension
Multiple dimensions of Web data integration
• RDF tool stack flexibility
• Cluster processing scalability
• “Cloud” pipeline dynamicity
13. Full JSON-like search.
On Solr.
All operators supported.
14. What is SIREn?
• Plugin to Solr
• Built for searching and operating on
semi-structured data and relational
data structures
15. SIREn: Semantic IR Engine
• Extension to Enterprise Search Engine Solr
• Semantic, full-text, incremental updates,
distributed search
[Figure labels: Semantic, SIREn, Databases, Constant time]
16. Limitations of Apache Solr
• Not efficient with highly heterogeneous
structured data sources
– Limitation on the number of attributes:
Dictionary size explosion
18. Dictionary Size Explosion
Record 1
label Renaud Delbru
name Renaud Delbru
Dictionary
label:renaud
label:delbru
name:renaud
name:delbru
Dictionary construction
Concatenation of attribute name and term
N * M complexity (worst case)
2 attributes * 2 terms = 4 dictionary entries
100K attributes * 1B terms = 100T (10^14) entries
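The dictionary construction described on this slide can be sketched as follows. This is a minimal illustration in Python, not Solr's or Lucene's actual code: the function name `build_dictionary` and the whitespace tokenizer are assumptions made for the example. It only shows why every (attribute, term) pair becomes its own dictionary entry.

```python
def build_dictionary(records):
    """Collect one 'attribute:term' entry per distinct (attribute, term) pair.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split.
    """
    entries = set()
    for record in records:
        for attribute, value in record.items():
            for term in value.lower().split():
                # The attribute name is concatenated with the term.
                entries.add(f"{attribute}:{term}")
    return entries


# Record 1 from the slide: two attributes holding the same two terms.
record_1 = {"label": "Renaud Delbru", "name": "Renaud Delbru"}
dictionary = build_dictionary([record_1])

for entry in sorted(dictionary):
    print(entry)
print(len(dictionary))  # 2 attributes * 2 terms = 4 entries
```

In the worst case every term occurs under every attribute, so the entry count grows as the product of the two.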
19. Limitations of Apache Solr
• Not efficient with highly heterogeneous
structured data sources
– Limitation on the number of attributes:
Dictionary size explosion
Query clause explosion when searching across all
attributes
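A small sketch of the query clause explosion: when the attribute is part of each dictionary entry, a search across all attributes has to be expanded into one clause per attribute. The helper name `expand_across_attributes` and the attribute list are hypothetical; only the `field:term` and `OR` syntax follows the standard Lucene query syntax.

```python
def expand_across_attributes(term, attributes):
    """Build one 'attribute:term' clause per attribute, joined with OR.

    Hypothetical helper for illustration only.
    """
    clauses = [f"{attribute}:{term}" for attribute in attributes]
    return " OR ".join(clauses), len(clauses)


# Assumed attribute names; a heterogeneous dataset may have many thousands.
attributes = ["label", "name", "title", "nick"]
query, clause_count = expand_across_attributes("renaud", attributes)

print(query)         # label:renaud OR name:renaud OR title:renaud OR nick:renaud
print(clause_count)  # 4
```

The clause count equals the number of attributes, so a single-keyword search over a dataset with 100K attributes would need 100K clauses.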
20. Limitations of Apache Solr
• Not efficient with highly heterogeneous
structured data sources
– Limitation on the number of attributes:
Dictionary size explosion
Query clause explosion when searching across all
attributes
• Limited support for structured query
– Multi-valued attributes
21. Multi-valued attributes
• No support in Solr for "all words must match
in the same value of a multi-valued field".
• A field value is a bag of words
– No distinction between multiple values
Record 1
label: "man's best friend", "pooch"
Record 2
label: "man's worst enemy", "friend to no one"
22. Multi-valued attributes
• No support in Solr for "all words must match
in the same value of a multi-valued field".
• A field value is a bag of words
– No distinction between multiple values
• Query example
– label : man’s friend
– Solr returns Record 1 & 2 as results
Record 1
label: "man's best friend", "pooch"
Record 2
label: "man's worst enemy", "friend to no one"
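The difference between the two matching semantics can be sketched in a few lines of Python. This is a toy model of the behaviour described on the slide, not Solr code: `bag_of_words_match` treats all values of a field as a single bag of words, while `same_value_match` requires every query word to occur within one value. Both function names are made up for the example.

```python
def tokens(text):
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return set(text.lower().split())


def bag_of_words_match(values, query):
    """True if every query word occurs somewhere in the field, in any value."""
    bag = set()
    for value in values:
        bag |= tokens(value)
    return tokens(query) <= bag


def same_value_match(values, query):
    """True only if a single value contains every query word."""
    wanted = tokens(query)
    return any(wanted <= tokens(value) for value in values)


# The two records from the slide, each with a multi-valued "label".
labels = {
    "Record 1": ["man's best friend", "pooch"],
    "Record 2": ["man's worst enemy", "friend to no one"],
}
query = "man's friend"

bag_hits = [r for r, values in labels.items() if bag_of_words_match(values, query)]
value_hits = [r for r, values in labels.items() if same_value_match(values, query)]

print(bag_hits)    # ['Record 1', 'Record 2']
print(value_hits)  # ['Record 1']
```

Record 2 matches under the bag-of-words model only because "man's" and "friend" come from two different values.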
23. Limitations of Apache Solr
• Not efficient with highly heterogeneous
structured data sources
– Limitation on the number of attributes:
Dictionary size explosion
Query clause explosion when searching across all
attributes
• Limited support for structured query
– Multi-valued attributes
– No full-text search on attribute names
24. Full-text search on attribute names
• No support in Solr for “keyword search in
attribute names”.
• Query example
– (name OR label) = “Renaud Delbru”
– Solr is unable to find the records without the exact
attribute name
Record 1: rdfs:label Renaud Delbru
Record 2: foaf:name Renaud Delbru
Record 3: sioc:name Renaud Delbru
Record 4: full_name Renaud Delbru
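A minimal sketch of what keyword search over attribute names means for the four records above. It is an illustration of the idea only and does not reflect SIREn's or Solr's internals: the helper names and the rule for splitting attribute names (on ":", "_", and other non-word characters) are assumptions.

```python
import re


def attribute_tokens(attribute):
    """Split a name such as 'rdfs:label' or 'full_name' into lowercase words."""
    return {part for part in re.split(r"[:_\W]+", attribute.lower()) if part}


def search(records, attribute_keywords, value):
    """Return records where some attribute name contains one of the keywords
    and that attribute's value equals the requested value."""
    wanted = {keyword.lower() for keyword in attribute_keywords}
    hits = []
    for record_id, fields in records.items():
        for attribute, field_value in fields.items():
            if attribute_tokens(attribute) & wanted and field_value == value:
                hits.append(record_id)
                break
    return hits


records = {
    "Record 1": {"rdfs:label": "Renaud Delbru"},
    "Record 2": {"foaf:name": "Renaud Delbru"},
    "Record 3": {"sioc:name": "Renaud Delbru"},
    "Record 4": {"full_name": "Renaud Delbru"},
}

# Exact attribute-name lookup finds nothing: no field is literally "name" or "label".
exact_hits = [r for r, fields in records.items() if "name" in fields or "label" in fields]
keyword_hits = search(records, ["name", "label"], "Renaud Delbru")

print(exact_hits)    # []
print(keyword_hits)  # ['Record 1', 'Record 2', 'Record 3', 'Record 4']
```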
25. Limitations of Apache Solr
• Not efficient with highly heterogeneous
structured data sources
– Limitation on the number of attributes:
Dictionary size explosion
Query clause explosion when searching across all
attributes
• Limited support for structured query
– Multi-valued attributes
– No full-text search on attribute names
– No 1:N relationship materialisation
29. Introducing large-scale RDF “Summaries”
We do it for:
• Data exploration
– How to find datasets about movies?
• Assisted SPARQL Query Editor
– What is the data structure?
• Dataset Quality
– How to differentiate relevant from irrelevant
datasets?
30. Large Scale RDF summaries
Class Level
12M relationships
10B relationships
34. Thank you
Sindice.com team April 2012
With the contribution of
Editor's notes
Search record (instead of entity)
Record-centric indexing model
Use case: Let's index the entire web of data
Doc/s, Lucene in action, uptime, etc.
How important is a dataset to my information need?
How to help users browse and filter out irrelevant datasets?
How can I measure the quality of a dataset? Data quality, objective measures.
Two datasets can overlap and provide similar information, but one dataset provides fresher information and is updated more frequently.
Concrete scenarios to test such assumptions.
Data quality can also be useful for improving data acquisition, optimising resources to retrieve only top-quality data.
- Define “relationships” when introducing the graph, BEFORE talking about the numbers
Number of entities per class
Number of relations of a certain predicate
Other metadata can be added to a class, e.g., other predicates used with the entities of that class
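One plausible reading of a class-level summary, sketched in Python. The triples, the `ex:` names, and the function `summarize` are all invented for illustration, and this is not presented as the Sindice implementation. It counts entities per class and, for each predicate, how many relations link instances of one class to instances of another.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"


def summarize(triples):
    """Return (entities per class, relation counts between classes)."""
    classes_of = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            classes_of[subject].add(obj)

    entities_per_class = Counter()
    for classes in classes_of.values():
        for cls in classes:
            entities_per_class[cls] += 1

    class_edges = Counter()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            continue
        # Only relations whose subject and object both have a class are
        # lifted to the class level; literals such as titles are skipped.
        for subject_class in classes_of.get(subject, ()):
            for object_class in classes_of.get(obj, ()):
                class_edges[(subject_class, predicate, object_class)] += 1
    return entities_per_class, class_edges


# Hypothetical toy dataset about movies.
triples = [
    ("ex:m1", "rdf:type", "ex:Movie"),
    ("ex:m2", "rdf:type", "ex:Movie"),
    ("ex:p1", "rdf:type", "ex:Person"),
    ("ex:m1", "ex:director", "ex:p1"),
    ("ex:m2", "ex:director", "ex:p1"),
    ("ex:m1", "ex:title", "Some title"),
]

entities_per_class, class_edges = summarize(triples)
print(dict(entities_per_class))  # {'ex:Movie': 2, 'ex:Person': 1}
print(dict(class_edges))         # {('ex:Movie', 'ex:director', 'ex:Person'): 2}
```

Two instance-level relations collapse into one class-level edge with a count, which is how a summary can stay far smaller than the data it describes.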