2. Scenario
• People who are politicians and actors
• Who else?
• Where do they live?
• Whom do they know?
SchemEX – Mathias Konrath, Thomas Gottron, Ansgar Scherp 2 of 12
3. Problem
• Execute those queries on the LOD cloud
• No single federated query interface provided
[Figure: example query “politicians and actors”]
4. Solution in Principle
• Suitable index structure for looking up sources
[Figure: example query “politicians and actors”]
5. The Naive Approach
1. Download the entire LOD cloud
2. Put it into a (really) large triple store
3. Process the data and extract schema
4. Provide lookup
- Requires big machinery
- Data is processed late, only after the download has finished
- High effort to scale with the LOD cloud
6. Yes, we can …
7. The SchemEX Approach
• Stream-based schema extraction
• While crawling the data
[Figure: SchemEX processing pipeline. Components shown: LOD-Crawler, RDF-Dump, Parser (NxParser), Nquad-Stream, Instance-Cache (FIFO), Schema-Extractor, Schema, RDF Triple Store, RDBMS]
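The stream-based idea can be illustrated with a minimal sketch: quads are parsed lazily, one line at a time, so schema extraction can start while the crawl is still running. This is not the NxParser API; `Quad`, `parse_nquads`, and `type_statements` are hypothetical names, and the parser is deliberately simplified to statements that consist of four IRIs.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str  # the data source the statement was crawled from


def parse_nquads(lines: Iterable[str]) -> Iterator[Quad]:
    """Lazily parse N-Quads lines (simplified: IRI-only statements)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.rstrip(" .").split()
        if len(parts) != 4:
            continue  # literals with spaces etc. are ignored in this sketch
        s, p, o, c = (term.strip("<>") for term in parts)
        yield Quad(s, p, o, c)


def type_statements(quads: Iterable[Quad]) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, class, source) for every rdf:type statement."""
    for q in quads:
        if q.predicate == RDF_TYPE:
            yield (q.subject, q.obj, q.context)
```

Because both functions are generators, chaining them over a file handle or a crawler's output processes one statement at a time and never holds the whole data set in memory.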
8. Efficient Instance Cache
• Observes the quadruple stream from LDspider
• Ring queue, backed by a hash map
• Groups triples with the same subject URI
• Evicts the oldest entry when the cache is full (FIFO)
→ Runtime complexity O(1) per operation
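A minimal sketch of such a cache, assuming a fixed-size ring of subject URIs plus a hash map from subject to the statements collected so far. The names `Instance` and `InstanceCache` and the fields they carry are hypothetical; the slides do not show the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Instance:
    subject: str
    types: Set[str] = field(default_factory=set)
    properties: Set[str] = field(default_factory=set)
    contexts: Set[str] = field(default_factory=set)

    def add(self, predicate: str, obj: str, context: str) -> None:
        if predicate == RDF_TYPE:
            self.types.add(obj)
        else:
            self.properties.add(predicate)
        self.contexts.add(context)


class InstanceCache:
    """FIFO cache: ring queue of subjects, backed by a hash map."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._ring: List[Optional[str]] = [None] * capacity
        self._next = 0  # slot to overwrite next (the oldest once full)
        self._instances: Dict[str, Instance] = {}

    def add(self, s: str, p: str, o: str, c: str) -> Optional[Instance]:
        """Add one quad in O(1); return the evicted instance, if any."""
        inst = self._instances.get(s)
        evicted = None
        if inst is None:
            oldest = self._ring[self._next]
            if oldest is not None:
                evicted = self._instances.pop(oldest)
            self._ring[self._next] = s
            self._next = (self._next + 1) % self.capacity
            inst = self._instances[s] = Instance(s)
        inst.add(p, o, c)
        return evicted

    def flush(self) -> Iterator[Instance]:
        """Yield all remaining instances, oldest first."""
        for i in range(self.capacity):
            idx = (self._next + i) % self.capacity
            s = self._ring[idx]
            if s is not None:
                self._ring[idx] = None
                yield self._instances.pop(s)
```

Each evicted instance would be handed to the schema extractor. A subject that reappears after its eviction starts a new, incomplete instance, which is one way to see why the cache size affects the quality of the extracted schema.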
9. Building the Schema and Index
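[Figure: layered schema index]
• RDF classes: c1, c2, c3, …, ck
• Type clusters: TC1, TC2, …, TCm (related to RDF classes via consistsOf)
• Equivalence classes: EQC1, EQC2, …, EQCn (related to type clusters via hasEQClass; properties p1, p2 shown at this layer)
• Data sources: DS1, DS2, DS3, DS4, DS5, …, DSx (related to equivalence classes via hasDataSource)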
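The layered index can be sketched as three hash maps. This is a simplification under stated assumptions: an equivalence class is keyed here by an instance's type set and property set only, which may be coarser than the definition SchemEX actually uses, and `SchemaIndex` with its methods is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable, Set


class SchemaIndex:
    def __init__(self) -> None:
        # RDF class -> type clusters that contain it (inverse of consistsOf)
        self.class_to_tcs = defaultdict(set)
        # type cluster -> equivalence classes (hasEQClass)
        self.tc_to_eqcs = defaultdict(set)
        # equivalence class -> data sources (hasDataSource)
        self.eqc_to_sources = defaultdict(set)

    def add_instance(self, types: Iterable[str],
                     properties: Iterable[str], source: str) -> None:
        tc = frozenset(types)
        eqc = (tc, frozenset(properties))  # assumption: simplified EQC key
        for cls in tc:
            self.class_to_tcs[cls].add(tc)
        self.tc_to_eqcs[tc].add(eqc)
        self.eqc_to_sources[eqc].add(source)

    def sources_for_classes(self, classes: Iterable[str]) -> Set[str]:
        """Data sources holding instances typed with ALL given classes."""
        wanted = list(classes)
        if not wanted:
            return set()
        # Type clusters that contain every requested class.
        tcs = set.intersection(
            *(self.class_to_tcs.get(c, set()) for c in wanted))
        sources: Set[str] = set()
        for tc in tcs:
            for eqc in self.tc_to_eqcs[tc]:
                sources |= self.eqc_to_sources[eqc]
        return sources
```

With such an index, the "politicians and actors" query from the scenario becomes a lookup like `index.sources_for_classes(["Politician", "Actor"])`, where the class names stand in for full class URIs.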
10. Computing SchemEX: TimBL Data Set
• Analysis of a smaller data set
• 11 M triples, TimBL’s FOAF profile
• LDspider with ~ 2k triples / sec
• Different cache sizes: 100, 1k, 10k, 50k, 100k
• Compared SchemEX with reference schema
• Index queries on all Types, TCs, EQCs
• Good precision and recall at cache sizes of 50k and above
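How one index query could be scored against the reference schema can be sketched as follows. The slides do not state the exact evaluation procedure, so this assumes the usual set-based definitions applied to the data sources returned for a query; `precision_recall` is a hypothetical helper, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str],
                     relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for one index query.

    retrieved: data sources returned by the stream-built index
    relevant:  data sources returned by the reference schema
    """
    true_positives = len(retrieved & relevant)
    # Assumption of this sketch: an empty denominator yields 0.0.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative values only, not results from the talk.
    p, r = precision_recall({"ds1", "ds2", "ds3"},
                            {"ds1", "ds2", "ds4", "ds5"})
    print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these two values over all queries on types, type clusters, and equivalence classes would give one pair of numbers per cache size.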
11. Computing SchemEX: Full BTC 2011 Data
Cache size: 50 k
12. Conclusions: SchemEX
• Stream-based approach to schema extraction
• Scalable to arbitrary amounts of Linked Data
• Runs on commodity hardware
(4 GB RAM, standard single CPU)
• Lookup index to find relevant data sources
• Supports federated queries on the LOD cloud