The document describes LOTUS, a system for finding Linked Open Data (LOD) resources based on natural text queries. LOTUS indexes over 5 billion text literals from the LOD Laundromat. It supports various query modes like phrase matching and term matching. Evaluation shows LOTUS retrieves more resources compared to SPARQL and previous approaches. Current work involves adding language tags and improving ranking. LOTUS provides a single access point for exploring the LOD cloud via natural text queries.
3. Consuming LD
Finding relevant LD resources based on natural text
Central for application areas:
Information Retrieval
Named Entity Linking
“central indices (e.g. Sindice) have disappeared”
4. Let’s play a game ...
Find Linked Open Data resources for:
“HELR: The Harvard Environmental Law Review”
5. How do we find relevant resources on the Semantic Web today?
1) Dereference
literals are not dereferenceable by definition
2) SPARQL endpoints
find resources only in an explicitly stated set of datasets
exact or substring/regex matching
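To make the two SPARQL matching options concrete, here is a minimal Python sketch that builds an exact-match query and a regex-match query as strings. The choice of rdfs:label as the predicate and the helper names (escape_literal, exact_query, regex_query) are assumptions for illustration; the slides do not specify them.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def exact_query(label: str, lang: str = "en") -> str:
    """Exact matching: the literal must equal the label character for character."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?s WHERE {\n"
        f'  ?s rdfs:label "{escape_literal(label)}"@{lang} .\n'
        "}"
    )


def regex_query(pattern: str) -> str:
    """Substring/regex matching: the filter is evaluated against each label."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?s WHERE {\n"
        "  ?s rdfs:label ?label .\n"
        f'  FILTER regex(str(?label), "{escape_literal(pattern)}", "i")\n'
        "}"
    )


print(exact_query("Harvard Environmental Law Review"))
print(regex_query("Harvard Environmental Law Review"))
```

Either query still has to be sent to one specific endpoint, and the exact form finds nothing when the stored label differs slightly, for example “HELR: The Harvard Environmental Law Review”.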
8. Summarising: Findability is a problem on SW today
We need:
● a single entry point to the Linked Open Data cloud
● to find resources based on approximate text matching
9. Towards the Findability problem of the SW
We need:
● a single entry point to the Linked Open Data: LOD Laundromat
● to find resources based on approximate text matching: LOTUS
10. #1 LOD Laundromat
Infrastructure that washes other people’s dirty data and republishes it as RDF
Central entry point to the Linked Open Data cloud
12. #2 LOTUS
Full-text lookup index over LOD Laundromat
Finds resources based on associated natural text
Inspired by application areas: IR and NED
13. LOTUS’ approach
Text2Literal mapping
(and onwards to documents and resources)
for described resources
(with at least one associated literal)
that contain natural text
(numbers and dates are not findable)
through a rich string approximation model
(substring, phonetic, synonym matching, TF-IDF scoring, match granularity)
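The text-to-literal-to-resource idea with TF-IDF scoring can be sketched in a few lines of Python. This is a toy in-memory illustration, not the LOTUS implementation: the class name LiteralIndex, the ex: URIs, the tokenizer, and the exact TF-IDF variant are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


class LiteralIndex:
    """Toy text-to-literal index; each literal points back to its resource."""

    def __init__(self):
        self.entries = []                 # (resource URI, literal, term counts)
        self.doc_freq = Counter()         # number of literals containing a term
        self.postings = defaultdict(set)  # term -> positions in self.entries

    def add(self, resource, literal):
        counts = Counter(tokenize(literal))
        if not any(any(c.isalpha() for c in term) for term in counts):
            return  # skip literals without natural text (e.g. numbers, dates)
        position = len(self.entries)
        self.entries.append((resource, literal, counts))
        for term in counts:
            self.doc_freq[term] += 1
            self.postings[term].add(position)

    def search(self, query):
        """Return (score, resource, literal) tuples, best match first."""
        scores = defaultdict(float)
        total = len(self.entries)
        for term in set(tokenize(query)):
            for position in self.postings.get(term, ()):
                tf = self.entries[position][2][term]
                idf = math.log(1 + total / self.doc_freq[term])
                scores[position] += tf * idf
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        return [(score, *self.entries[pos][:2]) for pos, score in ranked]


index = LiteralIndex()
index.add("ex:helr", "Harvard Environmental Law Review")
index.add("ex:hlr", "Harvard Law Review")
index.add("ex:date", "2016-05-30")  # not indexed: no natural text
for score, resource, literal in index.search("environmental law review"):
    print(f"{score:.3f}  {resource}  {literal}")
```

In this sketch the rarer term “environmental” carries more weight than “law” or “review”, so ex:helr ranks above ex:hlr, and the date literal is never findable.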
20. Query modes
PHRASE: substring matching
phrase(“Harvard Environmental Law Review”)
TERMS: look up a set of terms
terms(“HELR. Harvard ELR Environmental Law Review”)
*optionally, supply a langtag:
phrase(“Harvard Environmental Law Review”, “en”)
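The difference between the two modes can be illustrated with a small in-memory Python sketch. It is not the LOTUS API: the function signatures, the sample literals, and in particular the reading of TERMS as “at least one query term matches” are assumptions for the example.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def phrase(literals, query, langtag=None):
    """Substring matching: keep literals that contain the whole query string."""
    needle = query.lower()
    return [
        (text, lang) for text, lang in literals
        if needle in text.lower() and (langtag is None or lang == langtag)
    ]


def terms(literals, query, langtag=None):
    """Term lookup: keep literals sharing at least one term with the query (assumed)."""
    wanted = set(tokenize(query))
    return [
        (text, lang) for text, lang in literals
        if wanted & set(tokenize(text)) and (langtag is None or lang == langtag)
    ]


# Hypothetical (literal, langtag) pairs standing in for the index.
LITERALS = [
    ("HELR: The Harvard Environmental Law Review", "en"),
    ("Harvard Law Review", "en"),
    ("Negen straatjes (Amsterdam), 9 straatjes", "nl"),
]

print(phrase(LITERALS, "Harvard Environmental Law Review"))
print(terms(LITERALS, "HELR. Harvard ELR Environmental Law Review"))
print(phrase(LITERALS, "Harvard Environmental Law Review", "nl"))
```

Under these assumptions, phrase is the stricter mode and returns only the first literal, terms also returns “Harvard Law Review”, and the langtag filter narrows either result set.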
22. Preliminary Evaluation
191 local monuments, manually extracted from a Dutch tour guide
List of 231 scientific journals from a Norwegian Social Sciences Data Services website
27. Preliminary Evaluation
Text queries for which we find at least one resource
                                Local monuments   Scientific journals   Overall %
Number of queries               191               231
in DBpedia (via SPARQL)         53                77                    30.8%
in DBpedia (via LOTUS phrase)   165               182                   82.2%
in LOD (via LOTUS phrase)       168               216                   91.0%
in LOD (via LOTUS terms)        188               231                   99.3%
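The slides do not state how the Overall % column is computed, but pooling both query sets (queries with at least one resource, divided by all 422 queries) reproduces every reported value. A short Python check, with variable and function names chosen for this example:

```python
QUERIES = {"monuments": 191, "journals": 231}

FOUND = {
    "DBpedia (via SPARQL)": {"monuments": 53, "journals": 77},
    "DBpedia (via LOTUS phrase)": {"monuments": 165, "journals": 182},
    "LOD (via LOTUS phrase)": {"monuments": 168, "journals": 216},
    "LOD (via LOTUS terms)": {"monuments": 188, "journals": 231},
}


def overall_percentage(found):
    """Share of all queries, across both sets, with at least one resource."""
    return 100 * sum(found.values()) / sum(QUERIES.values())


for setting, found in FOUND.items():
    print(f"{setting}: {overall_percentage(found):.1f}%")
```

For example, the SPARQL row is (53 + 77) / (191 + 231) = 130 / 422, which rounds to 30.8%.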
28. LOTUS v1.0
Start towards a natural text index over the LOD cloud
5.3B indexed literals can be looked up
Query modes for approximate matching
Accessible through web frontend and API
29. Current work (LOTUS v1.1)
Add langtags through automatic language detection
Extract knowledge base information from URIs
Extract the meaning of formatting conventions from URIs
Add conjunctive & fuzzy query modes
30. Future work
Evaluation of precision
◎ task-specific (IR, NED)
Integration of structured and unstructured data
Relevance and ranking
42. LOTUS vs Sindice
Sindice                                                           LOTUS
Relate URIs and literals to documents                             Relate URIs, literals and documents to each other
Accepts URIs that are dereferenceable or have a SPARQL endpoint   Accepts any type of data
Partially incorrect datasets are excluded                         Partially incorrect datasets are included
Relies on original URI availability                               Original URI can be ‘down’
30M URIs & 45M literals                                           3,700M URIs & 5,320M literals
43. You will have a bad time finding these via SPARQL
“National Socialist German Workers' Party Foreign Organisation”
“The NSDAP/AO was the Foreign Organization of the National Socialist German Workers Party (NSDAP).”
“De 9 straatjes”
“Negen straatjes (Amsterdam), 9 straatjes”@nl
“Shopping guide: negen straatjes”@nl-NL
44. You will have a bad time finding these in SPARQL
"1375 W Lake Street"
"1501 W. Randolph St."
"29 North 7th Street"
"Fritz-Pregl-Str. 5"@en
"33-35 Stoke Newington Road"
"Trompsingel 27"
"226 Broadway, 2nd Floor"
"Shinbo Building, 402-22, B1 Seogyo-dong, Mapo-gu"
45. Preliminary Evaluation (recall)
% of DBpedia resources in top 100 results
               Local monuments   Scientific journals
LOTUS phrase   70.48%            24.83%
LOTUS terms    67.19%            22.33%