cristian_lai_webofdata

Query modeling and information retrieval within
the Web of Data
Cristian LAI
clai@crs4.it

CRS4

September 6, 2012


Outline

- Motivation
- Unstructured Data
- Structured Data
- Query building
- Applications
- Conclusion


Context
Semantic Web

http://www.w3.org/2006/Talks/1023-sb-W3CTechSemWeb/

Motivation
Search on the Web

http://www.slideshare.net/novaspivack/web-evolution-nova-spivack-twine

Wikipedia

- Started in 2001.
- Is a multilingual, web-based, free-content encyclopedia project based on
  an openly editable model.
- Is the 5th most-visited site on the web, serving 454 million unique
  visitors monthly as of March 2011.
- Has fewer than 100 employees.
- Wikipedia holds an annual fundraiser instead of accepting advertising. You
  may have seen "A personal appeal from Wikipedia founder Jimmy Wales" if
  you used the online encyclopedia during the last weeks of 2011. Google
  co-founder Sergey Brin and his wife, Anne Wojcicki, have given a $500,000
  grant to help Wikipedia fund its $28.3 million annual budget.


Wikipedia

- Pros:
  - Is a highly efficient not-for-profit organization.
  - Is the finest example of truly collaboratively created content:
    >19M articles, >270 languages, >82k active contributors.
  - Covers many topics and domains; articles are the result of community
    consensus.
- Cons:
  - Contains many inconsistencies.
    - Disclaimer: Wikipedia cannot guarantee the validity of the
      information found here.
  - Is not very well integrated with other data sources.
  - Queries and search are hampered by the lack of a structured
    representation.


Issues

- Unstructured data, keyword-based search.
- Simple questions are hard to answer.
  - People who were born in Rome before 1900.
  - Italian musicians with English and French descriptions.
  - The official websites of companies with more than 500 employees.
- The information required to answer these is contained in Wikipedia.
- Transforming Wikipedia into a knowledge base.
  - To reveal the structure and semantics of Wikipedia content.
  - The DBpedia project.


Structure in Wikipedia
- Wikipedia articles consist mostly of free text, but also contain different
  types of structured information, such as infobox templates, categorisation
  information, images, geo-coordinates, and links to external Web pages.
- Title
- Abstract
- Infobox template
- Geo-coordinates
- Categories
- Images
- Links
  - other language versions
  - other Wikipedia pages
  - redirects
  - disambiguation

Structured Information in Wikipedia


RDF representation
Knowledge Base

dbp:Cagliari rdf:type dbp:City
dbp:Cagliari dbp:Title "Cagliari"
dbp:Cagliari dbp:Country dbp:Italy
dbp:Cagliari dbp:postalCode 09100
dbp:Cagliari geo:lat "39.246387"^^xsd:float
dbp:Cagliari geo:long "9.057500"^^xsd:float
dbp:Cagliari rdf:type yago:MediterraneanPortCitiesAndTownsInItaly
...

- An environment for collecting and structuring data.
- A well-defined classification structure.
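
As a rough illustration of the knowledge-base view above, the Cagliari statements can be held as plain Python tuples and grouped by predicate. This is only a sketch: the tuple layout and the `describe` helper are assumptions made for illustration, not part of DBpedia or of any RDF library, and the postal code is stored as a string here to keep its leading zero.

```python
from collections import defaultdict

# The statements from the slide, kept as (subject, predicate, object) tuples.
# Prefixed names are left as plain strings; no RDF library is involved.
TRIPLES = [
    ("dbp:Cagliari", "rdf:type", "dbp:City"),
    ("dbp:Cagliari", "dbp:Title", "Cagliari"),
    ("dbp:Cagliari", "dbp:Country", "dbp:Italy"),
    ("dbp:Cagliari", "dbp:postalCode", "09100"),  # string, to keep the leading zero
    ("dbp:Cagliari", "geo:lat", 39.246387),
    ("dbp:Cagliari", "geo:long", 9.057500),
    ("dbp:Cagliari", "rdf:type", "yago:MediterraneanPortCitiesAndTownsInItaly"),
]


def describe(subject, triples):
    """Group every object stated about `subject` under its predicate (hypothetical helper)."""
    description = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            description[p].append(o)
    return dict(description)


cagliari = describe("dbp:Cagliari", TRIPLES)
print(cagliari["rdf:type"])
```

A resource with two `rdf:type` statements simply ends up with two entries under that predicate, which is how one city can belong to several classes at once.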


RDF

- Triples: (subject, predicate, object)
- Subject and object
  - are both URIs that each identify a resource, or a URI and a string
    literal respectively.
- Predicate
  - specifies how the subject and object are related, and is also
    represented by a URI.
- For example:
  - A knows B
  - C isAuthorOf D
  - Two resources linked in this fashion can be drawn from different data
    sets on the Web, allowing data in one data source to be linked to that
    in another, thereby creating a Web of Data.
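
The linking idea can be sketched with two small in-memory data sets, where a triple in one points at a resource described in the other. Everything below is hypothetical: the `example` URIs, the vocabulary terms, and the helper functions are invented for illustration and stand in for real, independently published sources.

```python
# Two independent data sets, each a list of (subject, predicate, object) tuples.
# All URIs below are made up for illustration.
PEOPLE = [
    ("http://people.example/A", "http://vocab.example/knows", "http://people.example/B"),
    ("http://people.example/C", "http://vocab.example/isAuthorOf", "http://books.example/D"),
]
BOOKS = [
    ("http://books.example/D", "http://vocab.example/title", "An Example Book"),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in `triples`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def titles_written_by(author):
    """Follow isAuthorOf links from PEOPLE into BOOKS and collect the titles."""
    titles = []
    for work in objects(PEOPLE, author, "http://vocab.example/isAuthorOf"):
        titles.extend(objects(BOOKS, work, "http://vocab.example/title"))
    return titles


print(titles_written_by("http://people.example/C"))
```

The only thing the two lists share is the URI of D, which is what lets a lookup that starts in one source continue in the other.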


DBpedia

- Started in 2007.
- Is the result of a community effort to extract structured information from
  Wikipedia.
- Makes Wikipedia data available as RDF.
- Results: The DBpedia Data Set
  - describes 3.64 million "things" with over half a billion "facts"
    (July 2011): 364k persons, 462k places, 99k music albums, 54k films,
    148k organisations;
  - extraction in 97 different languages;
  - 672M RDF triples.
- It is maintained by: Universität Leipzig, Freie Universität Berlin, OpenLink
  Software, Inc.
- See http://wiki.dbpedia.org/Team


Nucleus of the Web of Data

- Within the W3C Linking Open Data (LOD) community effort.
- Tim Berners-Lee's Linked Data principles.
  - URI
  - HTTP
  - RDF, SPARQL
  - Interlinking among data providers
- An increasing number of data providers have started to publish and
  interlink data on the Web.
- The result amounts to several billion RDF triples and covers domains such
  as geographic information, people, companies, online communities, films,
  music, books and scientific publications.


LOD Datasets


SPARQL Query Language

- RDF is a directed, labeled graph data format for representing information
  (also on the Web).
- SPARQL is a language for querying RDF graphs by specifying templates
  against which to compare graph components. Data which matches or
  satisfies a template is returned from the query.
- A triple template contains variables that represent triple components
  (e.g., ?s, ?p, or ?o within a triple).
- Example:
  - ?person ex:age "20"^^xsd:integer .
  - Identifies a list of triple subjects that have an ex:age property of
    "20". Analogous to asking "Who has age 20?".
  - The SPARQL query engine will return a list of the subject components of
    triples that satisfy each query through value substitution.
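
Template matching by value substitution can be sketched in a few lines of Python. This is a toy matcher for a single triple template over an in-memory list, not a SPARQL engine; the `match` and `is_variable` helpers and the `ex:` data are assumptions made up for the example.

```python
def is_variable(term):
    """Treat any string that starts with '?' as a variable."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits `pattern`."""
    solutions = []
    for triple in triples:
        binding = {}
        for wanted, actual in zip(pattern, triple):
            if is_variable(wanted):
                # A variable repeated in the pattern must bind to the same value.
                if binding.setdefault(wanted, actual) != actual:
                    break
            elif wanted != actual:
                break
        else:
            solutions.append(binding)
    return solutions


# Made-up data; the integer 20 stands in for the typed literal "20"^^xsd:integer.
DATA = [
    ("ex:alice", "ex:age", 20),
    ("ex:bob", "ex:age", 35),
    ("ex:carol", "ex:age", 20),
    ("ex:alice", "ex:knows", "ex:bob"),
]

who = [b["?person"] for b in match(("?person", "ex:age", 20), DATA)]
print(who)
```

Each solution is a mapping from variable names to the values that were substituted for them, which mirrors the "Who has age 20?" reading of the template.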


SPARQL Queries
SELECT variables_list
FROM <RDF_source_URL>
WHERE {
  { triple_pattern_1 .
    ...
    triple_pattern_n . } .
}

Example query:

SELECT ?person
FROM <http://ex.com>
WHERE {
  ?person ex:age "20"^^xsd:integer .
}

Result:

?person
------------------
_p1
_p2
...


The DBpedia SPARQL endpoint

- All data sets are available for queries via the DBpedia SPARQL endpoint
  (http://dbpedia.org/sparql).
- Querying the data set:
  - ...
  - Abstracts of movies starring Tom Cruise, released before 1999.
  - The official websites of companies with more than 50000 employees.
  - Cities with more than 2 million inhabitants.
  - ...
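
One way to address the endpoint from a program is a plain HTTP GET with the query in a `query` parameter, as in the sketch below. It only prepares the request and does not send it. The `build_request` helper and the sample query are assumptions for illustration, and whether the endpoint honours the `Accept` header shown here should be checked against its own documentation.

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint URL given on the slide


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the SPARQL JSON results format via content negotiation.
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


# Illustrative query only; the LIMIT keeps any eventual response small.
QUERY = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } LIMIT 5"

request = build_request(QUERY)
print(request.full_url)
# To actually run it (network access required):
#     with urllib.request.urlopen(request) as response:
#         body = response.read().decode("utf-8")
```

Keeping construction separate from sending makes the URL easy to inspect before any network traffic happens.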


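Abstracts of movies starring Tom Cruise, released before 1999
SPARQL

# Prefixes are spelled out so the query does not rely on endpoint defaults;
# dbpedia2: is assumed to denote the DBpedia property namespace.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?subject ?label ?released ?abstract WHERE {
  ?subject rdf:type <http://dbpedia.org/ontology/Film> .
  ?subject dbpedia2:starring <http://dbpedia.org/resource/Tom_Cruise> .
  ?subject rdfs:comment ?abstract .
  ?subject rdfs:label ?label .
  FILTER(lang(?abstract) = "en" && lang(?label) = "en") .  # English text only
  ?subject <http://dbpedia.org/ontology/releaseDate> ?released .
  # As written, this cutoff keeps films released before 2000-01-01.
  FILTER(xsd:date(?released) < "2000-01-01"^^xsd:date) .
} ORDER BY ?released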
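
Results of a SELECT query such as this one are commonly returned in the SPARQL JSON results layout, with a `head` listing the variables and one `bindings` entry per solution. The sketch below flattens such a document into simple rows. The sample response is hand-written with placeholder values, not real DBpedia output, and the `rows` helper is a hypothetical name.

```python
import json

# A hand-written response in the SPARQL JSON results layout; the values are
# placeholders, not real DBpedia output.
SAMPLE = """
{
  "head": {"vars": ["subject", "label", "released"]},
  "results": {"bindings": [
    {"subject": {"type": "uri", "value": "http://example.org/film/1"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Example Film"},
     "released": {"type": "literal",
                  "datatype": "http://www.w3.org/2001/XMLSchema#date",
                  "value": "1990-01-01"}},
    {"subject": {"type": "uri", "value": "http://example.org/film/2"},
     "label": {"type": "literal", "xml:lang": "en", "value": "Another Film"}}
  ]}
}
"""


def rows(document):
    """Flatten each solution to {variable: value}; unbound variables become None."""
    variables = document["head"]["vars"]
    return [
        {name: binding.get(name, {}).get("value") for name in variables}
        for binding in document["results"]["bindings"]
    ]


table = rows(json.loads(SAMPLE))
for row in table:
    print(row["released"], row["label"])
```

A variable may be missing from a solution, so the helper fills it with `None` rather than assuming every row is complete.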


Linked Data Search Engines and Indexes

- A number of search engines have been developed that crawl Linked Data
  from the Web by following RDF links, and provide query capabilities over
  aggregated data.
  Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a
  Global Data Space (1st edition). Synthesis Lectures on the Semantic Web:
  Theory and Technology, 1:1, 1-136. Morgan & Claypool.
- Google, Bing and Yahoo! have agreed to create and support a common
  vocabulary for structured data markup on web pages.
- Facebook has started to support RDF and Linked Data URIs and now
  provides access to parts of its user data via a Linked Data API.


Google rich snippets


Twitter, #annotations
Twitter API based client


Lookup annotations


Resource #dbpedia:Cagliari


Question answering
Cagliari resource


Question answering
Template


Question answering
RDF/XML


Conclusion

- Data on the Web is a major challenge; technologies are needed to use it,
  to interact with it, and to integrate it.
- Semantic Web technologies (RDF, SPARQL, etc.) can play a major role in
  publishing and using Data on the Web.
- Users can benefit greatly from the wide world of structured content.
- Content providers joining the Linking Open Data project are helping to
  create more meaningful navigation paths, not only within websites but
  across the whole web.


Q&A
