Querying Linked Data with SPARQL (2010)

WWW 2010 Tutorial "How to Consume Linked Data on the Web"

Brief Introduction to SPARQL
● SPARQL: Query Language for RDF data*
● Main idea: pattern matching
● Describe subgraphs of the queried RDF graph
● Subgraphs that match your description yield a result
● Means: graph patterns (i.e., RDF graphs with variables)

    ?v rdf:type http://.../Volcano

* http://www.w3.org/TR/rdf-sparql-query/

Brief Introduction to SPARQL
Queried graph:
    http://.../Mount_Baker  rdf:type        http://.../Volcano
    http://.../Mount_Baker  p:lastEruption  "1880"
    http://.../Mount_Etna   rdf:type        http://.../Volcano

Graph pattern:
    ?v rdf:type http://.../Volcano

Results:
    ?v
    http://.../Mount_Baker
    http://.../Mount_Etna
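
The matching step on this slide can be sketched in plain Java. This is only an illustration of the idea, not how a SPARQL engine is actually implemented: the names PatternMatchSketch, Triple, match, and exampleGraph are hypothetical, and the elided http://.../ URIs from the slide are kept as opaque strings.

```java
import java.util.ArrayList;
import java.util.List;

public class PatternMatchSketch {
    // One RDF triple; all three terms are kept as plain strings (assumption for brevity).
    record Triple(String s, String p, String o) {}

    // Evaluates the pattern "?v <predicate> <object>" and returns the bindings for ?v.
    static List<String> match(List<Triple> graph, String predicate, String object) {
        List<String> bindings = new ArrayList<>();
        for (Triple t : graph) {
            if (t.p().equals(predicate) && t.o().equals(object)) {
                bindings.add(t.s());
            }
        }
        return bindings;
    }

    // The queried graph from the slide, with its elided URIs left as they are.
    static List<Triple> exampleGraph() {
        return List.of(
            new Triple("http://.../Mount_Baker", "rdf:type", "http://.../Volcano"),
            new Triple("http://.../Mount_Baker", "p:lastEruption", "\"1880\""),
            new Triple("http://.../Mount_Etna", "rdf:type", "http://.../Volcano"));
    }

    public static void main(String[] args) {
        // Prints the two subjects that are typed as Volcano.
        System.out.println(match(exampleGraph(), "rdf:type", "http://.../Volcano"));
    }
}
```

The p:lastEruption triple does not match the pattern, so it contributes nothing to the result.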

SPARQL Endpoints
● Linked Data sources usually provide a
SPARQL endpoint for their dataset(s)
● SPARQL endpoint: SPARQL query processing
service that supports the SPARQL protocol*
● Send your SPARQL query, receive the result

* http://www.w3.org/TR/rdf-sparql-protocol/


SPARQL Endpoints
Data Source          Endpoint Address
DBpedia              http://dbpedia.org/sparql
Musicbrainz          http://dbtune.org/musicbrainz/sparql
U.S. Census          http://www.rdfabout.com/sparql
Semantic Crunchbase  http://cb.semsol.org/sparql

More complete list:
http://esw.w3.org/topic/SparqlEndpoints

Accessing a SPARQL Endpoint
● SPARQL endpoints: RESTful Web services
● Issuing SPARQL queries to a remote SPARQL
endpoint is basically an HTTP GET request to
the SPARQL endpoint with parameter query

GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1

(The value of the query parameter is the URL-encoded string
with the SPARQL query.)
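
Building such a request URL needs nothing beyond the Java standard library. The following is a minimal sketch; the class name SparqlGetUrl and the method build are hypothetical, and the query in main is just a placeholder.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SparqlGetUrl {
    // Appends the URL-encoded SPARQL query as the "query" parameter of the endpoint address.
    static URI build(String endpoint, String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        String query = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"; // placeholder query
        System.out.println(build("http://dbpedia.org/sparql", query));
    }
}
```

URLEncoder uses form encoding, so spaces become "+" and characters such as "?" and "{" are percent-encoded.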

Query Results Formats
● SPARQL endpoints usually support different
result formats:
● XML, JSON, plain text
(for ASK and SELECT queries)
● RDF/XML, NTriples, Turtle, N3
(for DESCRIBE and CONSTRUCT queries)


Query Results Formats
PREFIX dbp: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?name ?bday WHERE {
    ?p dbp:birthplace <http://dbpedia.org/resource/Berlin> .
    ?p dbpprop:dateOfBirth ?bday .
    ?p dbpprop:name ?name .
}

Plain text:

name                   | bday
-----------------------+------------
Alexander von Humboldt | 1769-09-14
Ernst Lubitsch         | 1892-01-28
...

SPARQL Query Results XML Format (http://www.w3.org/TR/rdf-sparql-XMLres/):

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="name"/>
    <variable name="bday"/>
  </head>
  <results distinct="false" ordered="true">
    <result>
      <binding name="name">
        <literal xml:lang="en">Alexander von Humboldt</literal>
      </binding>
      <binding name="bday">
        <literal datatype="http://www.w3.org/2001/XMLSchema#date">1769-09-14</literal>
      </binding>
    </result>
    <result>
      <binding name="name">
        <literal xml:lang="en">Ernst Lubitsch</literal>
      </binding>
      <binding name="bday">
        <literal datatype="http://www.w3.org/2001/XMLSchema#date">1892-01-28</literal>
      </binding>
    </result>
  </results>
</sparql>
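
An XML result document like this one can be read with the DOM parser that ships with Java. The sketch below is one possible way to do it, not the only one; the class XmlResultsSketch and its methods parse and sample are hypothetical names, and sample() returns a shortened copy of the document above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlResultsSketch {
    static final String NS = "http://www.w3.org/2005/sparql-results#";

    // Returns one map per <result>, from variable name to the text of its binding.
    static List<Map<String, String>> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagNameNS(NS, "result");
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings = result.getElementsByTagNameNS(NS, "binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element binding = (Element) bindings.item(j);
                    row.put(binding.getAttribute("name"), binding.getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse SPARQL XML results", e);
        }
    }

    // Shortened copy of the result document shown above (first result only).
    static String sample() {
        return "<?xml version=\"1.0\"?>"
            + "<sparql xmlns=\"" + NS + "\">"
            + "<head><variable name=\"name\"/><variable name=\"bday\"/></head>"
            + "<results distinct=\"false\" ordered=\"true\"><result>"
            + "<binding name=\"name\"><literal xml:lang=\"en\">Alexander von Humboldt</literal></binding>"
            + "<binding name=\"bday\"><literal datatype=\"http://www.w3.org/2001/XMLSchema#date\">1769-09-14</literal></binding>"
            + "</result></results></sparql>";
    }

    public static void main(String[] args) {
        System.out.println(parse(sample()));
    }
}
```

This sketch keeps only the lexical value of each binding; language tags and datatypes are dropped.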

Serializing SPARQL Query Results in JSON (http://www.w3.org/TR/rdf-sparql-json-res/):

{
  "head": { "link": [], "vars": ["name", "bday"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "name": { "type": "literal",
                "xml:lang": "en",
                "value": "Alexander von Humboldt" },
      "bday": { "type": "typed-literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#date",
                "value": "1769-09-14" }
    },
    { "name": { "type": "literal",
                "xml:lang": "en",
                "value": "Ernst Lubitsch" },
      "bday": { "type": "typed-literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#date",
                "value": "1892-01-28" }
    },
    // ...
  ] }
}

Query Result Formats
● Use the Accept header to request the
preferred result format:
GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
Accept: application/sparql-results+json
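
With the HTTP client in the Java standard library (java.net.http, Java 11 or later), the same request could be prepared roughly as follows. The class name AcceptHeaderSketch and the method jsonRequest are hypothetical; the sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AcceptHeaderSketch {
    // Builds a GET request that asks the endpoint for SPARQL results in JSON.
    static HttpRequest jsonRequest(URI queryUri) {
        return HttpRequest.newBuilder(queryUri)
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // The query string is assumed to be URL-encoded already.
        URI uri = URI.create("http://dbpedia.org/sparql?query=ASK+%7B%7D");
        System.out.println(jsonRequest(uri).headers().map());
    }
}
```

To actually run the query, the request would be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).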


Query Result Formats
● As an alternative, some SPARQL endpoint
implementations (e.g. Joseki) provide an
additional parameter out

GET /sparql?out=json&query=... HTTP/1.1
Host: dbpedia.org


● More convenient: use a library
● Libraries:
● SPARQL JavaScript Library
http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
● ARC for PHP
http://arc.semsol.org/
● RAP – RDF API for PHP
http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html


● Libraries (cont.):
● Jena / ARQ (Java) http://jena.sourceforge.net/
● Sesame (Java) http://www.openrdf.org/
● SPARQL Wrapper (Python)
http://sparql-wrapper.sourceforge.net/
● PySPARQL (Python)
http://code.google.com/p/pysparql/


● Example with Jena / ARQ:
import com.hp.hpl.jena.query.*;

String service = "..."; // address of the SPARQL endpoint
String query = "SELECT ..."; // your SPARQL query
QueryExecution e = QueryExecutionFactory.sparqlService( service, query );
ResultSet results = e.execSelect();
while ( results.hasNext() ) {
    QuerySolution s = results.nextSolution();
    // ...
}
e.close();

● Querying a single dataset is quite boring
compared to issuing SPARQL queries over
multiple datasets

● How can you do this?
1. Issue follow-up queries to different endpoints
2. Query a central collection of datasets
3. Build your own store with copies of relevant datasets
4. Use a query federation system


Follow-up Queries
● Idea: issue follow-up queries over other
datasets based on results from previous
queries
● Substituting placeholders in query templates


String s1 = "http://cb.semsol.org/sparql";
String s2 = "http://dbpedia.org/sparql";

String qTmpl = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
             + "SELECT ?c WHERE { <%s> rdfs:comment ?c }";

// q1: find a list of companies filtered by some criteria
// and return DBpedia URIs of them (bound to ?s)
String q1 = "SELECT ?s WHERE { ...";
QueryExecution e1 = QueryExecutionFactory.sparqlService(s1, q1);
ResultSet results1 = e1.execSelect();
while ( results1.hasNext() ) {
    QuerySolution sol = results1.nextSolution();
    // fill the placeholder in the template with the URI from the first query
    String q2 = String.format( qTmpl, sol.getResource("s").getURI() );
    QueryExecution e2 = QueryExecutionFactory.sparqlService(s2, q2);
    ResultSet results2 = e2.execSelect();
    while ( results2.hasNext() ) {
        // ...
    }
    e2.close();
}
e1.close();

Follow-up Queries
● Advantage:
● Queried data is up-to-date
● Drawbacks:
● Requires the existence of a SPARQL endpoint for
each dataset
● Requires program logic
● Very inefficient (one remote query per intermediate result)


Querying a Collection of Datasets
● Idea: Use an existing SPARQL endpoint that
provides access to a set of copies of relevant
datasets
● Example:
● SPARQL endpoint by OpenLink SW over a majority
of datasets from the LOD cloud at:
http://lod.openlinksw.com/sparql


Querying a Collection of Datasets
● Advantage:
● No need for specific program logic
● Drawbacks:
● Queried data might be out of date
● Not all relevant datasets may be in the collection


Own Store of Dataset Copies
● Idea: Build your own store with copies of
relevant datasets and query it
● Possible stores:
● Jena TDB http://jena.hpl.hp.com/wiki/TDB
● Sesame http://www.openrdf.org/
● OpenLink Virtuoso http://virtuoso.openlinksw.com/
● 4store http://4store.org/
● AllegroGraph http://www.franz.com/agraph/
● etc.


Populating Your Store
● Get RDF dumps provided for the datasets
● (Focused) Crawling

● ldspider http://code.google.com/p/ldspider/
● Multithreaded API for focused crawling
● Crawling strategies (breadth-first, load-balancing)
● Flexible configuration with callbacks and hooks


Own Store of Dataset Copies
● Advantages:
● Can include all datasets
● Independent of the existence, availability, and
efficiency of SPARQL endpoints
● Drawbacks:
● Requires effort to set up and to operate the store
● Ideally, data sources provide RDF dumps; if not?
● How to keep the copies in sync with the originals?
● Queried data might be out of date

Federated Query Processing
● Idea: Querying a mediator which
distributes subqueries to
relevant sources and
integrates the results


● Instance-based federation
● Each thing described by only one data source
● Untypical for the Web of Data
● Triple-based federation
● No restrictions
● Requires more distributed joins

● Statistics about datasets required (both cases)


● DARQ (Distributed ARQ)
http://darq.sourceforge.net/
● Query engine for federated SPARQL queries
● Extension of ARQ (query engine for Jena)
● Last update: June 28, 2006


● Semantic Web Integrator and Query Engine
(SemWIQ) http://semwiq.sourceforge.net/
● Actively maintained by Andreas Langegger


● Advantages:
● Queried data is up to date
● Drawbacks:
● Requires the existence of a SPARQL endpoint for
each dataset
● Requires effort to set up and configure the mediator


In any case:
● You have to know the relevant data sources
● When developing the app using follow-up queries
● When selecting an existing SPARQL endpoint over
a collection of dataset copies
● When setting up your own store with a collection of
dataset copies
● When configuring your query federation system
● You restrict yourself to the selected sources


There is an alternative:
Remember, URIs link to data

Automated Link Traversal
● Idea: Discover further data by looking up
relevant URIs in your application
● Can be combined with the previous approaches


Link Traversal Based Query Execution
● Applies the idea of automated link traversal to the
execution of SPARQL queries
● Idea:
● Intertwine query evaluation with traversal of RDF links
● Discover data that might contribute to query results
during query execution
● Alternate between:
● Evaluate parts of the query
● Look up URIs in intermediate solutions

Query Execution
● Example:
Return the unemployment rate of the countries in
which the movie http://mymovie.db/movie2449
was filmed.

SELECT ?c ?u WHERE {
    <http://mymovie.db/movie2449> mov:filming_location ?c .
    ?c geo:statistics ?cStats .
    ?cStats stat:unempRate ?u .
}

Query Execution
● Look up the URI mentioned in the query:
    http://mymovie.db/movie2449
● The retrieved data is added to the queried data:
    ...
    <http://mymovie.db/movie2449>
        mov:filming_location <http://geo.../Italy> .
    ...
● Evaluating the first triple pattern over the queried data
yields an intermediate solution:
    ?c
    http://geo.../Italy

Query Execution
● Look up the URI in the intermediate solution:
    http://geo.../Italy
● The retrieved data is added to the queried data:
    ...
    <http://geo.../Italy>
        geo:statistics <http://example.db/stat/IT> .
    ...
● Evaluating the second triple pattern extends the
intermediate solution:
    ?c                    ?cStats
    http://geo.../Italy   http://example.db/stat/IT

● Proceed with this strategy
(traverse RDF links
during query execution)
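
The alternation between evaluating triple patterns and looking up URIs can be sketched in plain Java. This is a simplified illustration under several assumptions, not the implementation of any of the systems named in this deck: the Web is replaced by a hypothetical in-memory map from URIs to triples, patterns are evaluated strictly in the order given, only URIs in subject and object position are looked up, and the unemployment rate value is a made-up placeholder. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LinkTraversalSketch {
    record Triple(String s, String p, String o) {}

    final Map<String, List<Triple>> web;          // stand-in for the Web: URI -> data returned on lookup
    final Set<Triple> queriedData = new LinkedHashSet<>();
    final Set<String> lookedUp = new HashSet<>();

    LinkTraversalSketch(Map<String, List<Triple>> web) { this.web = web; }

    // "Dereferences" a URI at most once and adds the retrieved triples to the queried data.
    void lookUp(String term) {
        if (term.startsWith("http") && lookedUp.add(term)) {
            queriedData.addAll(web.getOrDefault(term, List.of()));
        }
    }

    // Matches one pattern term against one data term; variables start with "?".
    static boolean bind(String patternTerm, String dataTerm, Map<String, String> solution) {
        if (!patternTerm.startsWith("?")) return patternTerm.equals(dataTerm);
        String existing = solution.putIfAbsent(patternTerm, dataTerm);
        return existing == null || existing.equals(dataTerm);
    }

    List<Map<String, String>> execute(List<Triple> patterns) {
        for (Triple p : patterns) { lookUp(p.s()); lookUp(p.o()); } // seed with URIs in the query
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(new HashMap<>());
        for (Triple p : patterns) {
            // Step A: look up the URIs bound in the intermediate solutions.
            for (Map<String, String> sol : solutions) sol.values().forEach(this::lookUp);
            // Step B: evaluate the next triple pattern over the data gathered so far.
            List<Map<String, String>> extended = new ArrayList<>();
            for (Map<String, String> sol : solutions) {
                for (Triple t : queriedData) {
                    Map<String, String> candidate = new HashMap<>(sol);
                    if (bind(p.s(), t.s(), candidate) && bind(p.p(), t.p(), candidate)
                            && bind(p.o(), t.o(), candidate)) {
                        extended.add(candidate);
                    }
                }
            }
            solutions = extended;
        }
        return solutions;
    }

    // The data from the walkthrough; the last triple and its value are placeholders.
    static Map<String, List<Triple>> exampleWeb() {
        String movie = "http://mymovie.db/movie2449", italy = "http://geo.../Italy";
        String stats = "http://example.db/stat/IT";
        return Map.of(
            movie, List.of(new Triple(movie, "mov:filming_location", italy)),
            italy, List.of(new Triple(italy, "geo:statistics", stats)),
            stats, List.of(new Triple(stats, "stat:unempRate", "\"placeholder-rate\"")));
    }

    static List<Triple> exampleQuery() {
        return List.of(
            new Triple("http://mymovie.db/movie2449", "mov:filming_location", "?c"),
            new Triple("?c", "geo:statistics", "?cStats"),
            new Triple("?cStats", "stat:unempRate", "?u"));
    }

    public static void main(String[] args) {
        System.out.println(new LinkTraversalSketch(exampleWeb()).execute(exampleQuery()));
    }
}
```

Only the movie URI is known at the start; the other two documents are discovered through the intermediate solutions, which is why no list of data sources has to be configured in advance.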

Query Execution
● Advantages:
● No need to know all data sources in advance
● No need for specific programming logic
● Queried data is up to date
● Does not depend on the existence of SPARQL
endpoints provided by the data sources
● Drawbacks:
● Not as fast as a centralized collection of copies
● Unsuitable for some queries
● Results might be incomplete

Implementations
● Semantic Web Client library (SWClLib) for Java
http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/
● SWIC for Prolog http://moustaki.org/swic/


Implementations
● SQUIN http://squin.org
● Provides SWClLib functionality as a Web service
● Accessible like a SPARQL endpoint
● Public SQUIN service at:
http://squin.informatik.hu-berlin.de/SQUIN/
● Install package: unzip and start
● Convenient access with SQUIN PHP tools:

$s = 'http:// …'; // address of the SQUIN service
$q = new SparqlQuerySock( $s, '… SELECT ...' );
$res = $q->getJsonResult(); // or getXmlResult()

Real-World Examples
● Return phone numbers of authors of ontology engineering papers
at ESWC'09.

SELECT DISTINCT ?author ?phone WHERE {
    ?pub swc:isPartOf
        <http://data.semanticweb.org/conference/eswc/2009/proceedings> .
    ?pub swc:hasTopic ?topic . ?topic rdfs:label ?topicLabel .
    FILTER regex( str(?topicLabel), "ontology engineering", "i" ) .

    ?pub swrc:author ?author .
    { ?author owl:sameAs ?authorAlt }
    UNION
    { ?authorAlt owl:sameAs ?author }

    ?authorAlt foaf:phone ?phone .
}

# of query results       2
# of retrieved graphs    297
# of accessed servers    16
avg. execution time      1min 30sec
