Semantic Web:
Comparison of SPARQL implementations
Rafał Małanij
Mat.No: B0105363
Thesis Project in partial fulfilment of the requirements for the Master's Degree
in Advanced Computer Systems Development.
University of the West of Scotland
School of Computing
29th September 2008
Abstract
The Semantic Web is a revolutionary approach to publishing data on the Internet, proposed years
ago by Tim Berners-Lee. Unfortunately, deploying the idea has proved more complex than was
assumed. Although the data model behind the concept is well established, a query language was
announced only recently. The SPARQL specification was a milestone on the way to fulfilling the
vision, but implementation attempts show that there is a need for further research in the area.
Some products are already available. This thesis evaluates five of them using a data set based
on DBpedia.org. First, each package is described, taking into consideration its documentation,
architecture and usability. The second part tests the ability to load a significant amount of data
efficiently and afterwards to compute, in reasonable time, the results of sample queries covering
the most important constructs of the language. The conclusion shows that although some of the
packages appear to be very advanced and complex products, they still have problems processing
queries based on the basic specification. The Semantic Web and its key technologies are very
promising, but they need more stable implementations to become popular.
LIST OF FIGURES
3.28. Pyrrho Database Manager showing local database sparql with the data stored in
Rdf$ table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.29. High-level class diagram of AllegroGraph. Source: AllegroGraph RDFStore (2008).119
3.30. The process of loading AllegroGraph server and querying a repository using Alle-
gro CL environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.31. Graph comparing average loading times of the best-performing configurations. . . . . . 133
List of Tables
3.1. Summary of loading data into OpenRDF Sesame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2. Summary of evaluating test queries on OpenRDF Sesame. . . . . . . . . . . . . . . . . . . . . . 78
3.3. Summary of loading data into OpenLink Virtuoso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4. Summary of evaluating test queries on OpenLink Virtuoso. . . . . . . . . . . . . . . . . . . . . 90
3.5. Summary of loading data using SDB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.6. Summary of evaluating test queries on repositories managed by SDB. . . . . . . . . . . . . 106
3.7. Summary of evaluating test queries against Pyrrho Professional. . . . . . . . . . . . . . . . . 116
3.8. Summary of loading data into AllegroGraph repository. . . . . . . . . . . . . . . . . . . . . . . . 123
3.9. Summary of evaluating test queries on AllegroGraph RDFStore. . . . . . . . . . . . . . . . . 125
3.10. Summary of loading data into tested implementations – configurations that had
the best performance for each implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.11. Summary of performing test queries – configurations that had the best perfor-
mance for each implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Introduction
In the late 1980s the Internet was becoming internationally established. However, retrieving
information from remote computer systems was a challenge due to the lack of a unified protocol
for accessing it. At the same time Tim Berners-Lee, a physicist at the CERN laboratory in
Switzerland, started to work on a protocol that would allow easier access to information distributed
over many computers. In 1989, with help from Robert Cailliau, Tim Berners-Lee published a
proposal for a new service - the World Wide Web. That was the beginning of the revolution. Within a
few years the WWW became the most popular service on the Internet.
In 1994 Tim Berners-Lee founded the World Wide Web Consortium (W3C), which started to work on
standardising the technologies that were to extend the functionality of the WWW. That was the time
when webpages became dynamic, but the “golden years” were still to come. The WWW was spotted
by the business community and the revolution spread around the world.
Now we can truly say that hyperlinks have revolutionised our lives - the way we publish infor-
mation and media, the way we buy and sell goods, the way we communicate. Almost everybody in
developed countries has a personalised email address and treats the Internet as a regular tool that
helps in everyday life. We can undoubtedly agree that the Internet is one of the pillars of the
revolution that is transforming the developed world into a knowledge-driven society.
However, some visionaries claim that this is not yet the Web of data and information. The meaning
of today’s Web content is accessible only to humans. Although search engines have become very
powerful tools, the quality of search results is relatively low. What is more, the results contain
only links to webpages where the information may possibly be found. Users still play the main
role in processing information published on the Internet.
Tim Berners-Lee was aware of all the imperfections of the Web. At the end of the 1990s he pro-
posed an extension to the current Web that he called the Semantic Web. Specialists announced
a revolution – Web 3.0. However, the implementation of that vision turned out to be more complex
than expected. The revolution was replaced by evolution.
In this thesis I will focus on one aspect of the Semantic Web – handling semantic data. First,
the vision of the Semantic Web along with its basic technologies will be presented. Then I will
examine what expectations derive from the Semantic Web’s foundations for the technologies that
will be responsible for accessing data on the Web. In the following chapter the W3C’s approach,
the SPARQL query language, will be presented together with a short introduction to the semantic
data model and the problem of querying the Semantic Web. SPARQL will be discussed in detail,
including the syntax, the implementation models and a review of the available literature about the
technology. The practical part of the research will involve a review of a number of available
implementations of SPARQL, which will be subjected to some basic usability tests. First,
the methodology will be presented together with a description of the data set used for testing. Then
each of the examined implementations will be reviewed and tested, and the findings presented. Finally,
the implementations will be compared where possible and some conclusions will be drawn.
1. Semantic Web
“The Semantic Web is not a separate Web
but an extension of the current one,
in which information is given well-defined meaning,
better enabling computers and people to work in cooperation.”
(Berners-Lee, Hendler & Lassila 2001)
1.1. Origins of the Semantic Web
The above quotation comes from one of the best known articles about the Semantic Web1 – “The
Semantic Web”, published in 2001 in Scientific American. It is considered the initiator
of the “semantic revolution” in IT. In fact, due to its popularity, a worldwide discussion has
emerged and some implementation efforts have commenced, but the first ideas were presented by
Tim Berners-Lee earlier, in his book “Weaving the Web: Origins and Future of the World Wide
Web” (Berners-Lee & Fischetti 1999).
Figure 1.1: W3C’s Semantic Web Logo
From the very beginning he was thinking about
the Web as a universal network in which docu-
ments would be connected to each other by their
meaning, in a way that enables automatic process-
ing of information. In “Weaving the Web” he
not only summarised his work on developing the Web into its current form, but also tried
to answer questions about the future of the Web.
1. Google Scholar finds it cited in 5304 articles, which gives it first place for the search phrase “semantic web”.
Source: http://scholar.google.co.uk/scholar?hl=en&lr=&q=semantic++web&btnG=Search. Retrieved on 2008.01.29.
Even before his article in Scientific American, Tim Berners-Lee and the scientists gathered around
the World Wide Web Consortium (W3C) started to work on the technologies that would form the basis
of the future Semantic Web2. They presented the vision in numerous lectures around
the world and supported initiatives for deploying these technologies in specific knowledge
areas. The first document describing ideas about the architecture, “Semantic Web Roadmap”
(Berners-Lee 1998), was published in September 1998.
1.2. From the Web of documents to the Web of data
The word “semantics”, according to Encyclopedia Britannica Online3, means “the philosophical
and scientific study of meaning”. The keyword is the word “meaning”.
The current version of the Web, implemented in the 1990s, is based on the mechanism of
linking between documents published on web servers. However, despite its universality, the mech-
anism of hyperlinks does not allow the meaning of the content to be transferred between applications.
That inability prevents computers from using Web content to automate everyday activities.
Computers simply do not understand the information they are processing and displaying, so human
involvement is needed to put the information into context and thus exchange semantics between the
systems. The same problem occurs when exchanging data between the computer systems used in
business. Different standards of storing data in applications require the use of custom-built parsers
– this increases costs and complexity and may lead to extraction errors and data inconsistency.
The Semantic Web vision envisages that computers should be able to search, understand and use
the information they process with a little help from additional data. However, there are different
ideas of what that vision involves. Passin (2004, p.3) states eight of them. The most important from
the perspective of this thesis is the vision of the Semantic Web as a distributed database. According
to Berners-Lee, Karger, Stein, Swick & Weitzner (2000), cited in Passin (2004), the Semantic
Web is meant to expose all the databases and logic rules, allowing them to interconnect and create
one large database. Information should be easily accessed, linked and understood by computers.
2. The first working draft of the RDF specification was published in October 1997. The RDF Model and Syntax
specification was released as a W3C Recommendation in February 1999.
3. Encyclopedia Britannica Online, http://www.britannica.com/eb/article-9110293/semantics. Retrieved on
2008.01.29.
Data should be connected by relations to its meaning.
That goal can be achieved by extending the existing databases with additional descriptions of data,
usually called meta data. That supplementary information enables advanced indexing and discov-
ery of decentralised information. Moreover, the searching and retrieval of information will be auto-
mated by software agents. These are dedicated applications that communicate with other services
and agents on the Web and, with the help of artificial intelligence, can provide improved results or
even follow certain deduction processes. The machine-readable data will be accessible as services
over the Web, allowing computers to discover and easily process all the required information.
What is more, the great amount of data available outside databases, e.g. static webpages,
will be made understandable to machines thanks to semantic annotations and defined vocabularies.
1.3. World Wide Web model
Today’s model of the World Wide Web is based on a few simple principles. The most basic one
assumes that when a Web document links to another, the linked document can be considered a
resource. In the Semantic Web, resources are identified using a unique Uniform Resource Identifier
(URI). In the current Web, resources such as files or web pages are identified by standardised
Uniform Resource Locators (URLs), which are a kind of URI extended with a description
of the primary access method (e.g. http:// or ftp://). The concept of the URI says that resources
may represent tangible things like files as well as non-tangible ideas or concepts, which do not
even have to exist but can be thought about. What is more, resources can be fixed or change
constantly and still be represented by the same URI.
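For illustration (not part of the original text), the relationship between a URL-style URI and its components can be sketched with Python's standard library. The URI below is a made-up example:

```python
# Illustrative sketch: splitting a URL-style URI into its components.
# The scheme names the primary access method that distinguishes a URL
# from a bare URI. The example URI is hypothetical.
from urllib.parse import urlsplit

uri = "http://example.org/towns/paisley#population"
parts = urlsplit(uri)

print(parts.scheme)    # http
print(parts.netloc)    # example.org
print(parts.path)      # /towns/paisley
print(parts.fragment)  # population
```

Note that a URI parsed this way does not have to be dereferenceable; the identifier merely names the resource.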
Over the Web, messages are sent using the HTTP protocol4, which consists of a small
set of commands and is therefore easy to implement in all kinds of network software (web servers,
browsers). Although some extensions, like cookies or the SSL/TLS encryption layer, are in use,
the original version of the protocol does not support security or transaction processing.
Another principle of the WWW is its decentralisation and scalability. Every computer connected
to the Internet can host a web server, and this makes the Web easily extendible. There is no central
authority that maintains the infrastructure. What is more, every request from client to server is
treated independently. The HTTP protocol is stateless, and this makes it possible to cache
responses and decrease network traffic.

4. Hypertext Transfer Protocol (HTTP) – the communication protocol used to transfer information between client and
server, deployed in the application layer (according to the TCP/IP model). It was originally proposed by Tim Berners-Lee
in 1989.
The Web is open – resources can be added freely. It is also incomplete, which means that
there is no guarantee that every resource is always accessible. That implies the next attribute –
inconsistency. The information published on-line does not always have to be true. It is possible
for two resources to easily contradict each other. Resources are also constantly changing. Due to
the features of the HTTP protocol and the utilisation of caching servers, it may happen that there are
two different versions of the same resource. These aspects place very serious requirements on software
agents that attempt to draw conclusions from data found on the Web.
1.4. The Semantic Web’s Foundations – the Layer Cake
The Semantic Web, as an extension of the current Web, should follow the same rules as the current
model. Accordingly, all resources should use URIs to represent objects. The Semantic Web
also refers to non-addressable resources that cannot be transferred via the network. So far that
feature has not been used, as the most popular URIs – URLs – refer to tangible documents. The
basic protocol should continue to have a small set of commands and retain no state information.
It should remain decentralised and global, and operate with inconsistent and incomplete information,
with all the advantages of caching.
The W3C, as the main organisation developing and promoting standards for the Seman-
tic Web, has created its own approach to its architecture. The first overview was presented in
Berners-Lee (1998) and it has evolved together with the development of the
technologies involved. The W3C published a diagram presenting the structure of and dependencies be-
tween them. All the technologies are shown as layers, where higher ones depend on the underlying
technologies. Each layer is specialised and tends to be more complex than the layers below. How-
ever, they can be developed and deployed relatively independently. The diagram is known as the
“Semantic Web layer cake”.
Descriptions of the layers depicted in Figure 1.2 are as follows:
Figure 1.2: Semantic Web’s “layer cake” diagram. Source: http://www.w3.org/2007/03/layerCake.png, [12.02.2008]
∙ URI/IRI — According to the Semantic Web vision, all resources should have their identifiers
encoded using URIs. The Internationalized Resource Identifier (IRI) is a generalisation
of the URI extended with support for the Universal Character Set (Unicode/ISO 10646).
∙ Extensible Markup Language (XML) — A general-purpose markup language that allows
user-defined data structures to be encoded. In the Semantic Web, XML is used as a framework
to encode data, but it provides no semantic constraints on its meaning. XML Schema is used
to specify the structure and data types used in particular XML documents. XML is a stable
technology commonly used for exchanging data. It became a W3C Recommendation in
February 1998.
∙ Resource Description Framework (RDF) — A flexible language capable of describing data
and meta data. It is used to encode a data model of resources and the relations between them
using XML syntax. RDF was introduced as a W3C Recommendation a year later than XML,
in February 1999. Semantic data models can also be serialized in alternative notations like
Turtle, N-Triples or TriX.
∙ RDF Schema (RDFS) — Used as a framework for specifying basic vocabularies in RDF
documents. RDFS is built on top of RDF and extends it with a few additional classes describ-
ing relations and properties between resources.
∙ Rule: Rule Interchange Format (RIF) — It is a family of rule languages that are used for
exchanging rules between different rule-based systems. Each RIF language is called a “di-
alect” to facilitate the use of the same syntax for similar semantics. Rules exchanged by
using RIF may depend on or can be used together with RDF and RDF Schema or OWL data
models. RIF is a relatively new initiative: the W3C’s RIF Working Group was formed in
November 2005 and first working drafts were published on 30 November 2007.
∙ Query: SPARQL — A query language designed for RDF that also includes specification
for accessing data (SPARQL Protocol) and representing the results of SPARQL queries
(SPARQL Query Results XML Format).
∙ Ontology: Web Ontology Language (OWL) — Used to define vocabularies and to specify
the relations between words and terms in particular vocabularies. RDF Schema can be
employed to construct simple ontologies; however, OWL was designed to
support advanced knowledge representation in the Semantic Web. OWL is a family of three
sublanguages: OWL-DL and OWL-Lite, based on Description Logics, and OWL-Full, which
is a complete language. All three languages are popular and used in many implementations.
OWL became a W3C Recommendation in February 2004.
∙ Logic — Logical reasoning draws conclusions from a set of data. It is responsible for apply-
ing and evaluating rules, inferring facts that are not explicitly stated, detecting contradictory
statements and combining information from distributed sources. It plays a key role in gath-
ering information in the Semantic Web.
∙ Proof — Used for explaining inference steps. It can trace the way an automated reasoner
deduces conclusions, validate them and, if needed, adjust the parameters.
∙ Trust — Responsible for the authentication of services and agents, together with providing ev-
idence for the reliability of data. This is a very important layer, as the Semantic Web will
achieve its full potential only when there is trust in its operations and in the quality of its data.
∙ Crypto — Involves the deployment of Public Key Infrastructure, which can be used to au-
thenticate documents with digital signature. It is also responsible for secure transfer of
information.
∙ User Interface and Applications — This layer encompasses tools like personal software
agents that will interact with end-users and the Semantic Web, together with Semantic Web
Services, which are able to communicate with each other to exchange data and provide
value for the users.
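As a small illustration of the RDF serialisations mentioned for the RDF layer (RDF/XML, Turtle, N-Triples), the sketch below writes a single triple in N-Triples notation. The URIs and the helper function are made up for this example, not taken from any of the implementations discussed:

```python
# Illustrative sketch: rendering one RDF triple in N-Triples notation,
# one of the alternative serialisations mentioned above. The URIs below
# are hypothetical examples.

def to_ntriples(subject, predicate, obj_literal):
    """Render a triple whose object is a plain literal as one N-Triples line."""
    return '<%s> <%s> "%s" .' % (subject, predicate, obj_literal)

line = to_ntriples(
    "http://example.org/towns/paisley",
    "http://example.org/vocab/name",
    "Paisley",
)
print(line)
# <http://example.org/towns/paisley> <http://example.org/vocab/name> "Paisley" .
```

N-Triples trades the compactness of Turtle for a line-oriented form that is trivial to generate and parse, which is why it is often used for bulk data exchange.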
The diagram in Figure 1.2 presents the most recent version of the architecture. The original archi-
tecture was single-stacked – the layers were placed one after another (except the security layer).
However, years of research on the particular technologies have shown that it is impossible to
separate the layers. Kifer, de Bruijn, Boley & Fensel (2005) discuss the interferences between
technologies, also taking into consideration technologies that were not developed by the W3C
(e.g. SWRL5, SHOE6). The conclusion is that the multi-stack architecture is a better way of show-
ing the different features of the technological basis for the rule and ontology layers.
Antoniou & van Harmelen (2004, p.17) suggest that two principles should be followed when
considering the diagram: downward compatibility and upward partial understanding. The first
one assumes that applications operating on certain layers should be aware of and able to use the
information written at lower levels. Upward partial understanding says that applications should at
least partially take advantage of information available at higher layers.
1.5. The Semantic Web – Today and in the Future
Although the Semantic Web has strong foundations in research results, not all of the technologies
presented in Figure 1.2 have yet been developed and implemented. Only the RDF(S)/XML and OWL
standards are stable and have implementations available. SPARQL and RIF appeared quite recently
and their implementations are in the development phase. The higher layers are still under research.
5. Semantic Web Rule Language (SWRL) – a proposal for a Semantic Web rule interchange language that combines
a simplified OWL Web Ontology Language (OWL DL and OWL Lite) with RuleML. The specification was created by
the National Research Council of Canada, Network Inference and Stanford University and submitted to the W3C in May 2004.
Source: http://www.w3.org/Submission/SWRL/. Retrieved on: 16.02.2008.
6. Simple HTML Ontology Extension (SHOE) – a small extension to HTML that allows machine-processable meta
data to be included in static webpages. SHOE was developed around 1996 by James Hendler and Jeff Heflin.
Source: http://www.cs.umd.edu/projects/plus/SHOE/. Retrieved on: 16.02.2008.

The existing technologies are becoming popular. There are many tutorials and books that explain
how to deploy RDF or create ontologies. Developers are working within active communities
(e.g. http://www.semanticweb.org/). There are many implementations that support the RDF model
including editors, stores for datasets and programming environments7. Some of them are commer-
cial products (e.g. Siderean’s Seamark Navigator used by Oracle Technology Network portal8),
some are being developed by Open Source communities, e.g. Sesame.
A number of vocabularies and ontologies have also been developed. Very popular vocabularies
are Dublin Core9 and Friend of a Friend10, which were created by non-commercial initiatives11.
Health care and the life sciences form a sector where the need to integrate diverse and heterogeneous
datasets prompted the creation of the first large ontologies, e.g. GeneOntology12, which describes genes
and gene product attributes, or The Protein Ontology Project13, which classifies knowledge about
proteins. Other disciplines are also developing their ontologies, like eClassOwl14, which classifies
and describes products and services for e-business, or WordNet15 – a semantic lexicon for the English
language. We can find ontologies that integrate data from the environmental sciences (e.g. climatol-
ogy, hydrology, oceanography) or are deployed in a number of e-government initiatives16. Another
source of meta data has arisen along with Web 2.0 portals known as social software. Communities
of contributors interested in particular information describe it with tags or
keywords (forming folksonomies) and publish it on-line. Although tagging offers a significant amount
of structured data, it is developed to meet different goals than ontologies, which define data more
carefully, taking into consideration relations and interactions between datasets.
Despite its wider adoption, the OWL family needs more reliable tools for modelling and applying
ontologies that might be used by non-technical users. On the other hand, we cannot
just choose any URI and search existing data stores – the data exposure revolution has not yet
happened (Shadbolt, Berners-Lee & Hall 2006).
7. The list of all implementations is available on the W3C Wiki – http://esw.w3.org/topic/SemanticWebTools.
8. Source: OTN Semantic Web (Beta), http://www.oracle.com/technology/otnsemanticweb/index.html, 2008.02.25.
9. Dublin Core Metadata Initiative, http://www.dublincore.org/
10. The Friend of a Friend (FOAF) project, http://www.foaf-project.org/
11. There are webpages where available vocabularies are listed, e.g. SchemaWeb (http://www.schemaweb.info/).
12. GeneOntology, http://www.geneontology.org/
13. The Protein Ontology Project, http://proteinontology.info/
14. eClassOwl, http://www.heppnetz.de/projects/eclassowl/
15. WordNet, http://wordnet.princeton.edu/
16. The Integrated Public Sector Vocabulary was created in the United Kingdom, http://www.esd.org.uk/standards/ipsv.
Retrieved on 1.03.2008.
According to Herman (2007b), the Semantic Web, once of interest only to academia, has already
been spotted by small businesses and start-ups. Now the idea is becoming attractive to large
corporations and administration. Major companies offer tools or systems based on the Semantic
Web concept. Adobe has created a labelling technology that allows meta data to be added to most
of its file formats17. Oracle Corporation is not only supporting RDF in its products but is also
using RDF as a base for its Press Room18. The number of companies participating in the
W3C Semantic Web Working Groups is increasing. The Corporate Semantic Web was chosen by Gart-
ner in 2006 as the top emerging technology that will improve the quality of content management,
system interoperability and information access. They predict that it will take 5 to 10 years for
Semantic Web technology to become reliable (Espiner 2006).
Although RDF and OWL are gaining popularity, there is some criticism of these technologies.
It is unclear how to extract RDF data from relational databases. It is possible to do it semi-
automatically, but current mechanisms still require a huge amount of data to be corrected manually.
There will also be an increase in the cost of preparing data if it has to be published both in a format
accessible to machines (RDF) and in a form adjusted for humans to read. The XML syntax of RDF
itself is not human-friendly. To overcome that problem the GRDDL19 mechanism was created.
It potentially allows binding between XHTML and RDF with the use of XSLT.
Another concern is censorship: as semantic data will be easily accessible, it will also be easy
to filter data or block it thoroughly. Authorities may control the creation and viewing of controver-
sial information, as its meaning will be more accessible to automated content-blocking systems.
Also, the popularity of FOAF profiles with geo-localisation will decrease users’ anonymity.
There is still a need to develop and standardise functionalities like simpler ontologies, support
for fuzzy logic and rule-based reasoning. There are some initiatives, like RIF, to regulate auto-
mated reasoning, but there is a lack of standards in that field. Different knowledge domains are
implementing different approaches to inference – the most suitable in particular cases. Also the
shape of the layers responsible for trust, proof and cryptography still remains a puzzle. Developing
ontologies is an additional challenge, as interoperability, merging and versioning remain unclear.
Antoniou & van Harmelen (2004, p.225) find the problem of ontology mapping probably
the most complicated, as there is no central control over the application of standards and technologies
when modelling ontologies in the open Semantic Web environment.

17. Extensible Metadata Platform (XMP) is supported by major Adobe products such as Adobe Acrobat, Adobe Photo-
shop and Adobe Illustrator. Adobe has also published a toolkit that allows XMP to be integrated into other applications.
The XMP Toolkit is available under the BSD licence. Source: http://www.adobe.com/products/xmp/index.html
18. Oracle Press Releases, http://pressroom.oracle.com/
19. Gleaning Resource Descriptions from Dialects of Language (GRDDL) became a W3C Recommendation on
11.09.2007, http://www.w3.org/TR/grddl/. Retrieved on 1.03.2007.
The Semantic Web vision itself has also been criticised. Even Tim Berners-Lee recently said that even
though the idea is simple, it still remains unrealised (Shadbolt et al. 2006). Walton (2006, p.109)
raises the layered model for discussion, as its present shape implies certain difficulties for the design
of software agents – providing a unified view of independent layers might be a challenge.
The Semantic Web, like the current Web, relies on the principle that people provide reliable con-
tent. Other important aspects are the fundamental design decisions and their consequences for
creating and deploying standards. Both are being fulfilled – particular communities are working
on RDF datasets and there is a broad discussion about each of the layers of the Semantic Web,
focused around the W3C Working Groups. As Shadbolt et al. (2006) say, the Semantic Web con-
tributes to Web Science, a science concerned with distributed information systems operating
on a global scale. It is encouraged by the achievements of Artificial Intelligence, data mining
and knowledge management.
2. SPARQL
2.1. RDF – data model for Semantic Web
The vision of the Semantic Web required new approach to handling data and metadata while it
came to applications. To meet the expectations, W3C in October 1997 published a working draft
for a new universal language to form a basis for the Semantic Web. The Resource Description
Framework (RDF) is providing a standard way to describe, model and exchange information about
resources. It was created as a high-level language and thanks to its low expressiveness, the data is
more reusable. RDF Model and Syntax Specification became W3C recommendation in February
1999. The current version of the specification was published in February 2004. The RDF is in
fact a data model encoded with XML-based syntax. It provides a simple mechanism to make
statements about resources. RDF has a formal semantics that is the basis for reasoning about the
meaning of an RDF dataset.
RDF statements are usually called triples, as they consist of three elements: subject (re-
source), predicate (property) and object (value). Triples are similar to simple sentences with a
subject-verb-object structure. The structure of an RDF triple can be represented as a logical for-
mula 𝑃(𝑥, 𝑦), where the binary predicate 𝑃 relates object 𝑥 to object 𝑦. Figure 2.1 depicts this struc-
ture (Passin 2004).
(town1, name, ”Paisley”) – subject, predicate and object respectively; the three elements together form the triple.
Figure 2.1: Structure of RDF triple, after Passin (2004).
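For illustration (not part of the original text), the structure of Figure 2.1 can be mirrored directly in code: triples become 3-tuples, and a naive matcher plays the role of a SPARQL-style triple pattern, with None standing for a variable. The resource names are the abbreviated examples from the figure plus hypothetical additions:

```python
# Illustrative sketch: RDF triples as (subject, predicate, object) tuples,
# with a naive matcher in the spirit of a SPARQL triple pattern.
# Resource names are abbreviated examples, not real URIs.

triples = [
    ("town1", "name", "Paisley"),
    ("town1", "country", "Scotland"),   # hypothetical extra triples
    ("town2", "name", "Glasgow"),
]

def match(pattern, data):
    """Return triples matching the pattern; None plays the role of a variable."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "What is the name of town1?" – analogous to the pattern (town1, name, ?x)
print(match(("town1", "name", None), triples))
# [('town1', 'name', 'Paisley')]
```

A SPARQL engine evaluates essentially this kind of pattern matching, though over indexed stores and with joins across several patterns rather than a linear scan.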
The subject of a triple is a resource identified by a URI. A URI reference is usually presented
in URL style, extended by a fragment identifier – the part of the URI that follows “#”1. A fragment
identifier relates to some portion of the resource. Different URI schemes and their variations are
also allowed; however, the generic syntax has to remain as defined. The whole URI should be unique,
but it does not necessarily have to enable access to the resource. A problem with URIs arises when
the names of objects are not unique – the mechanism allows anyone to make statements about any resource.
Another technique for identifying a resource is to refer to its relationships with other resources.
RDF accepts resources that are not identified by any URI. These resources are known as blank
nodes or b-nodes and are given internal identifiers, which are unique and not visible outside the
application. Blank nodes can only stand as subjects or objects in a particular triple.
Predicates are a special kind of resource, also identified by URIs, that describe relations between
subjects and objects. Objects can be named by URIs or by constant values (literals) represented by
character strings; objects are the only elements of a triple that can be represented by plain strings.
Plain literals are strings extended by an optional language tag. Literals extended by a datatype URI
are called typed literals. RDF,
unlike database systems or programming languages, does not have built-in datatypes – it relies on
the ones inherited from XML Schema2, e.g. integer, boolean or date. The use of externally defined
datatypes is allowed, but in practice not popular (Manola & Miller 2004).
The full triples notation requires that URIs are written as complete names in angle brackets.
However, many RDF applications use abbreviated forms for convenience. A full URI
reference is usually very long (e.g. <http://dbpedia.org/resource/Paisley>). It
is shortened to a prefix and a resource name (e.g. dbpedia:Paisley). The prefix is assigned to a
namespace URI. That mechanism is derived from XML syntax and is known as XML QNames3.
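The abbreviation can be sketched as two equivalent forms of the same triple:

```turtle
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Full triples notation: every URI written out in angle brackets.
<http://dbpedia.org/resource/Paisley>
    <http://www.w3.org/2000/01/rdf-schema#label> "Paisley"@en .

# The same triple with both URIs abbreviated to prefixed (QName-style) names.
dbpedia:Paisley rdfs:label "Paisley"@en .
```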
1 The Uniform Resource Identifier (URI) is defined by RFC 3986. The generic syntax is URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]. Source: http://tools.ietf.org/html/rfc3986, [05.05.2008].
2 The XML Schema datatypes are defined in the W3C Recommendation “XML Schema Part 2: Datatypes” (available at: http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/), which is a part of the specification of the XML Schema language.
3 The QNames mechanism is described in “Using Qualified Names (QNames) as Identifiers in XML Content”, available at: http://www.w3.org/2001/tag/doc/qnameids.html.
Figure 2.2 presents a number of triples serialized in RDF/XML syntax using the most basic structures.
The preamble of the listing contains the XML declaration and declares the namespaces
(QNames) used in the document. Every subject is placed in an <rdf:Description> tag,
extended by a URI placed in the rdf:about attribute. Predicates are called property elements
and are placed within the subject tag. A subject can contain one or multiple outgoing predicates.
In Figure 2.2 every subject has a number of properties. Each property states the type of relation
and gives the name of the object as an attribute. Properties can also be extended by datatype or
language attributes.
There are many methods of representing RDF statements. They can be encoded in XML syntax,
but a graph-based view is also a very popular representation. The RDF graph model is a collection
of triples represented as a graph, where subjects and objects are depicted as graph nodes and
predicates are represented by arcs directed from the subject node to the object node. An example
of an RDF graph is presented in Figure 2.3, where the triples from Figure 2.2 are transformed
into a graph. Nodes referenced by URIs are shown as ovals; literals are written
within rectangles. Every arc is labelled with the URI of the relationship. The graph-based view, due to its
simplicity, is often used for explaining the concept of a triple.
Other popular serialization formats for RDF are Notation3 (N3), JSON and Turtle. The RDF
triples from Figure 2.2 encoded in Turtle syntax are presented in Figure 2.4. In this case, the triples
are shown in the actual subject-verb-object format. Turtle syntax is very straightforward: every triple
is written on one line ended by a dot. Long URIs can be replaced by short prefix names declared
using the @prefix directive. Literals are simply extended by a language suffix or by a datatype URI.
Turtle allows some abbreviations – when more than one triple involves the same subject, it can be
stated only once, followed by a group of predicate-object pairs. A similar operation can be done
when both subject and predicate are constant.
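These abbreviations can be sketched using triples taken from Figure 2.4:

```turtle
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# ";" shares the subject between predicate-object pairs;
# "," shares both the subject and the predicate between objects.
dbpedia:Paisley
    rdfs:label "Paisley"@en , "Paisley (Szkocja)"@pl ;
    foaf:page  <http://en.wikipedia.org/wiki/Paisley> .
```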
Figure 2.3: RDF graph. Based on: DBpedia (http://www.dbpedia.org), [12.03.2008]
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia_prop: <http://dbpedia.org/property/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dbpedia:Paisley rdfs:label "Paisley"@en .
dbpedia:Paisley foaf:img
<http://upload.wikimedia.org/wikipedia/en/0/0d/RenfrewshirePaisley.png> .
dbpedia:Paisley foaf:page <http://en.wikipedia.org/wiki/Paisley> .
dbpedia:Paisley rdfs:label "Paisley (Szkocja)"@pl .
dbpedia:Paisley dbpedia_prop:reference <http://www.paisleygazette.co.uk> .
dbpedia:Paisley dbpedia_prop:latitude "55.833333"^^xsd:double .
dbpedia:Paisley dbpedia_prop:longitude "-4.433333"^^xsd:double .
dbpedia:University_of_the_West_of_Scotland
    dbpedia_prop:city dbpedia:Paisley ;
    dbpedia_prop:name "University of the West of Scotland"@en ;
    dbpedia_prop:established "1897"^^xsd:integer ;
    dbpedia_prop:country dbpedia:Scotland .
dbpedia:William_Wallace dbpedia_prop:birthPlace dbpedia:Paisley .
dbpedia:William_Wallace dbpedia_prop:death "1305-08-23"^^xsd:date .
dbpedia:William_Wallace foaf:name "William Wallace" .
Figure 2.4: RDF statements in Turtle syntax. Source: DBpedia (http://www.dbpedia.org),
[12.03.2008]
RDF has a few more interesting features. One of them is reification, which provides the possibility
to make statements about other statements. Reification of a statement can provide information
about its creator or usage. It might also be used in the process of authenticating the source
of information. Another feature is the possibility to create containers and collections of resources
that can be used for describing groups of things. Containers, according to the requirements, can be
represented by a group of resources or literals, optionally with a defined order, or by a group whose
members are alternatives to each other. A collection is also a group of elements, but it is closed –
once created, it cannot be extended by any new members.
RDF provides a simple syntax for making statements about resources. However, to define
the vocabulary that will be used in a particular dataset, there is a need for the RDF Vocabulary
Description Language, better known as RDF Schema (RDFS). RDFS provides a means for
describing classes of resources and defining their properties. In addition, a hierarchy of classes
can be built. Similarly to object-oriented programming, every resource is an instance of one or more
classes described with particular properties.
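A minimal RDFS sketch (the ex: class and property names are hypothetical, not taken from any published vocabulary):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab#> .

ex:Settlement rdf:type rdfs:Class .
ex:Town       rdf:type rdfs:Class ;
              rdfs:subClassOf ex:Settlement .   # class hierarchy
ex:name       rdf:type rdf:Property ;
              rdfs:domain ex:Settlement ;       # subjects of ex:name are Settlements
              rdfs:range  rdfs:Literal .
```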
The RDFS does not have its own syntax – it is expressed by the predefined set of RDF resources.
The resources are identified with the prefix http://www.w3.org/2000/01/rdf-schema#,
usually abbreviated to the rdfs: QName prefix. To understand the special meaning of an RDFS
graph, an application has to provide such features; otherwise it is processed as a regular RDF
graph.
Although RDF is supported by W3C, it is not the only solution for the Semantic Web.
Passin (2004, p.60) gives the example of Topic Maps as an ISO standard4 for handling semi-structured
data. Topic Maps were originally designed for creating indexes, glossaries, thesauri
and the like. However, their features made them applicable in more demanding domains. Topic
Maps are based on a concept of topics and associations between topics and their occurrences. All
structures have to be defined in ontologies of Topic Maps. The topics are represented with emphasis
on collocation and navigation – it is easier to find a particular piece of information and browse
closely related topics. Topic Maps can be applied as a pattern for organizing information. They
can be implemented using many technologies, using the native XML syntax for Topic Maps (XTM)
or even RDF. Their features make them well suited to be a part of the Semantic Web even though
they are not supported by W3C.
RDF is a language that refers directly and unambiguously to a decentralized data model, and
unlike XML it is straightforward to differentiate the information from the syntax. However, the
technology has some limitations. According to Jorge Cardoso (2006), RDF with RDFS is not able
to express the equivalence between terms defined in independent vocabularies. The cardinality and
uniqueness of terms cannot be preserved. What is more, the disjointness of terms and unions of
classes are impossible to express with the limited functionality of RDF. There is also no possibility
to negate statements. Antoniou & van Harmelen (2004, p.68) point out another limitation –
RDF uses only binary predicates, but in certain cases it would be more natural to model a
relation with more than two arguments. In addition, the concept of properties and reification
can be misleading for modellers. Finally, the XML syntax of RDF, being very flexible and
accessible for machine processing, is hardly comprehensible for humans.
Despite all the disadvantages, RDF retains a good balance between complexity and expressiveness.
What is more, it has become a de facto world standard for the Semantic Web, and is
heavily supported by W3C and developers around the world.
4 Topic Maps were developed as an ISO standard, formally known as ISO/IEC 13250:2003.
2.2. Querying the Semantic Web
2.2.1. Semantic Web as a distributed database
One of the visions of the Semantic Web says that it will provide a common way to access,
link and understand data from different sources available on-line. The Web will become a large
interlinked database. This revolutionary approach challenges the current state of knowledge in
managing data. Relational Database Management Systems (RDBMS) are currently some of the
most advanced software ever written, and they host the largest data resources in the world. Over
30 years of experience in research and implementations has resulted in the use of sophisticated mechanisms
like query optimization, clustering or retaining ACID properties5. Now the principles of
the Semantic Web imply the need to implement new technologies for managing semantic data.
The Semantic Web has its basic data model – RDF. Passin (2004, p.25) says that the RDF data model
can be compared to the relational data model. In relational databases, data is organized in tables,
where every row is identified by a primary key and has a defined structure. A collection of attributes
that forms a row is called a tuple. Every tuple can be divided into a number of RDF triples
where the primary key becomes the subject. Tuples can be transformed into triples, but the reverse
operation might not be possible. In general, the RDF data model is less structured than a database.
Every table in the relational model has a defined structure which cannot be extended6 – the data
is structured and the number of attributes (properties) is known. RDF allows adding new triples
extending the information about the resource. The triples can be partitioned between different
nodes, even ones that are not accessible. An RDBMS maintains consistency across all the data
that it manages. Walton (2006) calls this the closed-world assumption, where everything that is
not defined is false. On the contrary, in the Semantic Web false information has to be specified
explicitly or it is simply unknown – this is an open-world model. Thanks to that, RDF is more flexible.
However, such an assumption implies the possibility of inconsistency and missing information.
The results of a query vary with the availability of datasets. The returned information can be
only partial, and its size and computing time are unpredictable.
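The tuple-to-triples transformation can be sketched as follows (the table and column names are hypothetical):

```turtle
# A tuple from a hypothetical relational table Town(id, name, country):
#   (town1, "Paisley", "Scotland")
# The primary key becomes the subject; each remaining attribute
# becomes one triple.
@prefix ex: <http://example.org/> .

ex:town1 ex:name    "Paisley" .
ex:town1 ex:country "Scotland" .
```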
5 Atomicity, Consistency, Isolation, Durability (ACID) are the basic properties that should be fulfilled by a Database Management System (DBMS) to ensure that transactions are processed reliably.
6 In fact every RDBMS permits modifications of the table structure (the ALTER TABLE command), but altering the data model in such a way is not a regular operation, so it can be omitted in this comparison.
Walton (2006) claims that Semantic Web data is more network-structured than relational. In an
RDBMS, data is defined in relations between static tables. Queries are performed on a known
number of tables using set-based operations. In RDF, before querying, the dataset
has to be separated from the whole Web of constantly changing stores. The constant change of
asserted data implies that the results of queries might be incomplete or even unavailable. What
is more, Semantic Web knowledge can be represented in different syntactic forms (RDF with
RDFS, OWL), which results in extended requirements for query languages, as they have to be
aware of the underlying representation. In addition, the structure of the datasets will be unknown
to the querying engines, so they will have to rely on specified web services that will perform the
required selection on their behalf.
The Semantic Web principles put very strict constraints on the services that will manage and query
semantic data. The RDF data model ensures simplicity and flexibility so the responsibility for the
results of the queries will be borne by the query languages and automated reasoners.
2.2.2. Semantic Web queries
The new data model designed for the Semantic Web required new technologies that
would allow queries on semantic datasets. New query languages were needed to enable higher-level
application development. The inspiration came from well-established Relational Database Management
Systems and the Structured Query Language (SQL) used there for extracting relational data.
However, the relational approach could not be directly translated into the semantic data model.
The RDF data model, with its graph-like structure, blank nodes and semantics, made the problem
more complex. A query language has to understand the semantics of the RDF vocabulary to be able
to return correct information. That is why XML query languages, like XQuery or XPath, turned
out to be insufficient, as they operate on a lower level of abstraction than RDF (Figure 1.4).
To effectively support the Semantic Web, a query language should have the following properties (Haase, Broekstra, Eberhart & Volz 2004):
∙ Expressiveness — specifies how complicated the queries that can be defined in the language are. Usually the minimal requirement is to provide the means proposed by relational algebra.
∙ Closure — assumes that the result of an operation becomes a part of the data model; in the case of the RDF model, the result of a query should be in the form of a graph.
∙ Adequacy — requires that a query language working on a particular data model uses all of its concepts.
∙ Orthogonality — requires that all operations can be performed independently of the usage context.
∙ Safety — assumes that every syntactically correct query returns a definite set of results.
Query languages for RDF were developed in parallel with RDF itself. Some of them were closer to
the spirit of relational database query languages; some were more inspired by rule languages. One
of the first was rdfDB, a simple graph-matching query language that became an inspiration
for several other languages. RdfDB was designed as a part of an open-source RDF database of
the same name. One of its followers is Squish, which was designed to test some RDF query language
functionalities. Squish was announced by Libby Miller in 20017. It has several implementations,
like RDQL and Inkling8. RQL is based on a functional approach that supports generalized path expressions9.
It has a syntax derived from OQL. RQL evolved into SeRQL. RDQL is a SQL-like
language derived from Squish. It is a quite safe language that offers limited support for datatypes.
RDQL had submission status in W3C but never became a recommendation10. A different approach
was used in the XPath-like query language called Versa11, where the main building block is a
list of RDF resources. RDF triples are used in traversal operations, which return the result of the
query. Other languages include Triple12, a query and transformation language; QEL, a query-exchange
language developed as a part of the Edutella project13 that is able to work across heterogeneous repositories;
and DQL14, which is used for querying DAML+OIL knowledge bases. Triple and DQL
represent the rule-based approach.
7 RDF Squish query language and Java implementation, available at: http://ilrt.org/discovery/2001/02/squish/, [02.05.2008].
8 Inkling Architectural Overview, available at: http://ilrt.org/discovery/2001/07/inkling/index.html, [02.05.2008].
9 RQL: A Declarative Query Language for RDF, available at: http://139.91.183.30:9090/RDF/publications/www2002/www2002.html, [02.05.2008].
10 http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/
11 The specification of Versa is available at: http://copia.ogbuji.net/files/Versa.html, [02.05.2008].
12 Triple’s homepage is available at: http://triple.semanticweb.org/, [02.05.2008].
13 Edutella is a p2p network that enables other systems to search and share semantic metadata. Its homepage is available at: http://www.edutella.org/edutella.shtml, [02.05.2008].
14 The specification of DQL is available at: http://www.daml.org/2003/04/dql/dql, [02.05.2008].
The variety of RDF query languages developed by different communities resulted in compatibility
problems. What is more, according to Gutiérrez, Hurtado & Mendelzon (2004), different implementations
were using different query mechanisms that had not been a subject of formal studies,
so there were doubts that some of them might behave unpredictably. W3C was aware of all these
weaknesses. To decrease redundancy and increase interoperability between technologies, W3C
formed, in February 2004, the RDF Data Access Working Group (DAWG), which aimed to recommend
a query language that would become a worldwide standard. DAWG divided the task into two
phases. At the beginning, they wanted to define the requirements for the RDF query language.
They reviewed the existing implementations and wanted to choose a query language that would
be a starting point for the further work in the next phase. In the second phase they prepared a
formal specification together with test cases for the RDF query language (Prud’hommeaux 2004).
In October 2004, the First Working Draft of the SPARQL Query Language was published.
2.3. The SPARQL query language for RDF
DAWG worked on the SPARQL specification for more than a year. After six official Working Drafts15,
in April 2006 DAWG published a W3C Candidate Recommendation for the SPARQL Query Language
for RDF. However, the community involved in developing the new standard pointed out
several weaknesses of that version of the SPARQL specification, and it was returned to Working Draft
status in October 2006. After a few months and one more Working Draft, the specification reached
the status of Candidate Recommendation in June 2007. When the exit criteria stated in the document
were met (e.g. each SPARQL feature needed to have at least two implementations and the results
of the tests were satisfying), the specification went smoothly to the Proposed Recommendation stage in
November 2007. Finally, the SPARQL Query Language for RDF became a W3C Recommendation
on 15th January 2008.
The word SPARQL is an acronym of SPARQL Protocol and RDF Query Language (SPARQL
15 The official W3C Technical Report Development Process assumes that work on every document starts from the Working Draft. After positive feedback from the community, a Candidate Recommendation is published. When the document gathers satisfying implementation experience, it moves to Proposed Recommendation status. This mature document then awaits approval from the W3C Advisory Committee. The last stage is the W3C Recommendation, which ensures that the document is a W3C standard. Source: World Wide Web Consortium Process Document (2005).
Figure 2.5: The history of SPARQL’s specification. Based on SPARQL Query Language for RDF
(2008)
Frequently Asked Questions 2008). In fact, the SPARQL query language is closely related to two
other W3C standards: the SPARQL Protocol for RDF16 and the SPARQL Query Results XML Format17.
Although SPARQL is a W3C standard, there are twelve open issues waiting to be resolved by
DAWG.
The SPARQL query language has an SQL-like syntax. Its queries use required or optional graph
patterns and return a full subgraph that can be a basis for further processing. SPARQL uses
datatypes and language tags. Patterns can also be matched with required functional constraints.
Additional features include sorting the results, limiting their number and removing duplicates.
SPARQL does not have the complete functionality that was requested by its users. Some of the
features are being implemented as SPARQL extensions. To avoid inconsistency between implementations,
W3C keeps a list of official SPARQL extensions on its wiki18. The list contains
a number of missing features, including proposals for insert, update and delete operations for
SPARQL, subqueries and aggregation functions.
16 The SPARQL Protocol for RDF defines a remote protocol for transmitting SPARQL queries and receiving their results. It became a W3C Recommendation in January 2008. The specification is available at: http://www.w3.org/TR/rdf-sparql-protocol/.
17 The SPARQL Query Results XML Format specifies the format of the XML document representing the results of SELECT and ASK queries. It was recognized as a W3C Recommendation in January 2008. The specification is available at: http://www.w3.org/TR/rdf-sparql-XMLres/.
18 The list is available at: http://esw.w3.org/topic/SPARQL/Extensions, [06.04.2008].
2.4. Implementation model
SPARQL can be used for querying heterogeneous data sources that operate on native RDF or have
access to an RDF dataset via middleware. The model of possible implementations is presented in
Figure 2.6. Middleware in that case maps the SPARQL query into SQL, which operates on
RDF data fitted into the relational model. The main advantage of that approach is the possibility of
using the advanced features of an RDBMS and benefitting from the years of experience in managing
huge amounts of data. However, the approach still requires the semantic data to be accessible
as an RDF model. Nowadays a great amount of data is still stored in the relational model. To
make it accessible, it would have to be transformed into the RDF data model, which would be time
consuming and may not always be possible. Most current computer systems operate on
data encapsulated in the relational model, and a revolution in that approach is very unlikely. One of the
suggested solutions is the automatic transformation of relational data into the Semantic Web with
the help of Relational.OWL (de Laborda & Conrad 2005).
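As a hedged illustration of such middleware (the triples table layout is hypothetical; real mappers generate considerably more complex SQL), a simple SPARQL query could be rewritten as:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label
WHERE { <http://dbpedia.org/resource/Paisley> rdfs:label ?label }

# A middleware might translate the pattern into SQL over a generic
# triples(subject, predicate, object) table:
#
#   SELECT object AS label
#   FROM   triples
#   WHERE  subject   = 'http://dbpedia.org/resource/Paisley'
#     AND  predicate = 'http://www.w3.org/2000/01/rdf-schema#label';
```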
Figure 2.6: SPARQL implementation model. Source: Herman (2007a)
Relational.OWL is an application-independent representation format, based on the OWL language, that
describes the data stored in the relational model together with the relational schema and its semantic
interpretation. The solution consists of three layers: Relational.OWL on top, an ontology created
with Relational.OWL to represent the database schema, and the data representation on the bottom, which
is based on another ontology. It can be applied to any RDBMS. Relational data represented by
Relational.OWL is accessible like normal semantic data, so it can be queried with SPARQL. The
main advantage of such an approach is the possibility of publishing relational data in the Semantic
Web with almost no cost of transforming it to RDF. What is more, changes to the relationally stored
data, together with its schema, are automatically transferred to the semantic representation.
However, all the imperfections of the database schema affect the quality of the generated ontology.
To avoid that, Relational.OWL can be extended with an additional manual mapping, as described in
Pérez de Laborda & Conrad (2006). In that case, the possibility to generate a graph from the query
results is used. The subgraph involves the manual adjustments of the original ontology.
Such a dataset is mapped to the target ontology and is free from the drawbacks of the Relational.OWL
automatic mapping.
The technology is still under development. de Laborda & Conrad (2005) indicate only the possibility
of representing relational data as a mature feature. Further studies will be directed towards
supporting data exchange and replication.
A similar approach is found in the D2RQ language (Bizer, Cyganiak, Garbers & Maresch 2007).
This is a declarative language that describes mappings between relational data and ontologies. It
is based on RDF and formally defined by the D2RQ RDFS Schema
(http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1). The language does
not support a data modification language; the mappings are available in read-only mode. D2RQ is
part of a wider solution called the D2RQ Platform. Apart from the implementation of the language,
the Platform includes the D2RQ Engine, which translates queries into SQL, and the D2R Server,
which is an HTTP server with extended functionality including support for SPARQL.
Another interesting implementation of this approach is Automapper (Matt Fisher & Joiner 2008).
The tool is a part of a wider architecture that processes SPARQL queries over multiple data sources
and returns a combined query result. Automapper uses the D2RQ language to create a data source ontology
and a mapping instance schema, both based on a relational schema. These ontologies are used
for decomposing a semantic query at the beginning of processing and for translating SPARQL into
SQL just before executing it against the RDBMS. To decrease the number of variables and statements
used in processing a query and to improve performance, Automapper uses SWRL rules that are
based on database constraints. The solution is available in the Asio Tool Suite, a software package for
managing data created by BBN Technologies19.
The implementations mentioned above are not the only ones available. The community gathered
around MySQL is working on SPASQL20, SPARQL support built into the database. Data
integration solutions, like DartGrid or SquirrelRDF21, are also available. Finally, all-in-one
suites, like the OpenLink Virtuoso Universal Server22, can be used for querying non-RDF data stores with
SPARQL or other Semantic Web query languages.
Mapping relational databases, while having indisputable advantages, also has some limitations.
Data in an RDBMS is very often messy and does not conform to widely accepted database design
principles. To meet the expectations and provide high-quality RDF data, the mapping language has
to be very expressive. It should have a number of features, like sophisticated transformations,
conditional mappings, custom extensions and the ability to cope with data organized at different levels
of normalization.
Future users expect the data to be highly integrated and highly accessible. RDF datasets that have
a relational background are still not reliable. There is a need for studies of mechanisms for
querying multiple data sources, data source discovery and schema mapping, as the current solutions
based on RDF and OWL are insufficient.
Using a bridge between SPARQL and an RDBMS is the most demanding problem, but such applications
will seriously increase the availability of semantic data. However, as depicted in Figure 2.6,
it is not the only medium that SPARQL can query. While very powerful, RDF is a bit messy as a
technology. What is more, embedding it into XHTML is rather useless, as applications built around
HTML do not recognise it. In addition, transforming data already available in XHTML would
require a significant amount of work. To simplify the process of embedding semantic data into web
pages, W3C started to work on a set of extensions to XHTML called RDFa23. RDFa is a set of attributes
that can be used within HTML or XHTML to express semantic data (RDFa Primer 2008).
19 BBN Technologies, http://www.bbn.com/.
20 SPASQL: SPARQL Support In MySQL, http://www.w3.org/2005/05/22-SPARQL-MySQL/XTech.
21 SquirrelRDF, http://jena.sourceforge.net/SquirrelRDF/.
22 OpenLink Virtuoso Universal Server Platform, http://www.openlinksw.com/virtuoso/.
23 The first W3C Working Draft was published in March 2006. At the time of writing RDFa still has the same status – the latest Working Draft was published in March 2008.
It consists of the meta and link attributes that already exist in XHTML version 1 and a
number of new ones introduced by XHTML version 2. RDFa attributes can extend
any HTML element, placed in the document header or body, creating a mapping between the element
and the desired ontology and making it accessible as an RDF triple. The attributes do not affect the
browser’s display of the page, as the HTML and RDF are separated. The most important advantage
of RDFa is that there is no need to duplicate data by publishing it both in a human-readable format and in
machine-readable metadata. There are no standards for publishing RDFa attributes, so every publisher
can create their own. Another benefit is the simplicity of reusing the attributes and
extending already existing ones with new semantics.
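A minimal RDFa sketch (RDFa 1.0 attributes; the data follows Figure 2.4):

```xml
<!-- The about attribute sets the subject; property maps the element's
     text to a predicate, yielding the triple
     <http://dbpedia.org/resource/Paisley> rdfs:label "Paisley"@en . -->
<div xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     about="http://dbpedia.org/resource/Paisley">
  <span property="rdfs:label" xml:lang="en">Paisley</span>
</div>
```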
RDFa is in some cases very similar to microformats. However, where each microformat has a defined
syntax and vocabulary, RDFa only specifies the syntax and relies on vocabularies created by
publishers or on independent ones like FOAF or Dublin Core.
A microformat is an approach to publishing metadata about content using HTML or XHTML
with some additional attributes specific to each format. Every application that is aware of these
attributes can extract semantics from the document they are embedded in. They do not affect
other software, e.g. web browsers. There are a number of different microformats, most of them
developed by the community gathered around Microformats.org. A very popular one is XFN, which
is a way to express social relationships with the usage of hyperlinks. Other common microformats
are hCard and hCalendar, which are ways to embed information based on the vCard24 and
iCalendar25 standards in documents.
Figure 2.7: The process of transforming calendar data from XHTML extended by the hCalendar microformat into RDF triples. Source: GRDDL Primer (2007).
SPARQL is also able to query documents which have some semantic information embedded in the
content using e.g. microformats. To process a query over such a document, a SPARQL engine needs to
24 The vCard electronic business card is a common standard, defined by RFC 2426 (http://www.ietf.org/rfc/rfc2426.txt), for representing people, organizations and places.
25 iCalendar is a common format for exchanging information about events, tasks, etc., defined by RFC 2445 (http://tools.ietf.org/html/rfc2445).
know the “dialect” that was used for encoding the metadata. Being aware of that barrier, W3C started
to work on a universal mechanism for accessing semantics written in non-standard formats. At the
end of 2006, they introduced the mechanism for Gleaning Resource Descriptions from Dialects of
Languages (GRDDL). GRDDL introduces markup that indicates whether a document includes data
that complies with the RDF data model, in particular documents written in XHTML and, generally
speaking, in XML. The appropriate information is written in the header of the document. Another
markup links to the transformation algorithm for extracting semantics from the document. The
algorithm is usually available as an XSLT stylesheet. The SPARQL engine extracts the metadata
from the document, applying the transformations fetched from the relevant file, and presents the data
in the RDF data model. The process of transforming metadata encoded in a specific “dialect” into
RDF is depicted in Figure 2.7.
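The GRDDL markup can be sketched as follows (the stylesheet URL is hypothetical; the profile URI is the one defined by the GRDDL specification):

```xml
<!-- XHTML header: the profile attribute marks the document as containing
     gleanable data, and the link points to the XSLT transformation that
     extracts the RDF triples. -->
<head profile="http://www.w3.org/2003/g/data-view">
  <title>Upcoming events</title>
  <link rel="transformation" href="http://example.org/hcalendar2rdf.xsl" />
</head>
```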
SPARQL, together with some related technologies, was designed to be a unifying point for all
semantic queries. SPARQL engines will be able to serve dedicated applications and other
SPARQL endpoints, providing information that they can extract from the documents that are
directly accessible to them. Some implementations of this mechanism already exist. One of them is the
public SPARQL endpoint to DBpedia26, which is able to return data from other semantic datastores
that are linked to its dataset.
2.5. SPARQL’s syntax
SPARQL is a pattern-matching RDF query language. In most cases, a query consists of a set of triple patterns called a basic graph pattern. The patterns are similar to RDF triples, the difference being that each of the elements can be a variable. The pattern is matched against an RDF dataset. The result is a subgraph of the original dataset where all the constant elements of the patterns are matched and the variables are substituted by data from the matched triples. The pair of a variable and the RDF data matched to it is called a “binding”. The set of related bindings that forms a row in the result set is known as a “solution”.
The basic SPARQL syntax is very similar to SQL – a query starts with the SELECT clause, called the projection, which identifies the set of returned variables, and ends with the WHERE clause providing a basic graph pattern. Variables in SPARQL are indicated by the $ or ? prefixes. Similarly to the Turtle syntax, URIs
[26] The DBpedia public SPARQL endpoint is available at: http://dbpedia.org/sparql, [02.05.2008].
can be abbreviated using the PREFIX keyword and a prefix label with a definition of the namespace. If a namespace occurs in multiple places, it can be set as a base URI; relative URIs, like <property/>, are then resolved against the base URI. Triple patterns can be abbreviated in the same way as in the Turtle syntax – a common subject can be omitted using the “;” notation, and a list of objects sharing the same subject and predicate can be written in the same line separated by “,”. The query results can contain blank nodes, which are unique in the subgraph and indicated by the “_:” prefix.
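As an illustrative sketch of these abbreviations (the property names follow the DBpedia examples used in this chapter; the label literals are placeholders), a group of patterns sharing the subject ?uni can be written as:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/property/>

# Unabbreviated, the subject ?uni would be repeated in every pattern:
#   ?uni dbpedia:city ?city .
#   ?uni dbpedia:name ?uniname .
#   ?uni rdfs:label "University"@en .
#   ?uni rdfs:label "Uniwersytet"@pl .
SELECT ?city ?uniname
WHERE {
  ?uni dbpedia:city ?city ;                           # ";" repeats the subject ?uni
       dbpedia:name ?uniname ;
       rdfs:label "University"@en , "Uniwersytet"@pl . # "," repeats subject and predicate
}
```

Both forms match exactly the same triples; the abbreviation only shortens the query text.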
A simple query finding the name of the university in Paisley from the dataset presented in Figure 2.4 is shown in Figure 2.8.
BASE <http://dbpedia.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <property/>
SELECT DISTINCT ?city ?uniname
WHERE {
  ?city rdfs:label "Paisley (Szkocja)"@pl .
  ?uni dbpedia:city ?city .
  ?uni dbpedia:established "1897"^^xsd:integer .
  ?uni dbpedia:name ?uniname .
}
city uniname
http://dbpedia.org/resource/Paisley University of the West of Scotland
Figure 2.8: Simple SPARQL query with the result. Source: DBpedia (http://www.dbpedia.org),
[12.04.2008]
SPARQL has a number of different query result forms. SELECT is used for obtaining variable bindings. Another form is CONSTRUCT, which returns an RDF graph built by applying a graph template to the subgraph returned by the query. This feature can be used to create RDF subgraphs that become a basis for further processing; e.g. Relational.OWL uses it to map an automatically created ontology based on a relational schema into a desired ontology. Figure 2.9 presents the usage of the CONSTRUCT clause to build a subgraph according to a required pattern.
Two further forms are ASK and DESCRIBE. The first returns a boolean value that indicates whether the query pattern matches the RDF graph or not. The usage of the ASK clause is similar to the SELECT clause; the only difference is that no returned variables are specified. DESCRIBE is used to obtain all triples from the RDF dataset that describe the stated URI.
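A minimal sketch of both forms, shown as two separate queries (the university name and resource URI are taken from the DBpedia examples above):

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>

# ASK projects no variables -- it returns only true or false,
# depending on whether the pattern matches the dataset.
ASK { ?uni dbpedia:name "University of the West of Scotland"@en }

# DESCRIBE returns all triples the endpoint holds about the given URI.
DESCRIBE <http://dbpedia.org/resource/University_of_the_West_of_Scotland>
```

Note that the exact set of triples returned by DESCRIBE is implementation-defined; the specification leaves the notion of a “description” to the endpoint.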
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <http://dbpedia.org/property/>
CONSTRUCT { ?uni <http://dbpedia.org/property/located_in> ?city .
            ?uni <http://dbpedia.org/property/has_name> ?uniname }
WHERE {
  ?city rdfs:label "Paisley (Szkocja)"@pl .
  ?uni dbpedia:city ?city .
  ?uni dbpedia:established "1897"^^xsd:integer .
  ?uni dbpedia:name ?uniname .
}
Returned RDF subgraph serialized in Turtle:
<http://dbpedia.org/resource/University_of_the_West_of_Scotland>
<http://dbpedia.org/property/located_in> <http://dbpedia.org/resource/Paisley>;
<http://dbpedia.org/property/has_name> "University of the West of Scotland"@en.
Figure 2.9: Application of CONSTRUCT query result form with the results of the query serialized
in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.04.2008]
Every query language should provide means to filter the results returned by a generic query. SPARQL uses the FILTER clause to restrict the results by adding filtering conditions. Using conditions, SPARQL can filter string values with regular expressions defined in the XQuery 1.0 and XPath 2.0 Functions and Operators (2007) W3C specification. A subset of the functions and operators used in XPath[27] is also available – all the arithmetic and logical functions come from that language. In addition, SPARQL introduces a number of new operators, like bound(), isIRI() or lang(). All of them are described in detail in the SPARQL Query Language for RDF (2008). There is also the possibility to use external functions identified by a URI. That feature may be used to perform transformations not supported by SPARQL or to test specific datatypes.
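A hedged sketch combining a numeric comparison with a case-insensitive regular expression test (the prefixes and property names follow the DBpedia examples used throughout this chapter; the threshold is arbitrary):

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <http://dbpedia.org/property/>

SELECT ?uni ?students
WHERE {
  ?uni dbpedia:undergrad ?students ;
       dbpedia:name ?name .
  # Keep only universities with more than 10000 undergraduates...
  FILTER (xsd:integer(?students) > 10000)
  # ...whose name contains "scotland", ignoring case ("i" flag).
  FILTER (regex(str(?name), "scotland", "i"))
}
```

Each FILTER is evaluated per solution; a solution for which any condition evaluates to false (or to an error) is dropped from the result set.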
After applying filters, SPARQL returns the result of the graph pattern matching. However, the list of query solutions is in no defined order. Similarly to SQL, SPARQL provides means to modify the result set. The most basic modifier is the ORDER BY clause, which orders the solutions according to the chosen binding. The solutions can be ordered ascending, using the ASC() modifier, or descending, using the DESC() modifier.
It is common that solutions in the result set are duplicated. The keyword DISTINCT ensures that only unique solutions are returned. The REDUCED modifier has similar functionality. However,
[27] XML Path Language (XPath) is a language to address parts of an XML document. It provides the possibility to perform operations on strings, numbers or boolean values. XPath is now available in version 2.0, which has been a W3C Recommendation since January 2007. Source: XML Path Language (XPath) 2.0 (2007).
while DISTINCT ensures that duplicate solutions are eliminated, REDUCED merely permits them to be eliminated. In that case each solution occurs at least once, but no more often than without the modifier. Another two modifiers affect the number of returned solutions. The keyword LIMIT defines how many solutions will be returned. The OFFSET clause determines the number of solutions to skip before the required data is returned. The combination of these two modifiers returns a particular number of solutions starting at a defined point.
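A sketch of this pagination idiom (the property name follows the DBpedia examples above; the page size of five is arbitrary):

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>

# Return the second "page" of five solutions:
# OFFSET 5 skips the first five solutions, LIMIT 5 returns the next five.
SELECT ?name
WHERE { ?uni dbpedia:name ?name }
ORDER BY ASC(?name)
LIMIT 5
OFFSET 5
```

Note that OFFSET is only meaningful together with ORDER BY – without a defined order, the skipped solutions are arbitrary and may differ between executions.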
BASE <http://dbpedia.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <property/>
SELECT DISTINCT ?uniname ?countryname ?no_students ?no_staff ?headname
WHERE {
  {
    ?uni dbpedia:type <http://dbpedia.org/resource/Public_university> .
    ?uni dbpedia:country ?country .
    ?country rdfs:label ?countryname .
    ?uni dbpedia:undergrad ?no_students .
    ?uni dbpedia:staff ?no_staff .
    ?uni rdfs:label ?uniname .
    FILTER (xsd:integer(?no_staff) < 2000) .
    FILTER (regex(str(?country), "Scotland") || regex(str(?country), "England")) .
    FILTER (lang(?uniname) = "en")
    FILTER (lang(?countryname) = "en")
  }
  OPTIONAL
  { ?uni dbpedia:head ?headname }
}
ORDER BY DESC(?no_students)
LIMIT 5
uniname countryname no_students no_staff headname
Napier University Scotland 11685 1648
University of the West of Scotland Scotland 11395 1300 Professor Bob Beaty
University of Stirling Scotland 6905 1872 Alan Simpson
Aston University England 6505 1,000+
Heriot-Watt University Scotland 5605 717 Gavin J Gemmell
Figure 2.10: SPARQL query presenting universities with their numbers of students and staff and the optional name of the head, with some filtering applied. Below are the results of the
query. Source: DBpedia (http://www.dbpedia.org), [20.04.2008]
Supporting only basic graph patterns would in some cases be a serious limitation. SPARQL provides mechanisms to combine a number of small patterns into a more complex set of triples. The simplest one is the group graph pattern, where all stated triple patterns have to match against the given RDF dataset. A group graph pattern is presented in Figure 2.8. The result of a graph pattern match can be modified using the OPTIONAL clause. RDF data is subject to constant change, so an
assumption of full availability of the desired information is too strict. In contrast to group graph pattern matching, the OPTIONAL clause allows the result set to be extended with additional information without eliminating the whole solution if that particular information is inaccessible. When the optional graph pattern does not match, no value is returned and the binding remains empty. If there is a need to present a result set that contains a set of alternative subgraphs, SPARQL provides a way to match more than one independent graph pattern in one query. This is done by employing the UNION clause in the WHERE clause, which joins alternative graph patterns. The result consists of the sequence of solutions that match at least one graph pattern.
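The two constructs can be sketched in one query (property names follow the DBpedia examples above):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/property/>

SELECT ?uni ?label ?headname
WHERE {
  # Alternative patterns: a solution matches if the university has
  # either an rdfs:label or a dbpedia:name (or both).
  { ?uni rdfs:label ?label }
  UNION
  { ?uni dbpedia:name ?label }
  # The head is returned when present; otherwise the solution is
  # kept and ?headname simply remains unbound.
  OPTIONAL { ?uni dbpedia:head ?headname }
}
```

Without the OPTIONAL keyword, universities lacking a dbpedia:head triple would be excluded from the result set entirely.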
Finally, SPARQL can restrict the source of the data being processed. An RDF dataset always consists of at least one RDF graph, the default graph, which does not have a name. The optional graphs are called named graphs and are identified by URIs. SPARQL usually queries the whole RDF dataset, but the scope of the dataset can be limited to a number of named graphs. The RDF dataset is specified by URI using the FROM clause, which indicates the active dataset. The representation of the resource identified by the URI should contain the required graph – this can be e.g. a file with an RDF dataset or another SPARQL endpoint. If a combination of datasets is referred to by the FROM keyword, the graphs are merged to form the default RDF graph. To query a graph without adding it to the default graph, the graph should be referred to by the FROM NAMED clause. In that case the relation between the RDF dataset and the named graph is indirect; the named graph remains independent of the default graph. To switch between the active graphs SPARQL uses the GRAPH clause. Only triple patterns stated inside the clause are matched against the active graph; outside the clause, triple patterns are matched against the default graph. The GRAPH clause is very powerful: it can be used not only to provide solutions from specific graphs, but also to find the right graph containing the desired solution.
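A sketch of the dataset clauses described above (the two graph URIs are hypothetical placeholders; the property name follows the DBpedia examples):

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>

# The FROM graph is merged into the default graph; the FROM NAMED
# graph stays separate and is reachable only through GRAPH.
SELECT ?g ?name
FROM <http://example.org/universities.rdf>
FROM NAMED <http://example.org/archive.rdf>
WHERE {
  # Matched against the default graph:
  { ?uni dbpedia:name ?name }
  UNION
  # Matched against each named graph; ?g is bound to the graph's URI,
  # which identifies where the solution was found:
  { GRAPH ?g { ?uni dbpedia:name ?name } }
}
```

Binding a variable in the GRAPH clause, as with ?g here, is what makes it possible to discover which graph contains the desired solution.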
SPARQL is a technology the whole community was waiting for. Its official specification regulates access to RDF datastores, which should increase the popularity of the whole concept and cause SPARQL to be regarded not just as a technology for academia, but as a stable solution worth implementing in common data access tools.
However, the current specification of SPARQL does not fully meet the requirements. The community has pointed out the lack of data modification functions as one of the most serious issues. Another problem is the inability to use cursors, caused by the stateless character of the protocol. SPARQL does not allow computing or aggregating results; this has to be done by external modules. What is more, querying collections and containers may be complicated, which may be especially inconvenient while processing OWL ontologies. Finally, the lack of support for full-text searching is quite problematic.
Apart from that, SPARQL is a significant step on the way to the Semantic Web, and also a starting point for research on the higher layers of the Semantic Web “layer cake” diagram. However, there is room for improvement and further research, and the W3C should consider starting work on the next version of the SPARQL Query Language.
2.6. Review of Literature about SPARQL
SPARQL Query Language for RDF is a relatively new technology. Indisputably it is gaining popularity within the Semantic Web community, but there is still little research on the language itself and its implementability. Google Scholar returns only 2030 search results for the word “sparql”. This is almost nothing compared to the number of search results for the word “rdf” – 237000, or for documents related to “semantic web” – 344000[28]. Google Scholar is not an objective source of knowledge – the number of results may vary depending on the date and on whether a local version of the search engine is used. However, it shows how big the difference in popularity is between the stable RDF and the brand-new SPARQL. What is more, the number of publications in which the SPARQL query language and its implementation issues are under research is very small. Usually SPARQL appears in the context of a complex architecture implemented to solve a particular problem with the means provided by the Semantic Web.
The first complete study of the requirements that a semantic query language has to meet was done in “Foundations of Semantic Web Databases” (Gutiérrez et al. 2004). According to the paper, the new features of RDF, like blank nodes, reification, redundancy and RDFS with its vocabulary, need a new approach to queries in comparison to relational databases. The authors first propose the notion of a normal form for RDF graphs. The notion is a combination of core and closed graphs. A core graph is one that cannot be mapped onto a proper subgraph of itself. An RDFS vocabulary together with all the triples it applies to is called a closed graph. The problem is the redundancy of triples. The authors describe an algorithm that allows reduction of the graph. Even so, computing the normal
[28] The test was performed using http://scholar.google.pl on 6.05.2008.
and reduced forms of the graph is still very difficult. On this theoretical background, a formal definition of an RDF query language is given. A query is a set of graphs considered within a set of premises, with some of the elements replaced by variables limited by a number of constraints. The answer to a query is a separate and unique graph. A very important property that every query language should have is the possibility to compose complex queries from the results of simpler ones (compositionality). A union or merge of single answers can achieve this. In the first case, the existing blank nodes keep their unique names, while in merging the result sets the names of the blank nodes have to be changed. The union operation is more straightforward and can create data-independent queries. The merge operator is more useful for querying several sources. Finally, the authors discuss the complexity of answering queries.
Similar theoretical deliberations on a semantic query language can be found in “Semantics and Complexity of SPARQL” (Perez, Arenas & Gutierrez 2006a). This time, however, the authors start from the RDF formalization done in Gutiérrez et al. (2004) to examine the graph pattern facility provided by SPARQL. Although the features of SPARQL seem to be straightforward, in combination they create increased complexity. According to the authors, SPARQL shares a number of constructs with other semantic query languages. However, there was still a need to formalize the semantics and syntax of SPARQL. The authors consider the graph pattern matching facility limited to one RDF dataset. They start by defining the syntax of a graph pattern expression as a set of graph patterns related to each other by the AND, UNION and OPTIONAL operators and limited by a FILTER expression. Then they define the semantics of the query language. It turns out that the operators UNION and OPTIONAL make the evaluation of the query more complex. There are two approaches to computing answers to graph patterns. The first one uses operational semantics, which means that the graphs are matched one after another using intermediate results from the preceding matchings to decrease the overall cost. The second approach is based on bottom-up evaluation of the parse tree, which minimizes the cost of the operation using relational algebra. Relational algebra can be easily applied to SPARQL; however, there are some discrepancies. The lack of constraints in SPARQL makes the OPTIONAL operator not fully equal to its relational counterpart – the left outer join. Further issues are null-rejecting relations, which are impossible in SPARQL, and the Cartesian product, which is often used in SPARQL. Finally, the authors state the normal form of an optional triple pattern that should be followed to design cost-effective queries. It assumes that all patterns outside the optional part should be evaluated before matching the optional patterns.
Similar conclusions are drawn while evaluating graph patterns with relational algebra in Cyganiak (2005b).
The authors of Perez et al. (2006a) continue their studies on the semantics of SPARQL in “Semantics of SPARQL” (Perez, Arenas & Gutierrez 2006b). The goal of this technical report was to update the original publication with the changes introduced by the W3C Working Draft published in October 2006. The authors extend the definitions of graph patterns stated in the previous paper and discuss the support for blank nodes in graph patterns and bag/multiset semantics for solutions. At the beginning, the authors state the basic definitions of RDF and basic graph patterns. Then they define the syntax and semantics of general graph patterns. They also include the GRAPH operator, which selects the graph that is matched against the query. Another extension to Perez et al. (2006a) is the semantics of the query result forms; the SELECT and CONSTRUCT clauses are also discussed. Finally, the definition of graph patterns is extended with support for blank nodes and bags. The main problem the authors indicate is the increased cardinality of the solutions. They finish the report with two remarks about query entailment, which was not fully defined at the time of writing.
The author of “A relational algebra for SPARQL” (Cyganiak 2005b) does not focus on a generic definition of SPARQL queries. He transforms SPARQL into relational algebra, an intermediate language for the evaluation of queries that is widely used for analysing queries on the relational model. Such an approach has significant advantages – it provides knowledge about query optimization for SPARQL implementers, makes SPARQL support in relational databases more straightforward and simplifies further analyses of queries over distributed data sources. The author presents only queries over basic graph patterns. Some special cases are also considered; however, the filtering operator still requires further research.
At the beginning the author assumes that an RDF graph can be presented as a relational table with three columns corresponding to ?subject, ?predicate and ?object. Each triple is stored as a separate record. A new term is also introduced: an RDF tuple, an example of which is presented in Figure 2.11, is a container that maps a number of variables to RDF terms and is also known as an RDF solution. Tuple is a universal term used in relational algebra. Every variable present in a tuple is said to be bound. A set of tuples forms an RDF relation. The relations can be transformed into triples and form a dataset.