Semantic Web:
Comparison of SPARQL
implementations
Rafał Małanij
Mat.No: B0105363
Thesis Project in partial fulfilment of the requirements for the Master’s Degree
in Advanced Computer Systems Development.
University of the West of Scotland
School of Computing
29th September 2008
Abstract
The Semantic Web is a revolutionary approach to publishing data on the Internet, proposed years
ago by Tim Berners-Lee. Unfortunately, the deployment of the idea has proved more complex than
was assumed. Although the data model for the concept has long been well established, a query
language has been announced only recently. The specification of SPARQL was a milestone on the
way to fulfilling the vision, but implementation attempts show that there is a need for further
research in the area. Some products are already available. This thesis evaluates five of them using
a data set based on DBpedia.org. First, each of the packages is described, taking into consideration
its documentation, architecture and usability. The second part tests the ability to load a significant
amount of data efficiently and then to compute, in reasonable time, the results of sample queries
covering the most important structures of the language. The conclusion shows that although some
of the packages appear to be very advanced and complex products, they still have problems
processing queries based on the basic specification. The Semantic Web and its key technologies
are very promising, but they need more stable implementations to become popular.
Contents
Table of contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1. Semantic Web. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1. Origins of the Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2. From the Web of documents to the Web of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3. World Wide Web model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4. The Semantic Web’s Foundations – the Layer Cake . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5. The Semantic Web – Today and in the Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2. SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1. RDF – data model for Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2. Querying the Semantic Web. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1. Semantic Web as a distributed database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.2. Semantic Web queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3. The SPARQL query language for RDF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4. Implementation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5. SPARQL’s syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6. Review of Literature about SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3. The implementations of SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1. Testing methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.1. DBpedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1.2. Ontology and test queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2. OpenRDF Sesame 2.1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.2.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3. OpenLink Virtuoso 5.0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.3.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.3.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4. Jena Semantic Web Framework 2.5.5 with ARQ 2.2, SDB 1.1 and Joseki 3.2. . . . . . 93
3.4.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.4.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.4.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.5. Pyrrho DBMS 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.5.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.5.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.5.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.5.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.6. AllegroGraph RDFStore 3.0.1 Lisp Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.6.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.6.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.6.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.6.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
List of Figures
1.1. W3C’s Semantic Web Logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2. Semantic Web’s “layer cake” diagram. Source: http://www.w3.org/2007/03/layerCake.png,
[12.02.2008]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1. Structure of RDF triple, after Passin (2004). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2. RDF statements. Source: DBpedia (http://www.dbpedia.org), RDF/XML vali-
dated by http://www.rdfabout.com/demo/validator/validate.xpd, [12.03.2008] . . . . 22
2.3. RDF graph. Based on: DBpedia (http://www.dbpedia.org), [12.03.2008] . . . . . . . . . 24
2.4. RDF statements in Turtle syntax. Source: DBpedia (http://www.dbpedia.org),
[12.03.2008] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5. The history of SPARQL’s specification. Based on SPARQL Query Language for
RDF (2008) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6. SPARQL implementation model. Source: Herman (2007a) . . . . . . . . . . . . . . . . . . . . . 32
2.7. The process of transforming calendar data from XHTML extended by hCalendar
microformat into RDF triples. Source: GRDDL Primer (2007). . . . . . . . . . . . . . . . . . 35
2.8. Simple SPARQL query with the result. Source: DBpedia (http://www.dbpedia.org),
[12.04.2008] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.9. Application of CONSTRUCT query result form with the results of the query seri-
alized in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.04.2008] . . 38
2.10. SPARQL query presenting universities with their number of students, number of
staff and, optionally, the name of the headmaster, with some filtering applied. Below
are the results of the query. Source: DBpedia (http://www.dbpedia.org), [20.04.2008] . 39
2.11. Structure of RDF tuple, after Cyganiak (2005b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.12. Selection (𝜎) and projection (𝜋) operators, after Cyganiak (2005b). . . . . . . . . . . . . . . 44
2.13. SPARQL query transformed into relational algebra tree, after Cyganiak (2005b). . . 45
3.1. The status of datasets interlinked by the Linking Open Data project. Source:
http://richard.cyganiak.de/2007/10/lod/lod-datasets/, [12.06.2008]. . . . . . . . . . . . . . . 57
3.2. Querying on-line DBpedia SPARQL endpoint with Twinkle. . . . . . . . . . . . . . . . . . . . 61
3.3. Query testing full-text searching capabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4. Selective query with UNION clause. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5. Query with numerous selective joins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6. Query with nested OPTIONAL clauses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7. CONSTRUCT clause creating new graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8. ASK query that evaluates the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.9. Query returning all available triples for the particular resource. . . . . . . . . . . . . . . . . . 65
3.10. Two versions of GRAPH queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.11. Architecture of Sesame. Source: User Guide for Sesame 2.1 (2008). . . . . . . . . . . . . . 68
3.12. The interface of Sesame Server.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.13. Sesame Console with a list of available repositories. . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.14. Sesame Workbench – exploring the resources in the repository based on a native
storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.15. Graph comparing loading times for OpenRDF Sesame using different storages. . . . 76
3.16. Graph comparing execution times of testing queries against different repositories. . 79
3.17. Architecture of Virtuoso Universal Server. Source: Openlink Software (2008). . . . . . 83
3.18. OpenLink Virtuoso Conductor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.19. OpenLink Virtuoso’s SPARQL endpoint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.20. Interactive SPARQL endpoint with visualisation of one of the test queries. . . . . . . . . 87
3.21. Architecture of Jena Semantic Web Framework version 2.5.5. Source: Wilkinson,
Sayers, Kuno & Reynolds (2004). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.22. Graph comparing loading times for SDB using different backends. . . . . . . . . . . . . . 99
3.23. Graph comparing average loading times for SDB using different backends. . . . . . . . 103
3.24. Querying SDB repository using command line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.25. Joseki’s SPARQL endpoint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.26. Architecture of Pyrrho DB. Source: Crowe (2007). . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.27. Evaluation of the first test query against Pyrrho DBMS using provided RDF client. 113
3.28. Pyrrho Database Manager showing local database sparql with the data stored in
Rdf$ table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.29. High-level class diagram of AllegroGraph. Source: AllegroGraph RDFStore (2008). . 119
3.30. The process of loading AllegroGraph server and querying a repository using Alle-
gro CL environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.31. Graph comparing average loading times for the best performing configurations. . . . . 133
List of Tables
3.1. Summary of loading data into OpenRDF Sesame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.2. Summary of evaluating test queries on OpenRDF Sesame. . . . . . . . . . . . . . . . . . . . . . 78
3.3. Summary of loading data into OpenLink Virtuoso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4. Summary of evaluating test queries on OpenLink Virtuoso. . . . . . . . . . . . . . . . . . . . . 90
3.5. Summary of loading data using SDB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.6. Summary of evaluating test queries on repositories managed by SDB.. . . . . . . . . . . . 106
3.7. Summary of evaluating test queries against Pyrrho Professional. . . . . . . . . . . . . . . . . 116
3.8. Summary of loading data into AllegroGraph repository. . . . . . . . . . . . . . . . . . . . . . . . 123
3.9. Summary of evaluating test queries on AllegroGraph RDFStore. . . . . . . . . . . . . . . . . 125
3.10. Summary of loading data into tested implementations – configurations that had
the best performance for each implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.11. Summary of performing test queries – configurations that had the best perfor-
mance for each implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Introduction
In the late 1980s the Internet was becoming internationally established. However, retrieving
information from remote computer systems was a challenge due to the lack of a unified protocol
for accessing information. At the same time Tim Berners-Lee, a physicist at the CERN laboratory
in Switzerland, started to work on a protocol that would allow easier access to information distributed
over many computers. In 1989, with help from Robert Cailliau, Tim Berners-Lee published a
proposal for the new service - the World Wide Web. That was the beginning of a revolution. Within
a few years the WWW became the most popular service on the Internet.
In 1994 Tim Berners-Lee founded the World Wide Web Consortium (W3C), which started to work
on standardising the technologies that were to extend the functionality of the WWW. That was the
time when webpages became dynamic, but the “golden years” were yet to come. The WWW was
spotted by the business community and the revolution spread around the world.
Now we can truly say that hyperlinks have revolutionised our lives - the way we publish
information and media, the way we buy and sell goods, the way we communicate. Almost everybody in
developed countries has a personal email address and treats the Internet as a regular tool that helps
in everyday life. We can undoubtedly agree that the Internet is one of the pillars of the revolution
that is transforming the developed world into a knowledge-driven society.
However, some visionaries claim that this is not yet a Web of data and information. The meaning
of today’s Web content is accessible only to humans. Although search engines have become very
powerful tools, the quality of their results is relatively low. What is more, the results contain
only links to webpages where the information may possibly be found. Users still play the main
role in processing information published on the Internet.
Tim Berners-Lee was aware of all the imperfections of the Web. At the end of the 1990s he
proposed an extension to the current Web that he called the Semantic Web. Specialists announced
a revolution – Web 3.0. However, the implementation of that vision turned out to be more complex
than expected. The revolution was replaced by evolution.
In this thesis I will focus on one aspect of the Semantic Web – handling semantic data. First,
the vision of the Semantic Web along with its basic technologies will be presented. Then I will
examine what expectations the Semantic Web’s foundations place on the technologies that will
be responsible for accessing data on the Web. In the following chapter the W3C’s approach, the
SPARQL query language, will be presented together with a short introduction to the semantic data
model and the problem of querying the Semantic Web. SPARQL will be discussed in detail,
including its syntax, the implementation models and a review of the available literature about the
technology. The practical part of the research will involve a review of a number of available
implementations of SPARQL, which will be subjected to some basic usability tests. First
the methodology will be presented together with a description of the data set used for testing. Then
each of the examined implementations will be reviewed and tested, and the findings presented. Finally
the implementations will be compared where possible and conclusions will be drawn.
1. Semantic Web
“The Semantic Web is not a separate Web
but an extension of the current one,
in which information is given well-defined meaning,
better enabling computers and people to work in cooperation.”
(Berners-Lee, Hendler & Lassila 2001)
1.1. Origins of the Semantic Web
The above quotation comes from one of the best known articles about the Semantic Web1 – “The
Semantic Web” published in 2001 in Scientific American. It is considered the initiator
of the “semantic revolution” in IT. In fact, due to its popularity, a worldwide discussion has
Tim Berners-Lee earlier in his book, “Weaving the Web: Origins and Future of the World Wide
Web” (Berners-Lee & Fischetti 1999).
Figure 1.1: W3C’s Semantic Web Logo
From the very beginning he was thinking about the Web as a universal network, where documents
would be connected to each other by their meaning in a way that enables automatic processing
of information. In “Weaving the Web” he not only summarised his work on developing the Web
into its current form, but also tried to answer questions about the future of the Web.
1 Google Scholar finds it cited in 5304 articles, which gives it first place for the search phrase “semantic web”.
Source: http://scholar.google.co.uk/scholar?hl=en&lr=&q=semantic++web&btnG=Search. Retrieved on 2008.01.29.
Even before his article in Scientific American, Tim Berners-Lee and scientists gathered around
the World Wide Web Consortium (W3C) started to work on technologies that would form the basis
for the future Semantic Web2. They presented the vision in numerous lectures around the world
and supported initiatives for deploying these technologies in specific knowledge areas. The first
document where the ideas about the architecture were described, “Semantic Web Roadmap”
(Berners-Lee 1998), was published in September 1998.
1.2. From the Web of documents to the Web of data
The word “semantics”, according to Encyclopedia Britannica Online3, means “the philosophical
and scientific study of meaning”. The key word here is “meaning”.
The current version of the Web, implemented in the 1990s, is based on the mechanism of linking
between documents published on web servers. However, despite its universality, the mechanism
of hyperlinks does not allow the meaning of the content to be transferred between applications.
That inability prevents computers from using Web content to automate everyday activities.
Computers simply do not understand the information they are processing and displaying, so human
involvement is needed to put the information into context and thus exchange semantics between the
systems. The same problem occurs when exchanging data between the computer systems used in
business. Different standards of storing data in applications require the use of custom-built parsers
– this increases costs and complexity and may lead to extraction errors and data inconsistency.
The Semantic Web vision envisages that computers should be able to search, understand and use
the information they process with a little help from additional data. However, there are different
ideas about what that vision involves. Passin (2004, p.3) lists eight of them. The most important
from the perspective of this thesis is the vision of the Semantic Web as a distributed database.
According to Berners-Lee, Karger, Stein, Swick & Weitzner (2000), cited in Passin (2004), the
Semantic Web is meant to expose all the databases and logic rules, allowing them to interconnect
and create one large database. Information should be easily accessed, linked and understood by computers.
2 The first working draft of the RDF specification was published in October 1997. The RDF Model and Syntax
specification was released as a W3C Recommendation in February 1999.
3 Encyclopedia Britannica Online, http://www.britannica.com/eb/article-9110293/semantics. Retrieved on
2008.01.29.
Data should be connected by relations to its meaning.
That goal can be achieved by extending the existing databases with additional descriptions of data,
usually called metadata. That supplementary information enables advanced indexing and discovery
of decentralised information. Moreover, searching and retrieval of information will be automated
by software agents. These are dedicated applications that communicate with other services
and agents on the Web and, with the help of artificial intelligence, can provide improved results or
even follow certain deduction processes. The machine-readable data will be accessible as services
over the Web, which will allow computers to discover and easily process all the required information.
What is more, the great amount of data that is available outside databases, e.g. on static webpages,
will be understandable by machines thanks to semantic annotations and defined vocabularies.
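Such metadata can be sketched in RDF. The following hypothetical fragment, written in the Turtle notation discussed later in this thesis, annotates an ordinary webpage with machine-readable descriptions (the example.org URIs and the described article are illustrative; dc: and foaf: are commonly used vocabularies):

```turtle
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# The webpage itself becomes a resource that software agents can reason about.
<http://example.org/articles/semantic-web>
    dc:title   "An Introduction to the Semantic Web" ;
    dc:creator <http://example.org/people/alice> ;
    dc:date    "2008-05-12" .

# The author is a separate resource, linked by meaning rather than by a plain hyperlink.
<http://example.org/people/alice>
    foaf:name "Alice Smith" .
```

An agent encountering either resource can follow the dc:creator relation in both directions, something a plain hyperlink does not express.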
1.3. World Wide Web model
Today’s model of the World Wide Web is based on a few simple principles. The most basic one
assumes that when a Web document links to another, the linked document can be considered a
resource. In the Semantic Web, resources are identified using unique Uniform Resource Identifiers
(URIs). In the current Web, resources such as files or web pages are identified by standardised
Uniform Resource Locators (URLs), which are a kind of URI extended with a description of the
primary access method (e.g. http:// or ftp://). The concept of the URI says that resources
may represent tangible things like files as well as non-tangible ideas or concepts, which do not
even have to exist but can be thought about. What is more, a resource can be fixed or change
constantly and still be represented by the same URI.
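The relation between a URL and a generic URI can be illustrated by splitting a URL into its components; a minimal Python sketch, using a hypothetical example.org address:

```python
from urllib.parse import urlparse

# A URL is a URI whose first component names the primary access method.
url = "http://example.org/resources/glasgow#city"
parts = urlparse(url)

print(parts.scheme)    # access method: "http"
print(parts.netloc)    # server that resolves the resource: "example.org"
print(parts.path)      # path of the resource being identified: "/resources/glasgow"
print(parts.fragment)  # fragment identifier: "city"
```

Stripping the scheme-specific parts leaves a pure identifier, which is how the Semantic Web can also name non-addressable resources.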
Over the Web, messages are sent using the HTTP protocol4, which consists of a small set of
commands and is therefore easy to implement in all kinds of network software (web servers,
browsers). Although some extensions, like cookies or the SSL/TLS encryption layer, are in use,
the original version of the protocol does not support security or transaction processing.
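As a sketch of that small command set, a minimal, hypothetical HTTP exchange requesting an RDF description of a resource could look like this (the host and path are illustrative):

```http
GET /resources/glasgow HTTP/1.1
Host: example.org
Accept: application/rdf+xml

HTTP/1.1 200 OK
Content-Type: application/rdf+xml
```

The whole exchange fits in a handful of header lines, and nothing in it requires the server to remember any previous request.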
Another principle of the WWW is its decentralisation and scalability. Every computer connected
to the Internet can host a web server, and this makes the Web easily extendible. There is no central
4 Hypertext Transfer Protocol (HTTP) – a communication protocol used to transfer information between client and
server, deployed in the application layer (according to the TCP/IP model). It was originally proposed by Tim
Berners-Lee in 1989.
authority that maintains the infrastructure. What is more, every request from client to server is
treated independently. The HTTP protocol is stateless, and this makes it possible to cache the
responses and decrease network traffic.
The Web is open – resources can be added freely. It is also incomplete, which means that
there is no guarantee that every resource is always accessible. That implies the next attribute –
inconsistency. The information published on-line does not always have to be true. It is possible
for two resources to easily contradict each other. Resources are also constantly changing. Due to
the features of the HTTP protocol and the use of caching servers it may happen that two different
versions of the same resource exist. These aspects raise very serious requirements on software
agents that attempt to draw conclusions from data found on the Web.
1.4. The Semantic Web’s Foundations – the Layer Cake
The Semantic Web, as an extension of the current Web, should follow the same rules as the current
model. Accordingly, all resources should use URIs to represent objects. The Semantic Web also
refers to non-addressable resources that cannot be transferred via the network. Until now that
feature has not been used, as the most popular URIs – URLs – referred to tangible documents. The
basic protocol should continue to have a small set of commands and retain no state information.
It should remain decentralised and global, and operate with inconsistent and incomplete information,
with all the advantages of caching of information.
The W3C, as the main organisation developing and promoting standards for the Semantic Web,
has created its own approach to the architecture. The first overview was presented in
Berners-Lee (1998) and it has been evolving together with the development of the technologies
involved. The W3C published a diagram presenting the structure of, and dependencies between,
these technologies. All of them are shown as layers, where higher ones depend on the underlying
technologies. Each layer is specialised and tends to be more complex than the layers below. However,
they can be developed and deployed relatively independently. The diagram is known as the
“Semantic Web layer cake”.
Descriptions of the layers depicted in Figure 1.2 are as follows:
Figure 1.2: Semantic Web’s “layer cake” diagram. Source:
http://www.w3.org/2007/03/layerCake.png, [12.02.2008]
∙ URI/IRI — According to the Semantic Web vision, all resources should have their identifiers
encoded using URIs. The Internationalized Resource Identifier (IRI) is a generalisation
of the URI, extended with support for the Universal Character Set (Unicode/ISO 10646).
∙ Extensible Markup Language (XML) — A general-purpose markup language that allows
user-defined data structures to be encoded. In the Semantic Web, XML is used as a framework
to encode data but provides no semantic constraints on its meaning. XML Schema is used
to specify the structure and data types used in particular XML documents. XML is a stable
technology commonly used for exchanging data. It became a W3C Recommendation in
February 1998.
∙ Resource Description Framework (RDF) — A flexible language capable of describing data
and metadata. It is used to encode a data model of resources and the relations between them
using XML syntax. RDF was introduced as a W3C Recommendation a year later than XML,
in February 1999. Semantic data models can also be serialized in alternative notations such
as Turtle, N-Triples or TriX.
∙ RDF Schema (RDFS) — Used as a framework for specifying basic vocabularies in RDF
documents. RDFS is built on top of RDF, extending it with a few additional classes describing
relations and properties between resources.
∙ Rule: Rule Interchange Format (RIF) — A family of rule languages used for exchanging
rules between different rule-based systems. Each RIF language is called a “dialect”, to
facilitate the use of the same syntax for similar semantics. Rules exchanged using RIF
may depend on, or can be used together with, RDF and RDF Schema or OWL data
models. RIF is a relatively new initiative: the W3C’s RIF Working Group was formed in
November 2005 and the first working drafts were published on 30 November 2007.
∙ Query: SPARQL — A query language designed for RDF, whose specification also covers
accessing data (the SPARQL Protocol) and representing the results of SPARQL queries
(the SPARQL Query Results XML Format).
∙ Ontology: Web Ontology Language (OWL) — Used to define vocabularies and to specify
the relations between words and terms in particular vocabularies. RDF Schema can be
employed to construct simple ontologies; however, OWL is the language designed to
support advanced knowledge representation in the Semantic Web. OWL is a family of three
sublanguages: OWL-DL and OWL-Lite, based on Description Logics, and OWL-Full, which
is a complete language. All three languages are popular and used in many implementations.
OWL became a W3C Recommendation in February 2004.
∙ Logic — Logical reasoning draws conclusions from a set of data. It is responsible for apply-
ing and evaluating rules, inferring facts that are not explicitly stated, detecting contradictory
statements and combining information from distributed sources. It plays a key role in
gathering information in the Semantic Web.
∙ Proof — Used for explaining inference steps. It can trace the way the automated reasoner
deduces conclusions, validate them and, if needed, adjust the parameters.
∙ Trust — Responsible for the authentication of services and agents, together with providing
evidence for the reliability of data. This is a very important layer, as the Semantic Web will
achieve its full potential only when there is trust in its operations and in the quality of its data.
∙ Crypto — Involves the deployment of a Public Key Infrastructure, which can be used to
authenticate documents with digital signatures. It is also responsible for the secure transfer of
information.
∙ User Interface and Applications — This layer encompasses tools like personal software
agents that will interact with end-users and the Semantic Web, together with Semantic Web
Services, which are able to communicate with each other to exchange data and provide
value for the users.
The diagram in Figure 1.4 presents the most recent version of the architecture. The original architecture
was single-stacked – the layers were placed one after another (except the security layer).
However, the years of research on the particular technologies have shown that it is impossible to
separate the layers. Kifer, de Bruijn, Boley & Fensel (2005) discuss the interferences between
technologies, also taking into consideration technologies that were not developed by W3C
(e.g. SWRL5, SHOE6). The conclusion is that the multi-stack architecture is a better way of showing
the different features of the technological basis for the rule and ontology layers.
Antoniou & van Harmelen (2004, p.17) suggest that two principles should be followed when
considering the diagram: downward compatibility and upward partial understanding. The first
one assumes that applications operating on certain layers should be aware of and able to use the
information written at lower levels. Upward partial understanding says that applications should at
least partially take advantage of information available at higher layers.
1.5. The Semantic Web – Today and in the Future
Although the Semantic Web has strong foundations in research results, not all of the technologies
presented in Figure 1.4 have been developed and implemented yet. Only the RDF(S)/XML and OWL
standards are stable and have implementations available. SPARQL and RIF appeared quite recently
and their implementations are in the development phase. The higher layers are still under research.
The existing technologies are becoming popular. There are many tutorials and books that explain
5 Semantic Web Rule Language (SWRL) – a proposal for a Semantic Web rules interchange language that combines
a simplified OWL Web Ontology Language (OWL DL and OWL Lite) with RuleML. The specification was created by
the National Research Council of Canada, Network Inference and Stanford University and submitted to W3C in May 2004.
Source: http://www.w3.org/Submission/SWRL/. Retrieved on: 16.02.2008.
6 Simple HTML Ontology Extension (SHOE) – a small extension to HTML that allows machine-processable
metadata to be included in static webpages. SHOE was developed around 1996 by James Hendler and Jeff Heflin.
Source: http://www.cs.umd.edu/projects/plus/SHOE/. Retrieved on: 16.02.2008.
how to deploy RDF or create ontologies. Developers are working within active communities
(e.g. http://www.semanticweb.org/). There are many implementations that support the RDF model,
including editors, stores for datasets and programming environments7. Some of them are commercial
products (e.g. Siderean’s Seamark Navigator used by the Oracle Technology Network portal8),
some are being developed by Open Source communities, e.g. Sesame.
Also a number of vocabularies and ontologies have been developed. Very popular vocabularies
are Dublin Core9 and Friend of a Friend10, which were created by non-commercial initiatives11.
Health care and life sciences is a sector where the need for integrating diverse and heterogeneous
datasets evoked the creation of the first large ontologies, e.g. GeneOntology12, which describes genes
and gene product attributes, or The Protein Ontology Project13, which classifies knowledge about
proteins. Other disciplines are also developing their ontologies, like eClassOwl14, which classifies
and describes products and services for e-business, or WordNet15 – a semantic lexicon for the English
language. We can find ontologies that integrate data from environmental sciences (e.g. climatology,
hydrology, oceanography) or are deployed in a number of e-government initiatives16. Another
source of metadata has arisen along with Web 2.0 portals known as social software. Communities
of contributors (folksonomies) interested in particular information describe it with tags or
keywords and publish it on-line. Although tagging offers a significant amount of structured data, it
is developed to meet different goals than ontologies, which define data more carefully,
taking into consideration relations and interactions between datasets.
Despite its wider adoption, the OWL family needs more reliable tools for modelling and
applying ontologies that can be used by non-technical users. On the other hand, we cannot
just choose any URI and search existing data stores – the data exposure revolution has not yet
happened (Shadbolt, Berners-Lee & Hall 2006).
7 The list of all implementations is available on the W3C Wiki – http://esw.w3.org/topic/SemanticWebTools.
8 Source: OTN Semantic Web (Beta), http://www.oracle.com/technology/otnsemanticweb/index.html, 2008.02.25.
9 Dublin Core Metadata Initiative, http://www.dublincore.org/
10 The Friend of a Friend (FOAF) project, http://www.foaf-project.org/
11 There are webpages where available vocabularies are listed, e.g. SchemaWeb (http://www.schemaweb.info/).
12 GeneOntology, http://www.geneontology.org/
13 The Protein Ontology Project, http://proteinontology.info/
14 eClassOwl, http://www.heppnetz.de/projects/eclassowl/
15 WordNet, http://wordnet.princeton.edu/
16 The Integrated Public Sector Vocabulary was created in the United Kingdom, http://www.esd.org.uk/standards/ipsv. Retrieved on 1.03.2008.
According to Herman (2007b) the Semantic Web, once only of interest to academia, has already
been spotted by small businesses and start-ups. Now the idea is becoming attractive to large
corporations and administration. Major companies offer tools or systems based on the Semantic
Web concept. Adobe has created a labelling technology that allows metadata to be added to most
of their file formats17. Oracle Corporation is not only supporting RDF in its products but is also
using RDF as a base for its Press Room18. The number of companies participating in
W3C Semantic Web Working Groups is increasing. The Corporate Semantic Web was chosen by
Gartner in 2006 as the top emerging technology that will improve the quality of content management,
system interoperability and information access. They predict that it will take 5 to 10 years for
Semantic Web technology to become reliable (Espiner 2006).
Although RDF and OWL are gaining popularity, there is some criticism around these technologies.
It is unclear how to extract RDF data from relational databases. It is possible to do it semi-automatically,
but current mechanisms still require a huge amount of data to be corrected manually.
Also, the costs of preparing data will increase if it has to be published both in a format accessible
to machines (RDF) and in one adjusted for humans to read. The XML syntax of RDF itself is not
human-friendly. To overcome that problem the GRDDL19 mechanism was created. It potentially
allows binding between XHTML and RDF with the use of XSLT.
Another concern is about censorship: as semantic data will be easily accessible, it will also be easy
to filter or block it entirely. Authorities may control the creation and viewing of controversial
information, as its meaning will be more accessible to automated content-blocking systems.
Also, the popularity of FOAF profiles with geo-localisation will decrease users’ anonymity.
There is still a need to develop and standardize functionalities like simpler ontologies and support
for fuzzy logic and rule-based reasoning. There are some initiatives like RIF to regulate automated
reasoning, but there is a lack of standards in that field. Different knowledge domains are
implementing different approaches to inference – the most suitable in particular cases. Also, the
shape of the layers responsible for trust, proof and cryptography still remains a puzzle. Developing
17 Extensible Metadata Platform (XMP) is supported by major Adobe products like Adobe Acrobat, Adobe Photoshop
or Adobe Illustrator. Adobe has also published a toolkit that allows integrating XMP into other applications. The XMP
Toolkit is available under the BSD licence. Source: http://www.adobe.com/products/xmp/index.html
18 Oracle Press Releases, http://pressroom.oracle.com/
19 Gleaning Resource Descriptions from Dialects of Languages (GRDDL) became a W3C Recommendation on
11.09.2007, http://www.w3.org/TR/grddl/. Retrieved on 1.03.2007.
ontologies is an additional challenge, as interoperability, merging and versioning remain unclear.
Antoniou & van Harmelen (2004, p.225) find the problem of ontology mapping probably
the most complicated, as there is no central control over the application of standards and technologies
when modelling ontologies in the open Semantic Web environment.
The Semantic Web vision itself has also been criticised. Even Tim Berners-Lee recently said that even
though the idea is simple, it still remains unrealized (Shadbolt et al. 2006). Walton (2006, p.109)
raises the layered model for discussion, as its present shape implies certain difficulties for the design
of software agents – providing a unified view of independent layers might be a challenge.
The Semantic Web, like the current Web, relies on the principle that people provide reliable content.
Other important aspects are the fundamental design decisions and their consequences for
creating and deploying standards. Both are being fulfilled – particular communities are working
on RDF datasets and there is a broad discussion about each of the layers of the Semantic Web
focused around W3C Working Groups. As Shadbolt et al. (2006) say, the Semantic Web contributes
to Web Science, a science concerned with distributed information systems operating
on a global scale. It is encouraged by the achievements of Artificial Intelligence, data mining
and knowledge management.
2. SPARQL
2.1. RDF – data model for Semantic Web
The vision of the Semantic Web required a new approach to handling data and metadata when it
came to applications. To meet the expectations, in October 1997 W3C published a working draft
for a new universal language to form a basis for the Semantic Web. The Resource Description
Framework (RDF) provides a standard way to describe, model and exchange information about
resources. It was created as a high-level language and, thanks to its low expressiveness, the data is
more reusable. The RDF Model and Syntax Specification became a W3C Recommendation in February
1999. The current version of the specification was published in February 2004. RDF is in
fact a data model encoded with an XML-based syntax. It provides a simple mechanism for making
statements about resources. RDF has a formal semantics that is the basis for reasoning about the
meaning of an RDF dataset.
RDF statements are usually called triples as they consist of three elements: a subject (resource),
a predicate (property) and an object (value). Triples are similar to simple sentences with a
subject-verb-object structure. The structure of an RDF triple can be represented as a logical formula
𝑃(𝑥, 𝑦), where the binary predicate 𝑃 relates object 𝑥 to object 𝑦. Figure 2.1 depicts this structure
(Passin 2004).
\[
\underbrace{(\;\overbrace{\texttt{town1}}^{\text{subject}},\;\overbrace{\texttt{name}}^{\text{predicate}},\;\overbrace{\texttt{"Paisley"}}^{\text{object}}\;)}_{\text{triple}}
\]
Figure 2.1: Structure of RDF triple, after Passin (2004).
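The same three-part structure can be written down directly in code. A minimal Python sketch, reusing the terms from Figure 2.1 (plain tuples, not a real RDF library):

```python
# An RDF statement as a plain (subject, predicate, object) tuple.
triple = ("town1", "name", "Paisley")
subject, predicate, obj = triple
print(f"{subject} --{predicate}--> {obj}")
# town1 --name--> Paisley
```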
The subject of a triple is a resource identified by a URI. A URI reference is usually presented
in URL style extended by a fragment identifier – the part of the URI that follows “#”1. A fragment
identifier relates to some portion of the resource. Different URI schemes and their variations are
also allowed; however, the generic syntax has to remain as defined. The whole URI should be unique but
does not necessarily have to enable access to the resource. A problem with URIs arises with names of
objects that are not unique – the mechanism allows anyone to make statements about any resource.
Another technique to identify a resource is to refer to its relationships with other resources.
RDF accepts resources that are not identified by any URI. These resources are known as blank
nodes or b-nodes and are given internal identifiers, which are unique and not visible outside the
application. Blank nodes can only stand as subjects or objects in a particular triple.
Predicates are a special kind of resources, also identified by URIs, that describe relations between
subjects and objects. Objects can be named by URIs or by constant values (literals) represented by
character strings; objects are the only elements that can be represented by plain strings. Plain literals
are strings extended by an optional language tag. Literals extended by a datatype URI are called typed
literals. RDF, unlike database systems or programming languages, does not have built-in datatypes
– it relies on the ones inherited from XML Schema2, e.g. integer, boolean or date. The use of
externally defined datatypes is allowed, but in practice not popular (Manola & Miller 2004).
The full triples notation requires that URIs are written as complete names in angle brackets.
However, many RDF applications use abbreviated forms for convenience. A full URI
reference is usually very long (e.g. <http://dbpedia.org/resource/Paisley>). It
is shortened to a prefix and a resource name (e.g. dbpedia:Paisley). The prefix is assigned to the
namespace URI. That mechanism is derived from XML syntax and is known as XML QNames3.
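A sketch of how such an abbreviation can be undone in code; the prefix table and helper function below are hypothetical, assuming only two namespaces from the examples in this chapter:

```python
# Minimal sketch of QName expansion: a prefixed name is turned back into a full
# URI reference. The prefix table and helper are invented, not a real RDF library.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand_qname(qname: str) -> str:
    """Expand e.g. 'dbpedia:Paisley' to '<http://dbpedia.org/resource/Paisley>'."""
    prefix, local = qname.split(":", 1)
    return "<" + PREFIXES[prefix] + local + ">"

print(expand_qname("dbpedia:Paisley"))
# <http://dbpedia.org/resource/Paisley>
```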
1 The Uniform Resource Identifier (URI) is defined by RFC 3986. The generic syntax is URI = scheme ":"
hier-part [ "?" query ] [ "#" fragment ]. Source: http://tools.ietf.org/html/rfc3986, [05.05.2008].
2 The XML Schema datatypes are defined in the W3C Recommendation “XML Schema Part 2: Datatypes” (available
at: http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/), which is a part of the specification of the XML Schema
language.
3 The QNames mechanism is described in “Using Qualified Names (QNames) as Identifiers in XML Content”, available
at: http://www.w3.org/2001/tag/doc/qnameids.html.
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfschema="http://www.w3.org/2000/01/rdf-schema#"
xmlns:ns="http://xmlns.com/foaf/0.1/"
xmlns:property="http://dbpedia.org/property/">
<rdf:Description rdf:about="http://dbpedia.org/resource/Paisley">
<rdfschema:label xml:lang="en">Paisley</rdfschema:label>
<ns:img rdf:resource="http://upload.wikimedia.org/wikipedia/en/0/0d/RenfrewshirePaisley.png" />
<ns:page rdf:resource="http://en.wikipedia.org/wiki/Paisley" />
<rdfschema:label xml:lang="pl">Paisley (Szkocja)</rdfschema:label>
<property:reference rdf:resource="http://www.paisleygazette.co.uk" />
<property:latitude rdf:datatype="http://www.w3.org/2001/XMLSchema#double">55.833333</property:latitude>
<property:longitude rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-4.433333</property:longitude>
</rdf:Description>
<rdf:Description rdf:about="http://dbpedia.org/resource/University_of_the_West_of_Scotland">
<property:city rdf:resource="http://dbpedia.org/resource/Paisley" />
<property:name xml:lang="en">University of the West of Scotland</property:name>
<property:established rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1897</property:established>
<property:country rdf:resource="http://dbpedia.org/resource/Scotland" />
</rdf:Description>
<rdf:Description rdf:about="http://dbpedia.org/resource/William_Wallace">
<property:birthPlace rdf:resource="http://dbpedia.org/resource/Paisley" />
<property:death rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1305-08-23</property:death>
<ns:name>William Wallace</ns:name>
</rdf:Description>
</rdf:RDF>
Figure 2.2: RDF statements. Source: DBpedia (http://www.dbpedia.org), RDF/XML validated by
http://www.rdfabout.com/demo/validator/validate.xpd, [12.03.2008]
Figure 2.2 presents a number of triples serialized in RDF/XML syntax using the most basic structures.
The preamble of the listing contains the XML declaration, and the root <rdf:RDF> element
declares the namespaces (QName prefixes) that are used in the document. Every subject is placed in an
<rdf:Description> tag, extended by a URI placed in the rdf:about attribute. Predicates are
called property elements and they are placed within the subject tag. A subject can contain one or
multiple outgoing predicates. In Figure 2.2 every subject has a number of properties. Each property
states the type of the relation and gives the name of the object as an attribute. Properties can also be
extended by datatype or language attributes.
There are many methods of representing RDF statements. They can be encoded in XML syntax,
but a graph-based view is also a very popular representation. The RDF graph model is a collection
of triples represented as a graph, where subjects and objects are depicted as graph nodes and
predicates are represented by arcs directed from the subject node to the object node. An example
of an RDF graph is presented in Figure 2.3, where the triples from Figure 2.2 were transformed
into a graph. The nodes referenced by URIs are shown as oval shapes. Literals are written
within rectangles. Every arc is labelled with the URI of the relationship. The graph-based view, due
to its simplicity, is often used for explaining the concept of a triple.
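The graph view can be sketched in a few lines of Python: subjects become keys in an adjacency structure and each predicate labels a directed edge. The triples below loosely mirror statements from Figure 2.2; the representation itself is illustrative, not a real RDF library:

```python
# Sketch of the graph view: subjects and objects become nodes, predicates become
# labelled directed edges (subject -> object).
from collections import defaultdict

triples = [
    ("dbpedia:Paisley", "rdfs:label", '"Paisley"@en'),
    ("dbpedia:William_Wallace", "dbpedia_prop:birthPlace", "dbpedia:Paisley"),
    ("dbpedia:William_Wallace", "foaf:name", '"William Wallace"'),
]

out_edges = defaultdict(list)  # node -> [(edge label, target node), ...]
for s, p, o in triples:
    out_edges[s].append((p, o))

print(out_edges["dbpedia:William_Wallace"])
# [('dbpedia_prop:birthPlace', 'dbpedia:Paisley'), ('foaf:name', '"William Wallace"')]
```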
Other popular serialization formats for RDF are Notation3 (N3), JSON and Turtle. The RDF
triples from Figure 2.2 encoded in Turtle syntax are presented in Figure 2.4. In that notation, the triples
are shown in the actual subject-verb-object format. Turtle syntax is very straightforward. Every triple
is written in one line ended with a dot. Long URIs can be replaced by short prefix names declared
using the @prefix directive. Literals are simply extended by a language suffix or by a datatype URI.
Turtle allows some abbreviations – when more than one triple involves the same subject, it can be
stated only once, followed by a group of predicate-object pairs separated by semicolons. A similar
operation can be done when both the subject and the predicate are constant.
Figure 2.3: RDF graph. Based on: DBpedia (http://www.dbpedia.org), [12.03.2008]
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia_prop: <http://dbpedia.org/property/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dbpedia:Paisley rdfs:label "Paisley"@en .
dbpedia:Paisley foaf:img
    <http://upload.wikimedia.org/wikipedia/en/0/0d/RenfrewshirePaisley.png> .
dbpedia:Paisley foaf:page <http://en.wikipedia.org/wiki/Paisley> .
dbpedia:Paisley rdfs:label "Paisley (Szkocja)"@pl .
dbpedia:Paisley dbpedia_prop:reference <http://www.paisleygazette.co.uk> .
dbpedia:Paisley dbpedia_prop:latitude "55.833333"^^xsd:double .
dbpedia:Paisley dbpedia_prop:longitude "-4.433333"^^xsd:double .
dbpedia:University_of_the_West_of_Scotland
    dbpedia_prop:city dbpedia:Paisley ;
    dbpedia_prop:name "University of the West of Scotland"@en ;
    dbpedia_prop:established "1897"^^xsd:integer ;
    dbpedia_prop:country dbpedia:Scotland .
dbpedia:William_Wallace dbpedia_prop:birthPlace dbpedia:Paisley .
dbpedia:William_Wallace dbpedia_prop:death "1305-08-23"^^xsd:date .
dbpedia:William_Wallace foaf:name "William Wallace" .
Figure 2.4: RDF statements in Turtle syntax. Source: DBpedia (http://www.dbpedia.org),
[12.03.2008]
RDF has a few more interesting features. One of them is reification, which provides the possibility
to make statements about other statements. Reification of a statement can provide information
about its creator or usage. It might also be used in the process of authenticating the source
of information. Another feature is the possibility to create containers and collections of resources
that can be used for describing groups of things. Depending on the requirements, a container can be
a group of resources or literals with an optionally defined order, or a group whose
members are alternatives to each other. A collection is also a group of elements, but it is closed –
once created it cannot be extended by any new members.
RDF provides a simple syntax for making statements about resources. However, to define
the vocabulary that will be used in a particular dataset there is a need to use the RDF Vocabulary
Description Language, better known as RDF Schema (RDFS). RDFS provides a means for
describing classes of resources and defining their properties. In addition, a hierarchy of classes
can be built. Similarly to object-oriented programming, every resource is an instance of one or more
classes described with particular properties.
RDFS does not have its own syntax – it is expressed by a predefined set of RDF resources.
The resources are identified with the prefix http://www.w3.org/2000/01/rdf-schema#,
usually abbreviated to the rdfs: QName prefix. An application has to support RDFS semantics in
order to understand the special meaning of such a graph; otherwise it is processed as a regular RDF
graph.
Although RDF is supported by W3C, it is not the only solution for the Semantic Web.
Passin (2004, p.60) gives an example of Topic Maps as an ISO standard4 for handling semi-structured
data. Topic Maps were originally designed for creating indexes, glossaries, thesauri
and similar structures. However, their features made them applicable in more demanding domains. Topic
Maps are based on a concept of topics and associations between topics and their occurrences. All
structures have to be defined in ontologies of Topic Maps. The topics are represented with emphasis
on collocation and navigation – it is easier to find particular information and browse
closely related topics. Topic Maps can be applied as a pattern for organizing information. They
can be implemented using many technologies, using the native XML syntax for Topic Maps (XTM)
or even RDF. Their features make them well suited to be a part of the Semantic Web even though
they are not supported by W3C.
RDF is a language that refers directly and unambiguously to a decentralized data model and,
unlike XML, it is straightforward to differentiate the information from the syntax. However, the
technology has some limitations. According to Cardoso (2006), RDF with RDFS is not able
to express the equivalence between terms defined in independent vocabularies. The cardinality and
uniqueness of terms cannot be preserved. What is more, the disjointness of terms and unions of
classes are impossible to express with the limited functionality of RDF. There is also no possibility
to negate statements. Antoniou & van Harmelen (2004, p.68) point out another limitation –
RDF uses only binary predicates, but in certain cases it would be more natural to model a
relation with more than two arguments. In addition, the concept of properties and reification
can be misleading for modellers. Finally, the XML syntax of RDF, being very flexible and
accessible for machine processing, is hardly comprehensible for humans.
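The usual workaround for the binary-predicate limitation can be sketched as follows: an intermediate node splits one n-ary relation into several binary triples. All identifiers and values below are invented for illustration:

```python
# Sketch: expressing a relation with more than two arguments using only binary
# predicates, via an intermediate node. Identifiers and values are invented.
event = "_:birth1"  # a blank-node-style intermediate resource
triples = [
    (event, "ex:person", "dbpedia:William_Wallace"),
    (event, "ex:place", "dbpedia:Paisley"),
    (event, "ex:approxYear", 1270),
]
# Three binary triples together stand for one ternary "born(person, place, year)" fact.
```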
Despite all the disadvantages, RDF retains a good balance between complexity and expressiveness.
What is more, it has become a de facto world standard for the Semantic Web, and is
heavily supported by W3C and developers around the world.
4 Topic Maps were developed as an ISO standard, formally known as ISO/IEC 13250:2003.
2.2. Querying the Semantic Web
2.2.1. Semantic Web as a distributed database
One of the visions of the Semantic Web says that it is able to provide a common way to access,
link and understand data from different sources available on-line. The Web will become a large
interlinked database. This revolutionary approach challenges the current state of knowledge in
managing data. Relational Database Management Systems (RDBMS) are currently some of the
most advanced software ever written and hold the largest data resources in the world. Over
30 years of experience in research and implementations has resulted in the use of sophisticated
mechanisms like query optimization, clustering and retaining ACID properties5. Now the principles of
the Semantic Web imply the need to implement new technologies for managing semantic data.
The Semantic Web has its basic data model – RDF. Passin (2004, p.25) says that the RDF data model
can be compared to the relational data model. In relational databases, data is organized in tables,
where every row is identified by a primary key and has a defined structure. A collection of attributes
that forms a row is called a tuple. Every tuple can be divided into a number of RDF triples
where the primary key becomes the subject. Tuples can be transformed into triples, but the reverse
operation might not be possible. In general, the RDF data model is less structured than a database.
Every table in the relational model has a defined structure which cannot be extended6 – data
is structured and the number of attributes (properties) is known. RDF allows adding new triples
extending the information about a resource. The triples can be partitioned between different
nodes, even ones that are not accessible. An RDBMS maintains consistency across all the data
that it manages. Walton (2006) calls this the closed-world assumption, where everything that is
not defined is false. On the contrary, in the Semantic Web, false information has to be specified
explicitly or it is simply unknown – this is an open-world model. Thanks to that, RDF is more flexible.
However, such an assumption implies the possibility of inconsistency and missing information.
The results of a query vary with the availability of datasets. The returned information may be
only partial, and its size and computing time are unpredictable.
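The tuple-to-triples transformation described above can be sketched as follows; the table URI, column names and URI scheme are invented for illustration:

```python
# Sketch of splitting a relational tuple into triples, with the primary key forming
# the subject URI. Table, columns and URI scheme are invented for illustration.
def row_to_triples(table_uri, pk_column, row):
    subject = f"{table_uri}/{row[pk_column]}"
    return [(subject, f"{table_uri}#{column}", value)
            for column, value in row.items() if column != pk_column]

row = {"name": "Paisley", "latitude": 55.833333, "longitude": -4.433333}
triples = row_to_triples("http://example.org/town", "name", row)
print(triples[0])
# ('http://example.org/town/Paisley', 'http://example.org/town#latitude', 55.833333)
```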
5 Atomicity, Consistency, Isolation, Durability (ACID) are the basic properties that should be fulfilled by a Database
Management System (DBMS) to ensure that transactions are processed reliably.
6 In fact every RDBMS permits modifications of the table structure (the ALTER TABLE command), but altering the
data model in such a way is not a regular operation, so in this context it can be omitted.
Walton (2006) claims that Semantic Web data is more network-structured than relational. In
an RDBMS, data is defined in relations between static tables. Queries are performed on a known
number of tables using set-based operations. In RDF, before a dataset can be queried, it has to be
separated from the whole Web of constantly changing stores. The constant change of
asserted data implies that the results of queries might be incomplete or even unavailable. What
is more, Semantic Web knowledge can be represented in different syntactic forms (RDF with
RDFS, OWL), which results in extended requirements for query languages, as they have to be
aware of the underlying representation. In addition, the structure of the datasets will be unknown
to the querying engines, so they will have to rely on specified web services that will perform the
required selection on their behalf.
The Semantic Web principles put very strict constraints on the services that will manage and query
semantic data. The RDF data model ensures simplicity and flexibility so the responsibility for the
results of the queries will be borne by the query languages and automated reasoners.
2.2.2. Semantic Web queries
The new data model that was designed for the Semantic Web required new technologies that
would allow queries on semantic datasets. New query languages were needed to enable higher-level
application development. The inspiration came from well-established Relational Database Management
Systems and the Structured Query Language (SQL) that is used there for extracting relational data.
However, the relational approach could not be directly translated into the semantic data model.
The RDF data model, with its graph structure, blank nodes and formal semantics, made the problem
more complex. A query language has to understand the semantics of the RDF vocabulary to be able
to return correct information. That is why XML query languages, like XQuery or XPath, turned
out to be insufficient, as they operate on a lower level of abstraction than RDF (Figure 1.4).
To effectively support the Semantic Web, a query language should have the following properties
(Haase, Broekstra, Eberhart & Volz 2004):
∙ Expressiveness — specifies how complicated queries can be defined in the language. Usually
the minimal requirement is to provide the means proposed by relational algebra.
∙ Closure — assumes that the result of an operation becomes a part of the data model; in the
case of the RDF model, the result of a query should be in the form of a graph.
∙ Adequacy — requires that a query language working on a particular data model uses all of its
concepts.
∙ Orthogonality — requires that all operations can be performed independently of the
usage context.
∙ Safety — assumes that every syntactically correct query returns a definite set of results.
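At their core, most RDF query languages evaluate graph patterns against a set of triples. A naive, illustrative Python sketch of matching a single triple pattern (variables are marked with a leading “?”; the helper and data are invented, loosely based on the earlier examples):

```python
# Naive sketch of graph-pattern matching, the basic operation behind RDF query
# languages. A pattern term starting with "?" is a variable; others must match.
def match_pattern(triples, pattern):
    """Return variable bindings of one triple pattern against a list of triples."""
    results = []
    for t in triples:
        binding = {}
        for p_term, term in zip(pattern, t):
            if p_term.startswith("?"):
                binding[p_term] = term
            elif p_term != term:
                break  # constant term does not match this triple
        else:
            results.append(binding)
    return results

data = [
    ("dbpedia:Paisley", "rdfs:label", "Paisley"),
    ("dbpedia:William_Wallace", "dbpedia_prop:birthPlace", "dbpedia:Paisley"),
]
print(match_pattern(data, ("?who", "dbpedia_prop:birthPlace", "dbpedia:Paisley")))
# [{'?who': 'dbpedia:William_Wallace'}]
```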
Query languages for RDF were developed in parallel with RDF itself. Some of them were closer to
the spirit of relational database query languages, some were more inspired by rule languages. One
of the first was rdfDB, a simple graph-matching query language that became an inspiration
for several other languages. RdfDB was designed as a part of an open-source RDF database of
the same name. One of its followers is Squish, which was designed to test some RDF query language
functionalities. Squish was announced by Libby Miller in 20017. It has several implementations,
like RDQL and Inkling8. RQL is based on a functional approach that supports generalized path
expressions9. It has a syntax derived from OQL. RQL evolved into SeRQL. RDQL is a SQL-like
language derived from Squish. It is a quite safe language that offers limited support for datatypes.
RDQL had submission status in W3C but never became a recommendation10. A different approach
was used in the XPath-like query language called Versa11, where the main building block is the
list of RDF resources. RDF triples are used in traversal operations, which return the result of the
query. Other languages include Triple12, a query and transformation language; QEL, a query-exchange
language developed as a part of the Edutella project13 that is able to work across heterogeneous
repositories; and DQL14, which is used for querying DAML+OIL knowledge bases. Triple and DQL
represent the rule-based approach.
7 The RDF Squish query language and Java implementation are available at: http://ilrt.org/discovery/2001/02/squish/,
[02.05.2008]
8 Inkling Architectural Overview, available at: http://ilrt.org/discovery/2001/07/inkling/index.html, [02.05.2008]
9 RQL: A Declarative Query Language for RDF, available at:
http://139.91.183.30:9090/RDF/publications/www2002/www2002.html, [02.05.2008]
10 http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/
11 The specification of Versa is available at: http://copia.ogbuji.net/files/Versa.html, [02.05.2008].
12 Triple’s homepage is available at: http://triple.semanticweb.org/, [02.05.2008]
13 Edutella is a p2p network that enables other systems to search and share semantic metadata. Its homepage is available
at: http://www.edutella.org/edutella.shtml, [02.05.2008].
14 The specification of DQL is available at: http://www.daml.org/2003/04/dql/dql, [02.05.2008].
The variety of RDF query languages developed by different communities resulted in compatibility
problems. What is more, according to Gutiérrez, Hurtado & Mendelzon (2004), different imple-
mentations were using different query mechanisms that had not been a subject of formal studies,
so there were doubts whether some of them might behave unpredictably. W3C was aware of all
these weaknesses. To decrease redundancy and increase interoperability between technologies, in
February 2004 W3C formed the RDF Data Access Working Group (DAWG), which aimed to recommend
a query language that would become a worldwide standard. DAWG divided the task into two
phases. At the beginning, they wanted to define the requirements for the RDF query language.
They reviewed the existing implementations and wanted to choose a query language that would
be a starting point for further work in the next phase. In the second phase they prepared a
formal specification together with test cases for the RDF query language (Prud’hommeaux 2004).
In October 2004, the First Working Draft of the SPARQL Query Language was published.
2.3. The SPARQL query language for RDF
DAWG worked on the SPARQL specification for more than a year. After six official Working Drafts,
in April 2006 DAWG published a W3C Candidate Recommendation for the SPARQL Query Lan-
guage for RDF. However, the community involved in developing the new standard pointed out
several weaknesses of that version of the SPARQL specification, and it was returned to Working Draft
status in October 2006. After a few months and one more Working Draft, the specification again reached
Candidate Recommendation status in June 2007. When the exit criteria stated in the document
were met (e.g. each SPARQL feature needed to have at least two implementations, and the test
results had to be satisfying), the specification went smoothly to the Proposed Recommendation stage in
November 2007. Finally, the SPARQL Query Language for RDF became a W3C Recommendation
on 15 January 2008.
The word SPARQL is an acronym of SPARQL Protocol and RDF Query Language (SPARQL
15 The official W3C Technical Report Development Process assumes that work on every document starts from the Working Draft. After positive feedback from the community, a Candidate Recommendation is published. When the document gathers satisfying implementation experience it moves to Proposed Recommendation status. This mature document then awaits approval from the W3C Advisory Committee. The last stage is the W3C Recommendation, which ensures that the document is a W3C standard. Source: World Wide Web Consortium Process Document (2005)
Figure 2.5: The history of SPARQL’s specification. Based on SPARQL Query Language for RDF
(2008)
Frequently Asked Questions 2008). In fact, the SPARQL query language is closely related to two
other W3C standards: the SPARQL Protocol for RDF and the SPARQL Query Results XML Format.
Although SPARQL is a W3C standard, there are twelve open issues waiting to be resolved by
DAWG.
The SPARQL query language has an SQL-like syntax. Its queries use required or optional graph
patterns and return a full subgraph that can be a basis for further processing. SPARQL uses
datatypes and language tags. Patterns can also be matched subject to functional constraints.
Additional features include sorting the results, limiting their number and removing duplicates.
SPARQL does not have the complete functionality that was requested by its users. Some of the
features are being implemented as SPARQL extensions. To avoid inconsistency between imple-
mentations, W3C keeps a list of official SPARQL Extensions on their wiki. The list contains
a number of missing features, including proposals for insert, update and delete operations for
SPARQL, subqueries and aggregation functions.
16 The SPARQL Protocol for RDF defines a remote protocol for transmitting SPARQL queries and receiving their results. It became a W3C Recommendation in January 2008. The specification is available at: http://www.w3.org/TR/rdf-sparql-protocol/.
17 The SPARQL Query Results XML Format specifies the format of an XML document representing the results of SELECT and ASK queries. It became a W3C Recommendation in January 2008. The specification is available at: http://www.w3.org/TR/rdf-sparql-XMLres/.
18 The list is available at: http://esw.w3.org/topic/SPARQL/Extensions, [06.04.2008].
2.4. Implementation model
SPARQL can be used for querying heterogeneous data sources that operate on native RDF or have
access to an RDF dataset via middleware. The model of possible implementations is presented in
Figure 2.6. Middleware in that case maps the SPARQL query into SQL, which operates on
RDF data fitted into the relational model. The main advantage of that approach is the possibility of
using the advanced features of an RDBMS and benefitting from years of experience in managing
huge amounts of data. However, the approach still requires the semantic data to be accessible
as an RDF model. Nowadays a great amount of data is still stored in the relational model. To
make it accessible it would have to be transformed into the RDF data model, which would be time
consuming and may not always be possible. Most current computer systems operate on data
encapsulated in the relational model, and a revolution in that approach is very unlikely. One of the
suggested solutions is the automatic transformation of relational data into the Semantic Web with
the help of Relational.OWL (de Laborda & Conrad 2005).
Figure 2.6: SPARQL implementation model. Source: Herman (2007a)
Relational.OWL is an application-independent representation format, based on the OWL language, that
describes data stored in the relational model together with the relational schema and its semantic
interpretation. The solution consists of three layers: Relational.OWL on top, an ontology created
with Relational.OWL to represent the database schema, and a data representation on the bottom, which
is based on another ontology. It can be applied to any RDBMS. Relational data represented by
Relational.OWL is accessible like normal semantic data, so it can be queried with SPARQL. The
main advantage of such an approach is the possibility of publishing relational data in the Semantic
Web with almost no cost of transforming it to RDF. What is more, changes to the relationally stored
data, together with its schema, are automatically transferred to its semantic representation.
However, all the imperfections of the database schema affect the quality of the generated ontology.
To avoid that, Relational.OWL can be extended with additional manual mapping, as described in
Pérez de Laborda & Conrad (2006). In that case, the possibility of generating a graph from the query
results is used. The subgraph involves the manual adjustments of the original ontology.
Such a dataset is mapped to the target ontology and is free from the drawbacks of Relational.OWL’s
automatic mapping.
The technology is still under development. de Laborda & Conrad (2005) indicate only the pos-
sibility of representing relational data as a mature feature. Further studies will be directed towards
supporting data exchange and replication.
A similar approach is found in the D2RQ language (Bizer, Cyganiak, Garbers & Maresch 2007).
This is a declarative language that describes mappings between relational data and ontologies. It
is based on RDF and formally defined by the D2RQ RDFS Schema
(http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1). The language does
not support data modification; the mappings are available in read-only mode. D2RQ is
part of a wider solution called the D2RQ Platform. Apart from the implementation of the language,
the Platform includes the D2RQ Engine, which translates queries into SQL, and the D2R Server,
which is an HTTP server with extended functionality including support for SPARQL.
Another interesting implementation of this approach is Automapper (Matt Fisher & Joiner 2008).
The tool is part of a wider architecture that processes a SPARQL query over multiple data sources
and returns a combined query result. Automapper uses the D2RQ language to create a data source ontol-
ogy and a mapping instance schema, both based on a relational schema. These ontologies are used
for decomposing a semantic query at the beginning of processing and for translating SPARQL into
SQL just before executing it against the RDBMS. To decrease the number of variables and statements
used in processing a query, and to improve performance, Automapper uses SWRL rules that are
based on database constraints. The solution is available in the Asio Tool Suite, a software package for
managing data created by BBN Technologies.
The implementations mentioned above are not the only ones available. The community gath-
ered around MySQL is working on SPASQL, a SPARQL support built into the database. Data
integration solutions, like DartGrid or SquirrelRDF, are also available. Finally, all-in-one
suites, like the OpenLink Virtuoso Universal Server, can be used for querying non-RDF data stores with
SPARQL or other Semantic Web query languages.
Mapping relational databases, while having indisputable advantages, also has some limitations.
Data in an RDBMS are often messy and do not conform to widely accepted database design
principles. To meet the expectations and provide high-quality RDF data, the mapping language has
to be very expressive. It should have a number of features, like sophisticated transformations,
conditional mappings, custom extensions and the ability to cope with data organized at different levels
of normalization.
Future users expect the data to be highly integrated and highly accessible. RDF datasets that have
a relational background are still not reliable. There is a need for further study of mechanisms for
querying multiple data sources, data source discovery and schema mapping, as the current solutions
based on RDF and OWL are insufficient.
Using a bridge between SPARQL and an RDBMS is the most demanding problem, but such applica-
tions will seriously increase the availability of semantic data. However, as depicted in Figure 2.6,
it is not the only medium that SPARQL can query. Although very powerful, RDF is a somewhat unwieldy tech-
nology. What is more, embedding it into XHTML is rather useless, as applications built around
HTML do not recognise it. In addition, transforming data already available in XHTML would
require a significant amount of work. To simplify the process of embedding semantic data into web
pages, W3C started to work on a set of extensions to XHTML called RDFa. RDFa is a set of at-
tributes that can be used within HTML or XHTML to express semantic data (RDFa Primer 2008).
19 BBN Technologies, http://www.bbn.com/.
20 SPASQL: SPARQL Support In MySQL, http://www.w3.org/2005/05/22-SPARQL-MySQL/XTech.
21 SquirrelRDF, http://jena.sourceforge.net/SquirrelRDF/.
22 OpenLink Virtuoso Universal Server Platform, http://www.openlinksw.com/virtuoso/.
23 The first W3C Working Draft was published in March 2006. At the time of writing RDFa still has the same status – the latest Working Draft was published in March 2008.
It consists of the meta and link attributes that already exist in XHTML version 1 and a
number of new ones introduced by XHTML version 2. RDFa attributes can extend
any HTML element, placed in the document header or body, creating a mapping between the element
and the desired ontology and making it accessible as an RDF triple. The attributes do not affect the
browser’s display of the page, as HTML and RDF are separated. The most important advantage
of RDFa is that there is no need to duplicate data by publishing it both in human-readable format and in
machine-readable metadata. There are no standards for publishing RDFa attributes, so every pub-
lisher can create their own. Another benefit is the simplicity of reusing the attributes and
extending the already existing ones with new semantics.
RDFa is in some cases very similar to microformats. However, while each microformat has a defined
syntax and vocabulary, RDFa specifies only the syntax and relies on vocabularies created by
publishers or on independent ones like FOAF or Dublin Core.
A microformat is an approach to publishing metadata about content using HTML or XHTML
with some additional attributes specific to each format. Every application that is aware of these
attributes can extract semantics from the document they are embedded in. They do not affect
other software, e.g. web browsers. There are a number of different microformats, most of them
developed by the community gathered around Microformats.org. A very popular one is XFN, which
is a way to express social relationships with the use of hyperlinks. Other common microfor-
mats are hCard and hCalendar, which are ways to embed information based on the vCard and
iCalendar standards in documents.
Figure 2.7: The process of transforming calendar data from XHTML extended by hCalendar mi-
croformat into RDF triples. Source: GRDDL Primer (2007).
SPARQL is also able to query documents which have some semantic information embedded in the
content using e.g. microformats. To process a query over such a document, the SPARQL engine needs to
24 The vCard electronic business card is a common standard, defined by RFC 2426 (http://www.ietf.org/rfc/rfc2426.txt), for representing people, organizations and places.
25 iCalendar is a common format for exchanging information about events, tasks, etc., defined by RFC 2445 (http://tools.ietf.org/html/rfc2445).
know the “dialect” that was used for encoding the metadata. Aware of this barrier, W3C started
to work on a universal mechanism for accessing semantics written in non-standard formats. At the
end of 2006, they introduced the mechanism for Gleaning Resource Descriptions from Dialects of
Languages (GRDDL). GRDDL introduced markup that indicates whether a document includes data
that complies with the RDF data model, in particular for documents written in XHTML and, more generally,
in XML. The appropriate information is written in the header of the document. Further
markup links to the transformation algorithm for extracting semantics from the document. The
algorithm is usually available as an XSLT stylesheet. The SPARQL engine extracts the metadata
from the document, applying the transformations fetched from the relevant file, and presents the data as
in the RDF data model. The process of transforming metadata encoded in a specific “dialect” into
RDF is depicted in Figure 2.7.
SPARQL, together with some related technologies, was designed to be a unifying point for all
semantic queries. SPARQL engines will be able to serve dedicated applications and other
SPARQL endpoints, providing information that they can extract from the documents directly
accessible to them. Some implementations of this mechanism already exist. One of them is the
public SPARQL endpoint to DBpedia, which is able to return data from other semantic datastores
that are linked to its dataset.
2.5. SPARQL’s syntax
SPARQL is a pattern-matching RDF query language. In most cases, the query consists of a set of
triple patterns called a basic graph pattern. The patterns are similar to RDF triples. The difference
is that each of the elements can be a variable. The pattern is matched against an RDF dataset.
The result is a subgraph of the original dataset where all the constant elements of the patterns are matched
and the variables are substituted by data from the matched triples. A pair consisting of a variable and the RDF data
matched to it is called a “binding”. A set of related bindings that forms a row in
the result set is known as a “solution”.
The SPARQL basic syntax is very similar to SQL – a query starts with the SELECT clause, called the projection,
which identifies the set of returned variables, and ends with the WHERE clause providing a basic graph
pattern. Variables in SPARQL are indicated by the $ or ? prefixes. Similarly to the Turtle syntax, URIs
26 The DBpedia public SPARQL endpoint is available at: http://dbpedia.org/sparql, [02.05.2008].
can be abbreviated using the PREFIX keyword and a prefix label with a definition of the namespace.
If the same namespace occurs in multiple places, it can be set as a base URI. Then relative URIs, like
<property/>, are resolved against the base URI. Triple patterns can be abbreviated in the same way
as in the Turtle syntax – a common subject can be omitted using the “;” notation, and a list of objects
sharing the same subject and predicate can be written on the same line separated by “,”. The
query results can contain blank nodes, which are unique within the subgraph and indicated by the “_:”
prefix.
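As a sketch, the two Turtle-style abbreviations can be combined as follows (the prefixes are those used in the figures below; the labels are purely illustrative and need not exist in the dataset):

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city
WHERE {
  # ";" repeats the subject ?uni for a new predicate-object pair;
  # "," repeats both the subject and the predicate for a new object.
  ?uni dbpedia:city ?city ;
       rdfs:label "An example university"@en ,
                  "Przykładowy uniwersytet"@pl .
}
```

The unabbreviated form would require writing `?uni` three times; the abbreviations only shorten the query text and do not change its meaning.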
A simple query to find the name of the university in Paisley from the dataset presented in Figure 2.4
is shown in Figure 2.8.
BASE <http://dbpedia.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <property/>
SELECT DISTINCT ?city ?uniname
WHERE {
?city rdfs:label "Paisley (Szkocja)"@pl .
?uni dbpedia:city ?city .
?uni dbpedia:established "1897"^^xsd:integer .
?uni dbpedia:name ?uniname .
}
city uniname
http://dbpedia.org/resource/Paisley University of the West of Scotland
Figure 2.8: Simple SPARQL query with the result. Source: DBpedia (http://www.dbpedia.org),
[12.04.2008]
SPARQL has a number of different query result forms. SELECT is used for obtaining variable
bindings. Another form is CONSTRUCT, which returns an RDF dataset built from a graph pattern that
is applied to the subgraph returned by the query. This feature can be used to create RDF subgraphs
that become a base for further processing; e.g. Relational.OWL uses it to map an automatically
created ontology based on a relational schema into the desired ontology. Figure 2.9 presents the usage
of the CONSTRUCT clause to build a subgraph according to a required pattern.
Another two forms are ASK and DESCRIBE. The first returns a boolean value that indicates
whether the query pattern matches the RDF graph or not. The usage of the ASK clause is similar to
the SELECT clause; the only difference is that there is no specification of returned variables.
DESCRIBE is used to obtain all triples from the RDF dataset that describe the stated URI.
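The two remaining result forms can be sketched using the Paisley example from Figure 2.8 (the exact set of triples returned by DESCRIBE is left to each implementation, so only the general shape is suggested here):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# ASK returns only a boolean: does any matching solution exist?
ASK { ?city rdfs:label "Paisley (Szkocja)"@pl }

# DESCRIBE returns the triples the endpoint considers descriptive
# of the given resource; the exact selection is implementation-defined.
DESCRIBE <http://dbpedia.org/resource/Paisley>
```

Against the dataset of Figure 2.8 the ASK query would return true, while the DESCRIBE query would return an RDF graph rather than a table of bindings.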
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <http://dbpedia.org/property/>
CONSTRUCT {?uni <http://dbpedia.org/property/located_in> ?city.
?uni <http://dbpedia.org/property/has_name> ?uniname }
WHERE {
?city rdfs:label "Paisley (Szkocja)"@pl .
?uni dbpedia:city ?city .
?uni dbpedia:established "1897"^^xsd:integer .
?uni dbpedia:name ?uniname .
}
Returned RDF subgraph serialized in Turtle:
<http://dbpedia.org/resource/University_of_the_West_of_Scotland>
<http://dbpedia.org/property/located_in> <http://dbpedia.org/resource/Paisley>;
<http://dbpedia.org/property/has_name> "University of the West of Scotland"@en.
Figure 2.9: Application of CONSTRUCT query result form with the results of the query serialized
in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.04.2008]
Every query language should provide a means to filter the results returned by a generic query.
SPARQL uses the FILTER clause to restrict the result by adding filtering conditions. Using condi-
tions, SPARQL can filter string values with regular expressions as defined in the XQuery 1.0
and XPath 2.0 Functions and Operators (2007) W3C specification. A subset of the functions and
operators used in XPath is also available – all the arithmetic and logical functions come from that
language. However, SPARQL introduces a number of new operators, like bound(), isIRI()
or lang(). All of them are described in detail in the SPARQL Query Language for RDF (2008).
There is also the possibility of using external functions identified by a URI. That feature may be used
to perform transformations not supported by SPARQL or for testing specific datatypes.
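A minimal sketch of the new operators, reusing the DBpedia properties from the earlier figures: combining bound() with an OPTIONAL pattern is the usual way to express “has no value” in this version of SPARQL.

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>
SELECT ?uni
WHERE {
  ?uni dbpedia:type <http://dbpedia.org/resource/Public_university> .
  OPTIONAL { ?uni dbpedia:head ?head }
  # Keep only universities for which no head was matched;
  # isIRI(?uni) additionally ensures ?uni is a resource, not a literal.
  FILTER (!bound(?head) && isIRI(?uni))
}
```

Without aggregate or negation constructs in the specification, this bound()-based idiom is the only standard way to ask for resources that lack a given property.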
After applying filters, SPARQL returns the result of graph pattern matching. However, the list of
query solutions is in no defined order. Similarly to SQL, SPARQL provides a means to modify the set
of results. The most basic modifier is the ORDER BY clause, which orders the solutions according to a
chosen binding. The solutions can be ordered ascending, using the ASC() modifier, or descending,
indicated by the DESC() modifier.
It is common for the solutions in a result set to be duplicated. The keyword DISTINCT ensures
that only unique solutions are returned. The REDUCED modifier has similar functionality; however,
while DISTINCT guarantees that duplicate solutions are eliminated, REDUCED merely permits them to be
eliminated. In that case each solution occurs at least once, but no more often than without the
modifier. Another two modifiers affect the number of returned solutions. The keyword LIMIT
defines how many solutions will be returned. The OFFSET clause determines the number of
solutions after which the required data will be returned. The combination of these two modifiers
returns a particular number of solutions starting at a defined point.
27 XML Path Language (XPath) is a language for addressing parts of an XML document. It provides the possibility to perform operations on strings, numbers or boolean values. XPath is now available in version 2.0, which has been a W3C Recommendation since January 2007. Source: XML Path Language (XPath) 2.0 (2007)
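The solution modifiers can be sketched together as follows (REDUCED behaviour is engine-dependent, so the number of surviving duplicates may vary between implementations):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT REDUCED ?label       # duplicates may, but need not, be eliminated
WHERE { ?s rdfs:label ?label }
ORDER BY ASC(?label)        # ordering makes LIMIT/OFFSET deterministic
OFFSET 20                   # skip the first twenty solutions...
LIMIT 10                    # ...and return at most the next ten
```

Note that without ORDER BY the solutions skipped by OFFSET are arbitrary, so paging through results is only meaningful when an ordering is imposed.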
BASE <http://dbpedia.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <property/>
SELECT DISTINCT ?uniname ?countryname ?no_students ?no_staff ?headname
WHERE {
{
?uni dbpedia:type <http://dbpedia.org/resource/Public_university>.
?uni dbpedia:country ?country.
?country rdfs:label ?countryname.
?uni dbpedia:undergrad ?no_students.
?uni dbpedia:staff ?no_staff.
?uni rdfs:label ?uniname.
FILTER (xsd:integer(?no_staff) < 2000).
FILTER (regex(str(?country), "Scotland") || regex(str(?country),"England")).
FILTER (lang(?uniname)="en")
FILTER (lang(?countryname)="en")
}
OPTIONAL
{?uni dbpedia:head ?headname}
}
ORDER BY DESC(?no_students)
LIMIT 5
uniname countryname no students no staff headname
Napier University Scotland 11685 1648
University of the West of Scotland Scotland 11395 1300 Professor Bob Beaty
University of Stirling Scotland 6905 1872 Alan Simpson
Aston University England 6505 1,000+
Heriot-Watt University Scotland 5605 717 Gavin J Gemmell
Figure 2.10: SPARQL query presenting universities with their numbers of students and staff and,
optionally, the name of the head, with some filtering applied. Below are the results of the
query. Source: DBpedia (http://www.dbpedia.org), [20.04.2008]
Supporting only basic graph patterns would in some cases be a very serious limitation. SPARQL
provides mechanisms to combine a number of small patterns into a more complex set of triples.
The simplest one is the group graph pattern, where all stated triple patterns have to match against the
given RDF dataset. A group graph pattern is presented in Figure 2.8. A result of graph pattern matching
can be modified using the OPTIONAL clause. The RDF data model is subject to constant change, so
an assumption of full availability of the desired information is too strict. In contrast to group graph pattern
matching, the OPTIONAL clause allows the result set to be extended with additional information without
eliminating the whole solution if that particular information is inaccessible. When the optional
graph pattern does not match, the value is not returned and the binding remains empty. If there
is a need to present a result set that contains a set of alternative subgraphs, SPARQL provides a
way to match more than one independent graph pattern in one query. This is done by employing
the UNION keyword in the WHERE clause to join alternative graph patterns. The result consists of the
sequence of solutions that match at least one of the graph patterns.
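A sketch of UNION with the DBpedia properties used in the figures above (dbpedia:campus is a hypothetical property chosen purely for illustration):

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>
SELECT ?uni ?place
WHERE {
  # A solution is returned if it matches either alternative;
  # a solution matching both patterns appears once per match.
  { ?uni dbpedia:city ?place }
  UNION
  { ?uni dbpedia:campus ?place }
}
```

Unlike OPTIONAL, which extends an existing solution, UNION produces independent solutions from each branch, so the result is the concatenation of the two match sets.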
Finally, SPARQL can restrict the source of the data that is being processed. An RDF dataset always
consists of at least one RDF graph, which is the default graph and does not have a name. The
optional graphs are called named graphs and are identified by URIs. SPARQL usually queries
the whole RDF dataset, but the scope can be limited to a number of named graphs. An RDF
dataset is specified by URI using the FROM clause, which indicates the active dataset.
The representation of the resource identified by the URI should contain the required graph – this can
be e.g. a file with an RDF dataset or another SPARQL endpoint. If a combination of datasets is
referred to by the FROM keyword, the graphs are merged to form a default RDF graph. To query
a graph without adding it to the default dataset, the graph should be referred to by the FROM NAMED
clause. In that case the relation between the RDF dataset and the named graph is indirect; the named graph
remains independent of the default graph. To switch between the active graphs SPARQL uses the
GRAPH clause. Only triple patterns stated inside the clause are matched against the active
graph. Outside the clause, triple patterns are matched against the default graph. The GRAPH clause
is very powerful: it can be used not only to provide solutions from specific graphs, but also to find
the graph containing a desired solution.
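The dataset clauses can be sketched as follows (the example.org URIs are placeholders for real graph locations):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?g ?uni ?label
FROM <http://example.org/base.rdf>                 # merged into the default graph
FROM NAMED <http://example.org/universities.rdf>   # kept as a separate named graph
WHERE {
  ?uni rdfs:label ?label .      # matched against the default graph
  GRAPH ?g {                    # matched against each named graph in turn;
    ?uni rdfs:label ?label      # ?g reports which graph supplied the match
  }
}
```

Because ?g is a variable here, the query also illustrates the second use of GRAPH mentioned above: discovering which named graph contains the desired solution.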
SPARQL is a technology that the whole community has been waiting for. Its official specification
regulates access to RDF datastores, which should result in increased popularity of the whole
concept and cause SPARQL to be regarded not just as a technology for academia, but as a
stable solution that is worth implementing in common data access tools.
However, the current specification of SPARQL does not fully meet the requirements. The com-
munity has pointed out the lack of data modification functions as one of the most serious issues.
Another problem is the inability to use cursors, caused by the stateless character of the proto-
col. SPARQL does not allow computing or aggregating results; this has to be done by external
modules. What is more, querying collections and containers may be complicated, which may be
especially inconvenient while processing OWL ontologies. Finally, the lack of support for full-text
search is quite problematic.
Despite these limitations, SPARQL is a significant step on the way to the Semantic Web, and also a starting
point for research on the higher layers of the Semantic Web “layer cake” diagram. There is still
room for improvement and further research, and W3C should consider starting work on
the next version of the SPARQL Query Language.
2.6. Review of Literature about SPARQL
SPARQL Query Language for RDF is a relatively new technology. It is indisputably gaining
popularity within the Semantic Web community, but there has been little research so far on the
language itself and its implementability. Google Scholar returns only 2030 search results for the
word “sparql”. This is almost nothing compared to the number of search results when looking for
the word “rdf” – 237000, or documents related to “semantic web” – 340000. Google Scholar is
not an objective source of knowledge – the number of results may vary depending on the date and on
whether a local version of the search engine is used. However, it shows how big the difference in popularity
is between the stable RDF and the brand-new SPARQL. What is more, the number of publications in which
the SPARQL query language and its implementation issues are the subject of research is very small.
Usually SPARQL appears in the context of a complex architecture implemented to
solve a particular problem with the means provided by the Semantic Web.
The first complete study of the requirements that a semantic query language has to meet was done
in “Foundations of Semantic Web Databases” (Gutiérrez et al. 2004). According to the paper,
the new features of RDF, like blank nodes, reification, redundancy and RDFS with its vocabulary,
require a new approach to queries in comparison to relational databases. The authors begin by
proposing the notion of a normal form for RDF graphs. The notion is a combination of core and closed
graphs. A core graph is one that cannot be mapped into a proper subgraph of itself. An RDFS vocabulary together with
all the triples it applies to is called a closed graph. A problem is the redundancy of triples; the
authors describe an algorithm that allows reduction of the graph. Even so, computing the normal
and reduced forms of the graph is still very difficult. On that theoretical background a formal
definition of an RDF query language is given. A query is a set of graphs considered within a set of
premises, with some of the elements replaced by variables limited by a number of constraints. The
answer to a query is a separate and unique graph. A very important property that every query
language should have is the possibility to compose complex queries from the results of simpler
ones (compositionality). A union or merge of single answers can achieve this. In the first case,
the existing blank nodes have unique names, while when merging the result sets the names of the
blank nodes have to be changed. The union operation is more straightforward and can create data-
independent queries. The merge operator is more useful for querying several sources. Finally, the
authors discuss the complexity of answering queries.
28 The test was performed using http://scholar.google.pl on 6.05.2008.
Similar theoretical deliberations on a semantic query language can be found in “Semantics and
Complexity of SPARQL” (Perez, Arenas & Gutierrez 2006a). This time, however, the authors start
from the RDF formalization done in Gutiérrez et al. (2004) to examine the graph pattern facility
provided by SPARQL. Although the features of SPARQL seem to be straightforward, in com-
bination they create increased complexity. According to the authors, SPARQL shares a number of
constructs with other semantic query languages. However, there was still a need to formalize the
semantics and syntax of SPARQL. The authors consider the graph pattern matching facility limited
to one RDF dataset. They start by defining the syntax of a graph pattern expression as a set of
graph patterns related to each other by the AND, UNION and OPTIONAL operators and limited by
a FILTER expression. Then they define the semantics of the query language. It turns out that the op-
erators UNION and OPTIONAL make the evaluation of the query more complex. There are
two approaches for computing answers to graph patterns. The first one uses operational seman-
tics, which means that the graphs are matched one after another, using intermediate results from the
preceding matchings to decrease the overall cost. The second approach is based on bottom-up eval-
uation of the parse tree, minimizing the cost of the operation using relational algebra. Relational
algebra can be applied to SPARQL quite easily; however, there are some discrepancies. The lack of con-
straints in SPARQL makes the OPTIONAL operator not fully equal to its relational counterpart
– the left outer join. Further issues are null-rejecting relations, which are impossible in SPARQL, and
the Cartesian product, which is often used in SPARQL. Finally, the authors state the normal form of an
optional triple pattern that should be followed to design cost-effective queries. It assumes that
all patterns outside OPTIONAL should be evaluated before matching the optional patterns.
Similar conclusions are drawn when evaluating graph patterns with relational algebra in Cyganiak
(2005b).
The authors of Perez et al. (2006a) continue their studies on semantics of SPARQL in “Semantics
of SPARQL” (Perez, Arenas & Gutierrez 2006b). The goal of this technical report was to update
the original publication with the changes introduced by W3C Working Draft published in October
2006. The authors extend the definitions of graph patterns stated in the previous paper and discuss
the support for blank nodes in graph patterns and bag/multiset semantics for solutions. At the
beginning, the authors state the basic definitions of RDF and basic graph patterns. Then they
define the syntax and semantics of general graph patterns. They also include the GRAPH operator,
which switches the graph that is matched against the query. Another extension to Perez et al.
(2006a) is the semantics of query result forms; the SELECT and CONSTRUCT clauses are also being
discussed. Finally, the definition of graph patterns is extended by the support for blank nodes and
bags. The main problem they indicate is the increased cardinality of the solutions. They finish
the report with two remarks about query entailment, which was not fully defined at the time of
writing.
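The cardinality problem can be illustrated in plain Python with hypothetical solution multisets: under bag semantics the solutions of a UNION are simply concatenated, so overlapping branches inflate the result, while set semantics (as with SELECT DISTINCT) would collapse the duplicates:

```python
# Hypothetical solution multisets produced by the two branches of a UNION.
branch_a = [{"?x": "ex:alice"}, {"?x": "ex:bob"}]
branch_b = [{"?x": "ex:alice"}]  # overlaps with branch_a

# Bag (multiset) semantics: cardinalities add up, duplicates survive.
bag_union = branch_a + branch_b

# Set semantics: duplicates collapse, as SELECT DISTINCT would do.
set_union = [dict(t) for t in {tuple(sorted(b.items())) for b in bag_union}]

print(len(bag_union))  # 3 solutions: ex:alice is reported twice
print(len(set_union))  # 2 solutions
```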
The author of “A relational algebra for SPARQL” (Cyganiak 2005b) does not focus on a generic
definition of SPARQL queries. He transforms SPARQL into relational algebra, an intermediate
language for query evaluation that is widely used for analysing queries on the relational model.
Such an approach has significant advantages: it provides knowledge about query optimization
to SPARQL implementers, makes SPARQL support in relational databases more straightforward,
and simplifies further analysis of queries over distributed data sources. The author covers only
queries over basic graph patterns. Some special cases are also considered; however, the filtering
operator still requires further research.
At the beginning, the author assumes that an RDF graph can be represented as a relational table with
three columns corresponding to ?subject, ?predicate and ?object. Each triple is stored as a separate
record. A new term is also introduced: an RDF tuple, an example of which is presented in
Figure 2.11, is a container that maps a number of variables to RDF terms and is also known as an
RDF solution. Tuple is a universal term used in relational algebra. Every variable present in a
tuple is said to be bound. A set of tuples forms an RDF relation. Relations can be transformed
back into triples to form a data set.
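The translation can be sketched in plain Python (invented data; a simplified sketch, not Cyganiak's actual algorithm): the graph is a three-column relation, and matching one triple pattern combines a selection (the constant terms must be equal) with a projection onto the pattern's variables, yielding a relation of RDF tuples:

```python
# An RDF graph as a 3-column relation; each triple is one record.
triples = [
    ("ex:uws", "rdf:type", "ex:University"),
    ("ex:uws", "ex:name",  '"UWS"'),
    ("ex:gcu", "rdf:type", "ex:University"),
]

def match(pattern, graph):
    """Evaluate one triple pattern over the relation.

    Constant terms act as a selection (sigma); variables (prefixed
    with '?') act as a projection (pi) and are bound in the result.
    Returns an RDF relation: a list of tuples (variable bindings).
    """
    relation = []
    for triple in graph:
        binding, ok = {}, True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value      # projection: bind the variable
            elif term != value:
                ok = False                 # selection predicate fails
                break
        if ok:
            relation.append(binding)
    return relation

print(match(("?s", "rdf:type", "ex:University"), triples))
# [{'?s': 'ex:uws'}, {'?s': 'ex:gcu'}]
```

Joining the relations produced by several such patterns then corresponds to the natural join of relational algebra, which is what makes standard query-optimization results applicable.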
Rafal_Malanij_MSc_Dissertation

  • 1. Semantic Web: Comparison of SPARQL implementations Rafał Małanij Mat.No: B0105363 Thesis Project for the partial fulfilment of the requirements for the Master Degree in Advanced Computer Systems Development. University of The West of Scotland School of Computing 29th September 2008
  • 2. Abstract The Semantic Web is the revolutionary approach to publishing data in the Internet proposed years ago by Tim Berners-Lee. Unfortunately the deployment of the idea became more complex than it was assumed. Although the data model for the concept is well established recently a query language has been announced. The specification of SPARQL was a milestone on the way to fulfil the vision, but the implementation attempts show that there is a need for further research in the area. Some of the products are already available. This thesis is evaluating five of them using the data set based on DBpedia.org. Firstly each of the packages is described taking into consideration the documentation, the architecture and usability. The second part is testing the ability to load efficiently a significant amount of data and afterwards to compute in reasonable time results of the sample queries, which includes the most important structures of the language. The conclusion shows that although some of the packages seem to be very advanced and complex products, they still have some problems with processing queries based on basic specification. The Semantic Web and its key technologies are very promising, but they need some more stable implementations to become popular. 1
  • 3. CONTENTS Contents Table of contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1. Semantic Web. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.1. Origins of the Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2. From the Web of documents to the Web of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3. World Wide Web model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.4. The Semantic Web’s Foundations – the Layer Cake . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.5. The Semantic Web – Today and in the Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2. SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.1. RDF – data model for Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2. Querying the Semantic Web. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2.1. Semantic Web as a distributed database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2.2. Semantic Web queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3. 
The SPARQL query language for RDF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4. Implementation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5. SPARQL’s syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.6. Review of Literature about SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3. The implementations of SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.1. Testing methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.1.1. DBpedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.1.2. Ontology and test queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2. OpenRDF Sesame 2.1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 2
  • 4. CONTENTS 3.2.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.2.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.2.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.2.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.2.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 3.3. OpenLink Virtuoso 5.0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.3.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.3.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.3.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.3.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.3.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 3.4. Jena Semantic Web Framework 2.5.5 with ARQ 2.2, SDB 1.1 and Joseki 3.2. . . . . . 93 3.4.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 3.4.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.4.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.4.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . 98 3.4.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 3.5. Pyrrho DBMS 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 3.5.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 3.5.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 3.5.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 3.5.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 3.5.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.6. AllegroGraph RDFStore 3.0.1 Lisp Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 3.6.1. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.6.2. Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 3.6.3. Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 3.6.4. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 3.6.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 136 3
  • 5. LIST OF FIGURES List of Figures 1.1. W3C’s Semantic Web Logo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2. Semantic Web’s “layer cake” diagram Source: http://www.w3.org/2007/03/layerCake.png, [12.02.2008]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1. Structure of RDF triple, after Passin (2004). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2. RDF statements. Source: DBpedia (http://www.dbpedia.org), RDF/XML vali- dated by http://www.rdfabout.com/demo/validator/validate.xpd, [12.03.2008] . . . . 22 2.3. RDF graph. Based on: DBpedia (http://www.dbpedia.org), [12.03.2008] . . . . . . . . . 24 2.4. RDF statements in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.03.2008] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.5. The history of SPARQL’s specification. Based on SPARQL Query Language for RDF (2008) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6. SPARQL implementation model. Source: Herman (2007a) . . . . . . . . . . . . . . . . . . . . . 32 2.7. The process of transforming calendar data from XHTML extended by hCalendar microformat into RDF triples. Source: GRDDL Primer (2007). . . . . . . . . . . . . . . . . . 35 2.8. Simple SPARQL query with the result. Source: DBpedia (http://www.dbpedia.org), [12.04.2008] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.9. Application of CONSTRUCT query result form with the results of the query seri- alized in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.04.2008] . . 38 2.10. 
SPARQL query presenting universities with its number of students, number of staff and optional name of the headmaster with some filtering applied. Below are the results of the query. Source: DBpedia (http://www.dbpedia.org), [20.04.2008] . 39 2.11. Structure of RDF tuple, after Cyganiak (2005b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.12. Selection (𝜎) and projection (𝜋) operators, after Cyganiak (2005b). . . . . . . . . . . . . . . 44 4
  • 6. LIST OF FIGURES 2.13. SPARQL query transformed into relational algebra tree, after Cyganiak (2005b). . . 45 3.1. The status of datasets interlinked by the Linking Open Data project. Source: http://richard.cyganiak.de/2007/10/lod/lod-datasets/, [12.06.2008]. . . . . . . . . . . . . . . 57 3.2. Querying on-line DBpedia SPARQL endpoint with Twinkle. . . . . . . . . . . . . . . . . . . . 61 3.3. Query testing full-text searching capabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.4. Selective query with UNION clause. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5. Query with numerous selective joins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.6. Query with nested OPTIONAL clauses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.7. CONSTRUCT clause creating new graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.8. ASK query that evaluates the graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.9. Query returning all available triples for the particular resource. . . . . . . . . . . . . . . . . . 65 3.10. Two versions of GRAPH queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.11. Architecture of Sesame. Source: User Guide for Sesame 2.1 (2008). . . . . . . . . . . . . . 68 3.12. The interface of Sesame Server.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.13. Sesame Console with a list of available repositories. . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.14. Sesame Workbench – exploring the resources in the repository based on a native storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.15. 
Graph comparing loading times for OpenRDF Sesame using different storages. . . . 76 3.16. Graph comparing execution times of testing queries against different repositories. . 79 3.17. Architecture of Virtuso Universal Server. Source: Openlink Software (2008). . . . . . . 83 3.18. OpenLink Virtuoso Conductor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.19. OpenLink Virtuoso’s SPARQL endpoint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.20. Interactive SPARQL endpoint with visualisation of one of the test queries. . . . . . . . . 87 3.21. Architecture of Jena Semantic Web Framework version 2.5.5. Source: Wilkinson, Sayers, Kuno & Reynolds (2004). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 3.22. Graph comparing loading times for SDB using different backened. . . . . . . . . . . . . . . 99 3.23. Graph comparing average loading times for SDB using different backened. . . . . . . . 103 3.24. Querying SDB repository using command line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.25. Joseki’s SPARQL endpoint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 3.26. Architecture of Pyrrho DB. Source: Crowe (2007). . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 3.27. Evaluation of the first test query against Pyrrho DBMS using provided RDF client. 113 5
  • 7. LIST OF FIGURES 3.28. Pyrrho Database Manager showing local database sparql with the data stored in Rdf$ table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 3.29. High-level class diagram of AllegroGraph. Source: AllegroGraph RDFStore (2008).119 3.30. The process of loading AllegroGraph server and querying a repository using Alle- gro CL environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 3.31. Graph comparing average loading times the best performing configurations. . . . . . . 133 6
  • 8. LIST OF TABLES List of Tables 3.1. Summary of loading data into OpenRDF Sesame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.2. Summary of evaluating test queries on OpenRDF Sesame. . . . . . . . . . . . . . . . . . . . . . 78 3.3. Summary of loading data into OpenLink Virtuoso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.4. Summary of evaluating test queries on OpenLink Virtuoso. . . . . . . . . . . . . . . . . . . . . 90 3.5. Summary of loading data using SDB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.6. Summary of evaluating test queries on repositories managed by SDB.. . . . . . . . . . . . 106 3.7. Summary of evaluating test queries against Pyrrho Professional. . . . . . . . . . . . . . . . . 116 3.8. Summary of loading data into AllegroGraph repository. . . . . . . . . . . . . . . . . . . . . . . . 123 3.9. Summary of evaluating test queries on AllegroGraph RDFStore. . . . . . . . . . . . . . . . . 125 3.10. Summary of loading data into tested implementations – configurations that had the best performance for each implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 3.11. Summary of performing test queries – configurations that had the best perfor- mance for each implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7
  • 9. INTRODUCTION Introduction In the late 1980’s the Internet was becoming internationally established. However retrieving in- formation from remote computer systems was a challenge due to the lack of unified protocol for accessing information. In the same time Tim Berners-Lee, a physicist in CERN Laboratory in Switzerland, started to work on a protocol that would allow easier access to information distributed over many computers. In 1989, with help from Robert Cailliau, Tim Berners-Lee published a pro- posal for the new service - World Wide Web. That was the beginning of the revolution. Within a few years WWW became the most popular service in the Internet. In 1994 Tim Berners-Lee launched a World Wide Web Consortium (W3C) that started to work on standardising the technologies that were to extend the functionality of WWW. That was the time when webpages became dynamic, but the “golden years” were to come soon. WWW was spotted by the business community and the revolution was spread around the world. Now we can truly say that hyperlinks have revolutionised our life - the way we publish infor- mation, media, the way we buy and sell goods, the way we communicate. Almost everybody in developed countries has personalised email address and treats the Internet as regular tool that helps in everyday life. We can undoubtedly agree that the Internet is one of the pillars of the revolution that is transforming the developed world into a knowledge-driven society. However some visionaries claim that this is not yet the Web of data and information. The meaning of today’s Web content is only accessible for humans. Although search engines have become very powerful tools, the quality of the search results is relatively low. What is more, the results contains only links to webpages, where possibly the information may be found. Users still play the main role in processing information published in the Internet. Tim Berners-Lee was aware of all the imperfections of the Web. 
In the end of the 1990’s he pro- posed the extension to the current Web that he called the Semantic Web. The specialists announced 8
  • 10. INTRODUCTION a revolution – Web 3.0. However the implementation of that vision turned out to be more complex than expected. The revolution was replaced by evolution. In this thesis I will focus on one of the aspects of Semantic Web – handling semantic data. Firstly the vision of the Semantic Web along with basic technologies will be presented. Then I will examine what expectations derive from the Semantic Web’s foundation for the technologies that will be responsible for accessing data on the Web. In the following chapter the W3C’s approach, SPARQL query language, will be presented together with a short introduction into semantic data model and the problem of querying the Semantic Web. SPARQL will be discussed in details including the syntax, the implementation models and a review of available literature about the technology. The practical part of the research will involve a review of a number of available implementations of SPARQL, which are going to be a subject of some basic usability tests. Firstly the methodology will be presented together with a description of the data set used for testing. Then each of examined implementations will be reviewed and tested presenting the findings. Finally the implementations will be compared when possible and some conclusion will be drawn. 9
1. Semantic Web

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” (Berners-Lee, Hendler & Lassila 2001)

1.1. Origins of the Semantic Web

The above quotation comes from one of the best known articles about the Semantic Web1 – “The Semantic Web”, published in 2001 in Scientific American. It is considered the initiator of the “semantic revolution” in IT. In fact, due to its popularity, a worldwide discussion emerged and some implementation efforts commenced, but the first ideas were presented by Tim Berners-Lee earlier in his book, “Weaving the Web: Origins and Future of the World Wide Web” (Berners-Lee & Fischetti 1999).

Figure 1.1: W3C’s Semantic Web Logo

From the very beginning he was thinking about the Web as a universal network, where documents would be connected to each other by their meaning in a way that enables automatic processing of information. In “Weaving the Web” he not only summarised his work on developing the Web in its current form, but was also trying to answer questions about the future of the Web.

1 Google Scholar finds it cited in 5304 articles, which gives it first place for the search phrase “semantic web”. Source: http://scholar.google.co.uk/scholar?hl=en&lr=&q=semantic++web&btnG=Search. Retrieved on 2008.01.29.
Even before his article in Scientific American, Tim Berners-Lee and the scientists gathered around the World Wide Web Consortium (W3C) had started to work on the technologies that would form the basis for the Semantic Web2. They were presenting the vision in numerous lectures around the world and supporting initiatives for deploying these technologies in specific knowledge areas. The first document where ideas about the architecture were described, “Semantic Web Roadmap” (Berners-Lee 1998), was published in September 1998.

1.2. From the Web of documents to the Web of data

The word “semantics”, according to Encyclopædia Britannica Online3, means “the philosophical and scientific study of meaning”. The key is the word “meaning”.

The current version of the Web, implemented in the 1990s, is based on the mechanism of linking between documents published on web servers. However, despite its universality, the mechanism of hyperlinks does not allow a transfer of the meaning of the content between applications. That inability prevents computers from using the Web content to automate everyday activities. Computers just do not understand the information they are processing and displaying, so human involvement is needed to put the information into context and thus exchange semantics between the systems. The same problem occurs when exchanging data between the computer systems used in business. Different standards of storing data in applications require the use of custom-built parsers – this increases costs and complexity and may lead to extraction errors and data inconsistency.

The Semantic Web vision envisages that computers should be able to search, understand and use the information they process with a little help from additional data. However, there are different ideas of what that vision involves. Passin (2004, p.3) lists eight of them.
The most important from the perspective of this thesis is the vision of the Semantic Web as a distributed database. According to Berners-Lee, Karger, Stein, Swick & Weitzner (2000), cited in Passin (2004), the Semantic Web aims to expose all the databases and logic rules, allowing them to interconnect and create one large database. Information should be easily accessed, linked and understood by computers.

2 The first working draft of the RDF specification was published in October 1997. The RDF Model and Syntax specification was released as a W3C Recommendation in February 1999.
3 Encyclopædia Britannica Online, http://www.britannica.com/eb/article-9110293/semantics. Retrieved on 2008.01.29.
Data should be connected by relations to its meaning. That goal can be achieved by extending the existing databases with additional descriptions of data, usually called metadata. That supplementary information enables advanced indexing and discovery of decentralised information. Moreover, searching and retrieval of information will be automated by software agents. These are dedicated applications that communicate with other services and agents on the Web and, with the help of artificial intelligence, can provide improved results or even follow certain deduction processes. The machine-readable data will be accessible as services over the Web, which will allow computers to discover and process easily all the required information. What is more, the great amount of data that is available outside databases, e.g. on static webpages, will be understandable by machines thanks to semantic annotations and defined vocabularies.

1.3. World Wide Web model

Today’s model of the World Wide Web is based on a few simple principles. The most basic one assumes that when a Web document links to another, the linked document can be considered a resource. In the Semantic Web, resources are identified using unique Uniform Resource Identifiers (URIs). In the current Web, resources such as files or web pages are identified by standardised Uniform Resource Locators (URLs), which are a kind of URI extended with the description of the primary access method (e.g. http:// or ftp://). The concept of the URI says that resources may represent tangible things like files as well as non-tangible ideas or concepts, which do not even have to exist but can be thought about. What is more, a resource may be fixed or may change constantly, yet it is still represented by the same URI.

Over the Web, messages are sent using the HTTP protocol4, which consists of a small set of commands and is therefore easy to implement in all kinds of network software (web servers, browsers).
Although some extensions, like cookies or the SSL/TLS encryption layer, are in use, the original version of the protocol does not support security or transaction processing.

Another principle of the WWW is its decentralisation and scalability. Every computer connected to the Internet can host a web server, and this makes the Web easily extendible. There is no central authority that maintains the infrastructure. What is more, every request from client to server is treated independently. The HTTP protocol is stateless, and this makes it possible to cache the responses and decrease network traffic. The Web is open – resources can be added freely. It is also incomplete, which means that there is no guarantee that every resource is always accessible. That implies the next attribute – inconsistency. The information published on-line does not always have to be true. It is possible that two resources contradict each other. Resources are also constantly changing. Due to the features of the HTTP protocol and the utilisation of caching servers it may happen that there are two different versions of the same resource. These aspects raise very serious requirements on software agents that attempt to draw conclusions from data found on the Web.

4 Hypertext Transfer Protocol (HTTP) – a communication protocol used to transfer information between client and server, deployed in the application layer (according to the TCP/IP model). It was originally proposed by Tim Berners-Lee in 1989.

1.4. The Semantic Web’s Foundations – the Layer Cake

The Semantic Web, as an extension of the current Web, should follow the same rules as the current model. Accordingly, all resources should use URIs to represent objects. The Semantic Web refers also to non-addressable resources that cannot be transferred via the network. So far that feature has not been used, as the most popular URIs – URLs – refer to tangible documents. The basic protocol should continue to have a small set of commands and retain no state information. It should remain decentralised and global, and operate with inconsistent and incomplete information, with all the advantages of caching of information.

The W3C, as the main organisation that is developing and promoting standards for the Semantic Web, has created its own approach to its architecture. The first overview was presented in Berners-Lee (1998) and it has been evolving together with the evolution and development of the technologies involved. The W3C published a diagram presenting the structure and the dependencies between them.
All the technologies are shown as layers, where higher ones depend on the underlying technologies. Each layer is specialised and tends to be more complex than the layers below. However, they can be developed and deployed relatively independently. The diagram is known as the “Semantic Web layer cake”.

Figure 1.2: Semantic Web’s “layer cake” diagram. Source: http://www.w3.org/2007/03/layerCake.png, [12.02.2008]

Descriptions of the layers depicted in Figure 1.2 are as follows:

∙ URI/IRI — According to the Semantic Web vision all the resources should have their identifiers encoded using URIs. The Internationalized Resource Identifier (IRI) is a generalisation of the URI, extended by support for the Universal Character Set (Unicode/ISO 10646).

∙ Extensible Markup Language (XML) — A general-purpose markup language that allows users to encode their own data structures. In the Semantic Web, XML is used as a framework to encode data but provides no semantic constraints on its meaning. XML Schema is used to specify the structure and data types used in particular XML documents. XML is a stable technology commonly used for exchanging data. It became a W3C Recommendation in February 1998.

∙ Resource Description Framework (RDF) — A flexible language capable of describing data and metadata. It is used to encode a data model of resources and the relations between them using XML syntax. RDF was introduced as a W3C Recommendation a year later than XML, in February 1999. Semantic data models can also be serialized in alternative notations like Turtle, N-Triples or TriX.

∙ RDF Schema (RDFS) — Used as a framework for specifying basic vocabularies in RDF documents. RDFS is built on top of RDF, extending it with a few additional classes describing relations and properties between resources.
∙ Rule: Rule Interchange Format (RIF) — A family of rule languages used for exchanging rules between different rule-based systems. Each RIF language is called a “dialect” to facilitate the use of the same syntax for similar semantics. Rules exchanged using RIF may depend on, or can be used together with, RDF and RDF Schema or OWL data models. RIF is a relatively new initiative: the W3C’s RIF Working Group was formed in November 2005 and the first working drafts were published on 30 November 2007.

∙ Query: SPARQL — A query language designed for RDF that also includes a specification for accessing data (SPARQL Protocol) and for representing the results of SPARQL queries (SPARQL Query Results XML Format).

∙ Ontology: Web Ontology Language (OWL) — Used to define vocabularies and to specify the relations between words and terms in particular vocabularies. RDF Schema can be employed to construct simple ontologies; however, OWL was designed to support advanced knowledge representation in the Semantic Web. OWL is a family of three sublanguages: OWL-DL and OWL-Lite, based on Description Logics, and OWL-Full, which is a complete language. All three languages are popular and used in many implementations. OWL became a W3C Recommendation in February 2004.

∙ Logic — Logical reasoning draws conclusions from a set of data. It is responsible for applying and evaluating rules, inferring facts that are not explicitly stated, detecting contradictory statements and combining information from distributed sources. It plays a key role in gathering information in the Semantic Web.

∙ Proof — Used for explaining inference steps. It can trace the way an automated reasoner deduces conclusions, validate it and, if needed, adjust the parameters.

∙ Trust — Responsible for the authentication of services and agents, together with providing evidence for the reliability of data.
This is a very important layer, as the Semantic Web will achieve its full potential only when there is trust in its operations and the quality of data.

∙ Crypto — Involves the deployment of a Public Key Infrastructure, which can be used to authenticate documents with a digital signature. It is also responsible for the secure transfer of information.
∙ User Interface and Applications — This layer encompasses tools like personal software agents that will interact with end-users and the Semantic Web, together with Semantic Web Services, which are able to communicate with each other to exchange data and provide value for the users.

The diagram in Figure 1.2 presents the most recent version of the architecture. The original architecture was single-stacked – the layers were placed one after another (except the security layer). However, the years of research on the particular technologies have shown that it is impossible to separate the layers. Kifer, de Bruijn, Boley & Fensel (2005) discuss the interferences between technologies, also taking into consideration technologies that were not developed by the W3C (e.g. SWRL5, SHOE6). The conclusion is that the multi-stack architecture is a better way of showing the different features of the technological basis for the rule and ontology layers. Antoniou & van Harmelen (2004, p.17) suggest that two principles should be followed when considering the diagram: downward compatibility and upward partial understanding. The first one assumes that applications operating on certain layers should be aware of and able to use the information written at lower levels. Upward partial understanding says that applications should at least partially take advantage of information available at higher layers.

1.5. The Semantic Web – Today and in the Future

Although the Semantic Web has strong foundations in research results, not all of the technologies presented in Figure 1.2 are yet developed and implemented. Only the RDF(S)/XML and OWL standards are stable and have available implementations. SPARQL and RIF have appeared quite recently and their implementations are in the development phase. The higher layers are still under research. The existing technologies are becoming popular.
There are many tutorials and books that explain how to deploy RDF or create ontologies. Developers are working within active communities (e.g. http://www.semanticweb.org/). There are many implementations that support the RDF model, including editors, stores for datasets and programming environments7. Some of them are commercial products (e.g. Siderean’s Seamark Navigator used by the Oracle Technology Network portal8), some are being developed by Open Source communities, e.g. Sesame.

Also a number of vocabularies and ontologies have been developed. Very popular vocabularies are Dublin Core9 and Friend of a Friend10, which were created by non-commercial initiatives11. Health care and life sciences is a sector where the need for integrating diverse and heterogeneous datasets evoked the creation of the first large ontologies, e.g. GeneOntology12, which describes genes and gene product attributes, or The Protein Ontology Project13, which classifies knowledge about proteins. Other disciplines are also developing their ontologies, like eClassOwl14, which classifies and describes products and services for e-business, or WordNet15 – a semantic lexicon for the English language. We can find ontologies that integrate data from the environmental sciences (e.g. climatology, hydrology, oceanography) or are deployed in a number of e-government initiatives16. Another source of metadata has arisen along with Web 2.0 portals known as social software. The communities of contributors (folksonomies) interested in particular information describe it with tags or keywords and publish it on-line. Although tagging offers a significant amount of structured data, it is being developed to meet different goals than ontologies, which define data more carefully, taking into consideration relations and interactions between datasets.

Despite its wider adoption, the OWL family needs more reliable tools that support the modelling and application of ontologies and that might be used by non-technical users.

5 Semantic Web Rule Language (SWRL) – a proposal for a Semantic Web rules interchange language that combines a simplified OWL Web Ontology Language (OWL DL and OWL Lite) with RuleML. The specification was created by the National Research Council of Canada, Network Inference and Stanford University and submitted to the W3C in May 2004. Source: http://www.w3.org/Submission/SWRL/. Retrieved on: 16.02.2008.
6 Simple HTML Ontology Extension (SHOE) – a small extension to HTML that allows machine-processable metadata to be included in static webpages. SHOE was developed around 1996 by James Hendler and Jeff Heflin. Source: http://www.cs.umd.edu/projects/plus/SHOE/. Retrieved on: 16.02.2008.
On the other hand, we cannot just choose any URI and search existing data stores – the data exposure revolution has not yet happened (Shadbolt, Berners-Lee & Hall 2006).

7 The list of all implementations is available on the W3C Wiki – http://esw.w3.org/topic/SemanticWebTools.
8 Source: OTN Semantic Web (Beta), http://www.oracle.com/technology/otnsemanticweb/index.html, 2008.02.25.
9 Dublin Core Metadata Initiative, http://www.dublincore.org/
10 The Friend of a Friend (FOAF) project, http://www.foaf-project.org/
11 There are webpages where available vocabularies are listed, e.g. SchemaWeb (http://www.schemaweb.info/).
12 GeneOntology, http://www.geneontology.org/
13 The Protein Ontology Project, http://proteinontology.info/
14 eClassOwl, http://www.heppnetz.de/projects/eclassowl/
15 WordNet, http://wordnet.princeton.edu/
16 The Integrated Public Sector Vocabulary was created in the United Kingdom, http://www.esd.org.uk/standards/ipsv. Retrieved on 1.03.2008.
According to Herman (2007b) the Semantic Web, once only of interest to academia, has already been spotted by small businesses and start-ups. Now the idea is becoming attractive to large corporations and administration. Major companies offer tools or systems based on the Semantic Web concept. Adobe has created a labelling technology that allows metadata to be added to most of their file formats17. Oracle Corporation is not only supporting RDF in their products but is also using RDF as a base for their Press Room18. The number of companies that are participating in W3C Semantic Web Working Groups is increasing. The Corporate Semantic Web was chosen by Gartner in 2006 as the top emerging technology that will improve the quality of content management, system interoperability and information access. They predict that it will take 5 to 10 years for Semantic Web technology to become reliable (Espiner 2006).

Although RDF and OWL are gaining popularity, there is some criticism around these technologies. It is unclear how to extract RDF data from relational databases. It is possible to do it semi-automatically, but current mechanisms still require a huge amount of data to be manually corrected. There will also be an increase in the costs of preparing data if it has to be published in a format accessible to machines (RDF) as well as adjusted for humans to read. The XML syntax of RDF itself is not human-friendly. To overcome that problem the GRDDL19 mechanism was created. It potentially allows binding between XHTML and RDF with the use of XSLT.

Another concern is about censorship: as semantic data will be easily accessible, it will also be easy to filter or block it thoroughly. Authorities may control the creation and viewing of controversial information, as its meaning will be more accessible to automated content-blocking systems. Also, the popularity of FOAF profiles with geo-localisation will decrease users’ anonymity.
There is still a need to develop and standardize functionalities like simpler ontologies, support for fuzzy logic and rule-based reasoning. There are some initiatives, like RIF, to regulate automated reasoning, but there is a lack of standards in that field. Different knowledge domains are implementing different approaches to inference – the most suitable in particular cases. Also, the shape of the layers responsible for trust, proof and cryptography still remains a puzzle. Developing ontologies is an additional challenge, as interoperability, merging and versioning remain unclear. Antoniou & van Harmelen (2004, p.225) find the problem of ontology mapping probably the most complicated, as there is no central control over the application of standards and technologies during the modelling of ontologies in the open Semantic Web environment.

The Semantic Web vision itself has also been criticised. Even Tim Berners-Lee recently said that even though the idea is simple, it still remains unrealized (Shadbolt et al. 2006). Walton (2006, p.109) raises the layered model for discussion, as its present shape implies certain difficulties for the design of software agents – providing a unified view of independent layers might be a challenge.

The Semantic Web, like the current Web, relies on the principle that people provide reliable content. Other important aspects are the fundamental design decisions and their consequences in creating and deploying standards. Both are being fulfilled – particular communities are working on RDF datasets and there is a broad discussion about each of the layers of the Semantic Web focused around the W3C Working Groups. As Shadbolt et al. (2006) say, the Semantic Web contributes to Web Science, a science concerned with distributed information systems operating on a global scale. It is encouraged by the achievements of Artificial Intelligence, data mining and knowledge management.

17 The Extensible Metadata Platform (XMP) is supported by major Adobe products like Adobe Acrobat, Adobe Photoshop or Adobe Illustrator. Adobe has also published a toolkit that allows integrating XMP into other applications. The XMP Toolkit is available under the BSD licence. Source: http://www.adobe.com/products/xmp/index.html
18 Oracle Press Releases, http://pressroom.oracle.com/
19 Gleaning Resource Descriptions from Dialects of Language (GRDDL) became a W3C Recommendation on 11.09.2007, http://www.w3.org/TR/grddl/. Retrieved on 1.03.2007.
2. SPARQL

2.1. RDF – data model for the Semantic Web

The vision of the Semantic Web required a new approach to handling data and metadata when it came to applications. To meet the expectations, in October 1997 the W3C published a working draft for a new universal language to form a basis for the Semantic Web. The Resource Description Framework (RDF) provides a standard way to describe, model and exchange information about resources. It was created as a high-level language and, thanks to its low expressiveness, the data is more reusable. The RDF Model and Syntax Specification became a W3C Recommendation in February 1999. The current version of the specification was published in February 2004.

RDF is in fact a data model encoded with an XML-based syntax. It provides a simple mechanism for making statements about resources. RDF has a formal semantics that is the basis for reasoning about the meaning of an RDF dataset.

The RDF statements are usually called triples, as they consist of three elements: subject (resource), predicate (property) and object (value). Triples are similar to simple sentences with a subject-verb-object structure. The structure of an RDF triple can be represented as a logical formula P(x, y), where the binary predicate P relates object x to object y. Figure 2.1 depicts its structure (Passin 2004).

triple = ( subject, predicate, object ) = ( town1, name, "Paisley" )

Figure 2.1: Structure of an RDF triple, after Passin (2004).

The subject of a triple is a resource identified by a URI. A URI reference is usually presented
  • 22. SPARQL in URL style extended by fragment identifier – the part of the URI that follows “#”1. A fragment identifier relates to some portion of the resource. Also different URI schemes and its variations are allowed, however the generic syntax has to remain as defined. The whole URI should be unique but not necessarily should enable access to resource. The problem with URI arises with names of the objects that are not unique – the mechanism allows anyone to make statements about any resource. Another technique to identify a resource is to refer to its relationships with other resources. The RDF accepts resources that are not identified by any URI. These resources are known as blank nodes or b-nodes and are given internal identifiers, which are unique and not visible outside the application. Blank nodes can only stand as subjects or objects in particular triple. Predicates are special kind of resources, also identified by URIs, that describe relations between subjects and objects. Objects can be named by URIs or by constant values (literals) represented by character strings. These are the only elements that can be represented by plain string. Plain literals are strings extended by optional language tag. Literals extended by datatype URI are called typed literals. Objects are the only elements that can be represented by plain strings. Literals can be extended by the definition of the datatype, then the whole structure is called typed literal. RDF, unlike database systems or programming languages, does not have built-in datatypes – it bases on ones inherited from XML Schema2, e.g. integer, boolean or date. The use of externally defined datatypes is allowed, but in practice not popular (Manola & Miller 2004). The full triples notation requires that URIs are written as the complete name in angle brack- ets. However many RDF applications uses the abbreviated forms for convenience. The full URI reference is usually very long (e.g. 
<http://dbpedia.org/resource/Paisley>). It is shortened to a prefix and resource name (e.g. dbpedia:Paisley). The prefix is assigned to the namespace URI. That mechanism is derived from XML syntax and is known as XML QNames3.

1 The Uniform Resource Identifier (URI) is defined by RFC 3986. The generic syntax is URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]. Source: http://tools.ietf.org/html/rfc3986, [05.05.2008].
2 The XML Schema datatypes are defined in the W3C Recommendation “XML Schema Part 2: Datatypes” (available at: http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/), which is a part of the specification of the XML Schema language.
3 The QNames mechanism is described in “Using Qualified Names (QNames) as Identifiers in XML Content”, available at: http://www.w3.org/2001/tag/doc/qnameids.html.
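The abbreviation mechanism amounts to simple string concatenation. The following Python fragment is a minimal sketch, not tied to any RDF library: the `expand` helper is an invented name, and the prefix table mirrors two namespaces used in this chapter.

```python
# Minimal QName-style expansion: a prefixed name is the namespace URI
# joined with the local name. The prefix table below mirrors two of the
# namespaces used in this chapter.
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(qname):
    """Expand e.g. 'dbpedia:Paisley' to '<http://dbpedia.org/resource/Paisley>'."""
    prefix, _, local = qname.partition(":")
    return "<" + PREFIXES[prefix] + local + ">"

print(expand("dbpedia:Paisley"))
# <http://dbpedia.org/resource/Paisley>
```

Real parsers additionally validate that the prefix has been declared and that the local name is syntactically legal, but the underlying substitution is exactly this.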
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfschema="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:ns="http://xmlns.com/foaf/0.1/"
         xmlns:property="http://dbpedia.org/property/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Paisley">
    <rdfschema:label xml:lang="en">Paisley</rdfschema:label>
    <ns:img rdf:resource="http://upload.wikimedia.org/wikipedia/en/0/0d/RenfrewshirePaisley.png" />
    <ns:page rdf:resource="http://en.wikipedia.org/wiki/Paisley" />
    <rdfschema:label xml:lang="pl">Paisley (Szkocja)</rdfschema:label>
    <property:reference rdf:resource="http://www.paisleygazette.co.uk" />
    <property:latitude rdf:datatype="http://www.w3.org/2001/XMLSchema#double">55.833333</property:latitude>
    <property:longitude rdf:datatype="http://www.w3.org/2001/XMLSchema#double">-4.433333</property:longitude>
  </rdf:Description>
  <rdf:Description rdf:about="http://dbpedia.org/resource/University_of_the_West_of_Scotland">
    <property:city rdf:resource="http://dbpedia.org/resource/Paisley" />
    <property:name xml:lang="en">University of the West of Scotland</property:name>
    <property:established rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1897</property:established>
    <property:country rdf:resource="http://dbpedia.org/resource/Scotland" />
  </rdf:Description>
  <rdf:Description rdf:about="http://dbpedia.org/resource/William_Wallace">
    <property:birthPlace rdf:resource="http://dbpedia.org/resource/Paisley" />
    <property:death rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1305-08-23</property:death>
    <ns:name>William Wallace</ns:name>
  </rdf:Description>
</rdf:RDF>

Figure 2.2: RDF statements. Source: DBpedia (http://www.dbpedia.org). RDF/XML validated by http://www.rdfabout.com/demo/validator/validate.xpd, [12.03.2008]
Figure 2.2 presents a number of triples serialized in RDF/XML syntax using the most basic structures. The preamble of the listing contains the XML declaration and the opening rdf:RDF element, which declares the namespaces (QNames) used in the document. Every subject is placed in an <rdf:Description> tag, extended by its URI placed in the rdf:about attribute. Predicates are called property elements and they are placed within the subject tag. A subject can contain one or multiple outgoing predicates; in Figure 2.2 every subject has a number of properties. Each property states the type of relation and gives the name of the object as an attribute. Properties can also be extended by datatype or language attributes.

There are many methods of representing RDF statements. They can be encoded in XML syntax, but a graph-based view is also a very popular representation. The RDF graph model is a collection of triples represented as a graph, where subjects and objects are depicted as graph nodes and predicates are represented by arcs directed from the subject node to the object node. An example of an RDF graph is presented in Figure 2.3, where the triples from Figure 2.2 were transformed into a graph. The nodes referenced by URIs are shown as oval shapes, while literals are written within rectangles. Every arc is labelled with the URI of the relationship. The graph-based view, due to its simplicity, is often used for explaining the concept of the triple.

Other popular serialization formats for RDF are Notation3 (N3), JSON and Turtle. The RDF triples from Figure 2.2 encoded in Turtle syntax are presented in Figure 2.4. In that case, the triples are shown in the actual subject-verb-object format. Turtle syntax is very straightforward. Every triple is written on one line, terminated by a dot. Long URIs can be replaced by short prefix names declared using the @prefix directive. Literals are simply extended by a language suffix or by a datatype URI.
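The graph view described above can be mimicked directly in code. The sketch below is purely illustrative (prefixed names stand in for full URIs): it indexes triples by subject, so each subject node carries its outgoing labelled arcs, exactly as drawn in Figure 2.3, while incoming arcs are found by scanning objects.

```python
from collections import defaultdict

# Index triples by subject: every subject node keeps its outgoing
# (predicate, object) arcs, mirroring the drawn RDF graph.
# Prefixed names are illustrative stand-ins for full URIs.
triples = [
    ("dbpedia:Paisley", "rdfs:label", '"Paisley"@en'),
    ("dbpedia:University_of_the_West_of_Scotland", "dbpedia_prop:city", "dbpedia:Paisley"),
    ("dbpedia:William_Wallace", "dbpedia_prop:birthPlace", "dbpedia:Paisley"),
]

graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))   # arc labelled p from node s to node o

# Incoming arcs of a node are found by scanning the objects:
incoming = [(s, p) for s, p, o in triples if o == "dbpedia:Paisley"]
print(incoming)
```

Note that the same term (dbpedia:Paisley) appears both as a subject node and as an object node; in the graph view these are one and the same node, which is what lets independently published triples join up.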
Turtle allows some abbreviations – when more than one triple involves the same subject, it can be stated only once, followed by a group of predicate-object pairs separated by semicolons. A similar abbreviation, using commas between objects, can be applied when both the subject and the predicate are constant.
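The shared-subject abbreviation can be illustrated with a small serializer sketch. This is not a complete Turtle writer – terms are passed in as already-formatted strings, and the function name is invented – but it shows how triples sharing a subject collapse into one statement with semicolon-separated predicate-object pairs.

```python
from itertools import groupby

def to_turtle(triples):
    """Emit Turtle-like text, applying the shared-subject abbreviation.
    Terms are assumed to be already-formatted Turtle tokens."""
    lines = []
    # groupby needs its input sorted so equal subjects are adjacent
    for subject, group in groupby(sorted(triples), key=lambda t: t[0]):
        pairs = " ;\n    ".join(f"{p} {o}" for _, p, o in group)
        lines.append(f"{subject} {pairs} .")
    return "\n".join(lines)

triples = [
    ("dbpedia:Paisley", "rdfs:label", '"Paisley"@en'),
    ("dbpedia:Paisley", "dbpedia_prop:latitude", '"55.833333"^^xsd:double'),
    ("dbpedia:William_Wallace", "foaf:name", '"William Wallace"'),
]
print(to_turtle(triples))
```

The two dbpedia:Paisley triples come out as a single statement with the subject written once, while the lone dbpedia:William_Wallace triple stays on its own line.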
Figure 2.3: RDF graph. Based on: DBpedia (http://www.dbpedia.org), [12.03.2008]
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia_prop: <http://dbpedia.org/property/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbpedia:Paisley rdfs:label "Paisley"@en .
dbpedia:Paisley foaf:img <http://upload.wikimedia.org/wikipedia/en/0/0d/RenfrewshirePaisley.png> .
dbpedia:Paisley foaf:page <http://en.wikipedia.org/wiki/Paisley> .
dbpedia:Paisley rdfs:label "Paisley (Szkocja)"@pl .
dbpedia:Paisley dbpedia_prop:reference <http://www.paisleygazette.co.uk> .
dbpedia:Paisley dbpedia_prop:latitude "55.833333"^^xsd:double .
dbpedia:Paisley dbpedia_prop:longitude "-4.433333"^^xsd:double .

dbpedia:University_of_the_West_of_Scotland
    dbpedia_prop:city dbpedia:Paisley ;
    dbpedia_prop:name "University of the West of Scotland"@en ;
    dbpedia_prop:established "1897"^^xsd:integer ;
    dbpedia_prop:country dbpedia:Scotland .

dbpedia:William_Wallace dbpedia_prop:birthPlace dbpedia:Paisley .
dbpedia:William_Wallace dbpedia_prop:death "1305-08-23"^^xsd:date .
dbpedia:William_Wallace foaf:name "William Wallace" .

Figure 2.4: RDF statements in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.03.2008]

RDF has a few more interesting features. One of them is reification, which provides the possibility to make statements about other statements. Reification of a statement can provide information about its creator or usage. It might also be used in the process of authenticating the source of information. Another feature is the possibility to create containers and collections of resources that can be used for describing groups of things. Containers, according to the requirements, can be represented by a group of resources or literals, optionally with a defined order, or by a group whose members are alternatives to each other. A collection is also a group of elements, but it is closed – once created, it cannot be extended by any new members.
RDF provides a simple syntax for making statements about resources. However, to define the vocabulary that will be used in a particular dataset, the RDF Vocabulary Description Language, better known as RDF Schema (RDFS), is needed. RDFS provides a means for describing classes of resources and defining their properties. In addition, a hierarchy of classes can be built. Similarly to object-oriented programming, every resource is an instance of one or more classes described with particular properties. RDFS does not have its own syntax – it is expressed by a predefined set of RDF resources.
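A minimal sketch of such a vocabulary (the ex: namespace and its terms are hypothetical) shows a class hierarchy, a property description and an instance, all expressed as ordinary RDF triples:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/schema#> .

# A small class hierarchy: every University is also an Organisation.
ex:Organisation rdf:type rdfs:Class .
ex:University   rdf:type rdfs:Class ;
                rdfs:subClassOf ex:Organisation .

# A property described with its domain and range.
ex:established rdf:type rdf:Property ;
               rdfs:domain ex:Organisation ;
               rdfs:range  rdfs:Literal .

# An instance of the class defined above.
<http://dbpedia.org/resource/University_of_the_West_of_Scotland>
               rdf:type ex:University .
```

An RDFS-aware application can infer from this graph that the university is also an ex:Organisation; an application without RDFS support simply sees a set of plain RDF triples.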
  • 27. SPARQL The resources are identified with the prefix http://www.w3.org/2000/01/rdf-schema# usually abbreviated to rdfs: QName prefix. To understand the special meaning of the RDFS graph the application has to provide such features, otherwise it is processed as a regular RDF graph. Although the RDF is supported by W3C it is not the only solution for the Semantic Web. Passin (2004, p.60) gives an example of Topic Maps as an ISO standard4 for handling semi- structured data. Topic Maps were originally designed for creating indexes, glossaries, thesauri and similar. However, their features made them applicable in more demanding domains. Topic Maps are based on a concept of topics and associations between topics and their occurrences. All structures have to be defined in ontologies of Topic Maps. The topics are represented with empha- sis on the collocation and the navigation – it is easier to find the particular information and browse closely related topics. Topic Maps can be applied as a pattern for organizing information. They can be implemented using many technologies using native XML syntax for Topic Maps (XTM) or even RDF. Their features make them well suited to be a part of the Semantic Web even though they are not supported by W3C. The RDF is a language that refers directly and unambiguously to a decentralized data model and unlike XML it is straightforward to differentiate information from the syntax. However, that technology has some limitations. According to Jorge Cardoso (2006) RDF with RDFS is not able to express the equivalence between terms defined in independent vocabularies. The cardinality and uniqueness of terms cannot be preserved. What is more the disjointness of terms and unions of classes are impossible to express with the limited functionality of RDF. There is also no possibility to negate statements. 
Antoniou & van Harmelen (2004, p.68) point out another limitation – RDF uses only binary predicates, but in certain cases it would be more natural to model a relation with more than two arguments. In addition, the concepts of properties and reification can be misleading for modellers. Finally, the XML syntax of RDF, while very flexible and accessible for machine processing, is hardly comprehensible for humans. Despite all these disadvantages, RDF retains a good balance between complexity and expressiveness. What is more, it has become a de facto world standard for the Semantic Web and is heavily supported by W3C and developers around the world. 4 Topic Maps were developed as an ISO standard, formally known as ISO/IEC 13250:2003.
  • 28. SPARQL 2.2. Querying the Semantic Web 2.2.1. Semantic Web as a distributed database One of the visions of the Semantic Web says that it is able to provide a common way to access, link and understand data from different sources available on-line. The Web will become a large interlinked database. This revolutionary approach challenges the current state of knowledge in managing data. Currently Relational Database Management Systems (RDBMS) are some of the most advanced software ever written. They are the largest data resources in the world. Over 30 years of experience in research and implementations has resulted in use of sophisticated mech- anisms like query optimization, clustering or retaining ACID properties5. Now the principles of the Semantic Web imply the need of implementing new technologies for managing semantic data. The Semantic Web has its basic data model – RDF. Passin (2004, p.25) says that RDF data model can be compared to Relational Data Model. In relational databases, data is organized in tables, where every row is identified by primary key and has a defined structure. A collection of attributes that forms a row is called a tuple. Every tuple can be divided into a number of RDF triples where the primary key becomes the subject. Tuples can be transformed into triples, but the reverse operation might not be possible. In general, RDF data model is less structured than database. Every table in the relational model has its defined structure which cannot be extended6 – data is structured and the number of attributes (properties) is known. RDF allows adding new triples extending the information about the resource. The triples can be partitioned between different nodes, even the ones that are not accessible. An RDBMS maintains consistency across all the data that it manages. Walton (2006) calls this the closed-world assumption, where everything that is not defined is false. 
On the contrary, in the Semantic Web, false information has to be stated explicitly; anything not stated is simply unknown – this is an open-world model. Thanks to that, RDF is more flexible. However, such an assumption implies the possibility of inconsistency and missing information. The results of a query vary with the availability of datasets. The returned information can be only partial, and its size and computing time are unpredictable. 5 Atomicity, Consistency, Isolation, Durability (ACID) are the basic properties that should be fulfilled by a Database Management System (DBMS) to ensure that transactions are processed reliably. 6 In fact, every RDBMS permits modifications of the table structure (the ALTER TABLE command), but altering the data model in such a way is not a regular operation, so it can be disregarded here.
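The decomposition of a tuple into triples described above can be sketched with a hypothetical university table (the ex: namespace and the table and column names are illustrative):

```turtle
# Relational tuple from a table "university" (primary key: id):
#   (id=42, name='University of the West of Scotland', established=1897)
#
# The same tuple as RDF triples; the primary key forms the subject URI
# and each remaining attribute becomes one triple.
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/university/42>
    ex:name        "University of the West of Scotland" ;
    ex:established "1897"^^xsd:integer .
```

Note that a third party could later add further triples about the same subject – something a fixed relational schema would not allow without altering the table.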
Walton (2006) claims that the Semantic Web data is network-structured rather than relational. In an RDBMS, data is defined by relations between static tables. Queries are performed on a known number of tables using set-based operations. In RDF, the dataset has to be separated from the whole Web of constantly changing stores before it can be queried. The constant change of asserted data implies that the results of queries might be incomplete or even unavailable. What is more, Semantic Web knowledge can be represented in different syntactic forms (RDF with RDFS, OWL), which results in extended requirements for query languages, as they have to be aware of the underlying representation. In addition, the structure of the datasets will be unknown to the querying engines, so they will have to rely on specified web services that will perform the required selection on their behalf. The Semantic Web principles put very strict constraints on the services that will manage and query semantic data. The RDF data model ensures simplicity and flexibility, so the responsibility for the results of queries will be borne by the query languages and automated reasoners. 2.2.2. Semantic Web queries The new data model that was designed for the Semantic Web required new technologies that would allow queries on semantic datasets. New query languages were needed to enable higher-level application development. The inspiration came from well-established Relational Database Management Systems and the Structured Query Language (SQL) that is used there for extracting relational data. However, the relational approach could not be directly translated into the semantic data model. The RDF data model, with its graph-like structure, blank nodes and semantics, made the problem more complex. The query language has to understand the semantics of the RDF vocabulary to be able to return correct information.
That is why XML query languages, like XQuery or XPath, turned out to be insufficient, as they operate on a lower level of abstraction than RDF (Figure 1.4). To effectively support the Semantic Web, a query language should have the following properties (Haase, Broekstra, Eberhart & Volz 2004): ∙ Expressiveness — specifies how complicated the queries that can be defined in the language may be. Usually the minimal requirement is to provide the means proposed by relational algebra. ∙ Closure — assumes that the result of an operation becomes a part of the data model; in the
case of the RDF model, the result of the query should be in the form of a graph. ∙ Adequacy — requires that a query language working on a particular data model uses all its concepts. ∙ Orthogonality — requires that all operations can be performed independently of the usage context. ∙ Safety — assumes that every syntactically correct query returns a definite set of results. Query languages for RDF were developed in parallel with RDF itself. Some of them were closer to the spirit of relational database query languages, some were more inspired by rule languages. One of the first was rdfDB, a simple graph-matching query language that became an inspiration for several other languages. RdfDB was designed as part of an open-source RDF database of the same name. One of its followers is Squish, which was designed to test some RDF query language functionalities. Squish was announced by Libby Miller in 20017. It has several implementations, like RDQL and Inkling8. RQL is based on a functional approach that supports generalized path expressions9. It has a syntax derived from OQL. RQL evolved into SeRQL. RDQL is an SQL-like language derived from Squish. It is a quite safe language that offers limited support for datatypes. RDQL had submission status in W3C but never became a recommendation10. A different approach was used in the XPath-like query language called Versa11, where the main building block is the list of RDF resources. RDF triples are used in traversal operations, which return the result of the query. Other languages include Triple12, a query and transformation language; QEL, a query-exchange language developed as part of the Edutella project13 that is able to work across heterogeneous repositories; and DQL14, which is used for querying DAML+OIL knowledge bases. Triple and DQL represent a rule-based approach.
7 RDF Squish query language and Java implementation available at: http://ilrt.org/discovery/2001/02/squish/, [02.05.2008] 8 Inkling Architectural Overview available at: http://ilrt.org/discovery/2001/07/inkling/index.html, [02.05.2008] 9 RQL: A Declarative Query Language for RDF available at: http://139.91.183.30:9090/RDF/publications/www2002/www2002.html, [02.05.2008] 10 http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/ 11 Specification of Versa is available at: http://copia.ogbuji.net/files/Versa.html, [02.05.2008]. 12 Triple’s homepage is available at: http://triple.semanticweb.org/ , [02.05.2008] 13 Edutella is a p2p network that enables other systems to search and share semantic metadata. Homepage is available at: http://www.edutella.org/edutella.shtml, [02.05.2008]. 14 Specification of DQL is available at: http://www.daml.org/2003/04/dql/dql, [02.05.2008]. 29
The variety of RDF query languages developed by different communities resulted in compatibility problems. What is more, according to Gutiérrez, Hurtado & Mendelzon (2004), different implementations were using different query mechanisms that had not been the subject of formal study, so there were concerns that some of them might behave unpredictably. W3C was aware of all those weaknesses. To decrease redundancy and increase interoperability between technologies, in February 2004 W3C formed the RDF Data Access Working Group (DAWG), which aimed to recommend a query language that would become a worldwide standard. DAWG divided the task into two phases. At the beginning, they wanted to define the requirements for the RDF query language. They reviewed the existing implementations and wanted to choose a query language that would be a starting point for the further work in the next phase. In the second phase they prepared a formal specification together with test cases for the RDF query language (Prud’hommeaux 2004). In October 2004, the First Working Draft of the SPARQL Query Language was published. 2.3. The SPARQL query language for RDF DAWG worked on the SPARQL specification for more than a year. After six official Working Drafts15, in April 2006, DAWG published a W3C Candidate Recommendation for the SPARQL Query Language for RDF. However, the community involved in developing the new standard pointed out several weaknesses of that version of the SPARQL specification, and it was returned to Working Draft status in October 2006. After a few months and one more Working Draft, the specification reached the status of Candidate Recommendation in June 2007. When the exit criteria stated in the document were met (e.g. each SPARQL feature needed to have at least two implementations and the results of the tests were satisfying), the specification went smoothly to the Proposed Recommendation stage in November 2007.
Finally, the SPARQL Query Language for RDF became a W3C Recommendation on 15th January 2008. The word SPARQL is an acronym of SPARQL Protocol and RDF Query Language (SPARQL 15 The official W3C Technical Report Development Process assumes that work on every document starts from the Working Draft. After positive feedback from the community, a Candidate Recommendation is published. When the document gathers satisfying implementation experience, it moves to Proposed Recommendation status. This mature document then awaits approval from the W3C Advisory Committee. The last stage is the W3C Recommendation, which ensures that the document is a W3C standard. Source: World Wide Web Consortium Process Document (2005)
Figure 2.5: The history of SPARQL’s specification. Based on SPARQL Query Language for RDF (2008) Frequently Asked Questions 2008). In fact, the SPARQL query language is closely related to two other W3C standards: SPARQL Protocol for RDF16 and SPARQL Query Results XML Format17. Although SPARQL is a W3C standard, there are twelve open issues waiting to be resolved by DAWG. The SPARQL query language has an SQL-like syntax. Its queries use required or optional graph patterns and return a full subgraph that can be a basis for further processing. SPARQL uses datatypes and language tags. Patterns can also be matched subject to functional constraints. Additional features include sorting the results and limiting their number or removing duplicates. SPARQL does not have the complete functionality that was requested by its users. Some of the features are being implemented as SPARQL extensions. To avoid inconsistency between implementations, W3C keeps a list of official SPARQL Extensions on its Wiki18. The list contains a number of missing features, including proposals for insert, update and delete operations in SPARQL, creating subqueries and using aggregation functions. 16 SPARQL Protocol for RDF defines a remote protocol for transmitting SPARQL queries and receiving their results. It became a W3C Recommendation in January 2008. The specification is available at: http://www.w3.org/TR/rdf-sparql-protocol/. 17 SPARQL Query Results XML Format specifies the format of the XML document representing the results of SELECT and ASK queries. It was recognized as a W3C Recommendation in January 2008. The specification is available at: http://www.w3.org/TR/rdf-sparql-XMLres/. 18 The list is available at: http://esw.w3.org/topic/SPARQL/Extensions, [06.04.2008].
2.4. Implementation model SPARQL can be used for querying heterogeneous data sources that operate on native RDF or have access to an RDF dataset via middleware. The model of possible implementations is presented in Figure 2.6. Middleware in that case maps the SPARQL query into SQL, which operates on RDF data fitted into the relational model. The main advantage of that approach is the possibility of using the advanced features of an RDBMS and benefitting from the years of experience in managing huge amounts of data. However, the approach still requires the semantic data to be accessible as an RDF model. Nowadays a great amount of data is still stored in the relational model. To make it accessible it would have to be transformed into the RDF data model, which would be time-consuming and may not always be possible. Most current computer systems operate on data encapsulated in the relational model, and a revolution in that approach is very unlikely. One of the suggested solutions is the automatic transformation of relational data into the Semantic Web with the help of Relational.OWL (de Laborda & Conrad 2005). Figure 2.6: SPARQL implementation model. Source: Herman (2007a) Relational.OWL is an application-independent representation format based on the OWL language that describes the data stored in the relational model together with the relational schema and its semantic
interpretation. The solution consists of three layers: Relational.OWL at the top, an ontology created with Relational.OWL to represent the database schema, and the data representation at the bottom, which is based on another ontology. It can be applied to any RDBMS. Relational data represented by Relational.OWL is accessible like normal semantic data, so it can be queried by SPARQL. The main advantage of such an approach is the possibility of publishing relational data in the Semantic Web with almost no cost of transforming it to RDF. What is more, changes to the relationally stored data and its schema are automatically transferred to the semantic representation. However, all the imperfections of the database schema affect the quality of the generated ontology. To avoid that, Relational.OWL can be extended with additional manual mapping as described in Pérez de Laborda & Conrad (2006). This approach uses the possibility of generating a graph from query results. The subgraph incorporates the manual adjustments of the original ontology. Such a dataset is mapped to the target ontology and is free from the drawbacks of Relational.OWL’s automatic mapping. The technology is still under development. de Laborda & Conrad (2005) indicate only the possibility of representing relational data as a mature feature. Further studies will be directed towards supporting data exchange and replication. A similar approach is found in the D2RQ language (Bizer, Cyganiak, Garbers & Maresch 2007). This is a declarative language that describes mappings between relational data and ontologies. It is based on RDF and formally defined by the D2RQ RDFS Schema (http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1). The language does not support a data modification language; the mappings are available in read-only mode. D2RQ is a part of the wider solution called the D2RQ Platform.
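A minimal sketch of a D2RQ mapping (the database connection, table and column names are hypothetical; the d2rq: terms follow the published D2RQ vocabulary) might look like this:

```turtle
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix map:  <#> .
@prefix ex:   <http://example.org/schema#> .

# Connection to the relational database.
map:db a d2rq:Database ;
    d2rq:jdbcDSN "jdbc:mysql://localhost/unidb" ;
    d2rq:jdbcDriver "com.mysql.jdbc.Driver" .

# Every row of the "universities" table becomes an instance of ex:University.
map:University a d2rq:ClassMap ;
    d2rq:dataStorage map:db ;
    d2rq:uriPattern "http://example.org/university/@@universities.id@@" ;
    d2rq:class ex:University .

# The "name" column is exposed as the ex:name property of that instance.
map:UniversityName a d2rq:PropertyBridge ;
    d2rq:belongsToClassMap map:University ;
    d2rq:property ex:name ;
    d2rq:column "universities.name" .
```

Given such a mapping, the D2RQ Engine can answer a SPARQL query about ex:name by translating it into an SQL SELECT over the universities table.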
Apart from the implementation of the language, the Platform includes the D2RQ Engine, which translates queries into SQL, and the D2R Server, which is an HTTP server with extended functionality including support for SPARQL. Another interesting implementation of such an approach is Automapper (Matt Fisher & Joiner 2008). The tool is part of a wider architecture that processes a SPARQL query over multiple data sources and returns a combined query result. Automapper uses the D2RQ language to create a data source ontology and a mapping instance schema, both based on a relational schema. These ontologies are used for decomposing a semantic query at the beginning of processing and for translating SPARQL into SQL just before executing it against the RDBMS. To decrease the number of variables and statements
used in processing a query and to improve performance, Automapper uses SWRL rules that are based on database constraints. The solution is available in the Asio Tool Suite, a software package for managing data created by BBN Technologies19. The implementations mentioned above are not the only ones available. The community gathered around MySQL is working on SPASQL20, SPARQL support built into the database. Data integration solutions, like DartGrid or SquirrelRDF21, are also available. Finally, all-in-one suites, like OpenLink Virtuoso Universal Server22, can be used to query non-RDF data stores with SPARQL or other Semantic Web query languages. Mapping relational databases, while having indisputable advantages, also has some limitations. Data in an RDBMS is very often messy and does not conform to widely accepted database design principles. To meet expectations and provide high-quality RDF data, the mapping language has to be very expressive. It should have a number of features, like sophisticated transformations, conditional mappings, custom extensions and the ability to cope with data organized at different levels of normalization. Future users expect the data to be highly integrated and highly accessible. RDF datasets that have a relational background are still not reliable. There is a need for further study of mechanisms for querying multiple data sources, data source discovery and schema mapping, as the current solutions based on RDF and OWL are insufficient. Bridging SPARQL and RDBMSs is the most demanding problem, but such applications will significantly increase the availability of semantic data. However, as depicted in Figure 2.6, it is not the only medium that SPARQL can query. Although very powerful, RDF is a somewhat unwieldy technology. What is more, embedding it into XHTML is rather useless, as applications built around HTML do not recognise it. In addition, transforming data already available in XHTML would need a significant amount of work.
To simplify the process of embedding semantic data into web pages, W3C started to work on a set of extensions to XHTML called RDFa23. RDFa is a set of attributes that can be used within HTML or XHTML to express semantic data (RDFa Primer 2008). 19 BBN Technologies, http://www.bbn.com/. 20 SPASQL: SPARQL Support In MySQL, http://www.w3.org/2005/05/22-SPARQL-MySQL/XTech. 21 SquirrelRDF, http://jena.sourceforge.net/SquirrelRDF/. 22 OpenLink Virtuoso Universal Server Platform, http://www.openlinksw.com/virtuoso/. 23 The first W3C Working Draft was published in March 2006. At the time of writing RDFa still has the same status – the latest Working Draft was published in March 2008.
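A small sketch of RDFa embedded in an XHTML fragment (the resource URI is hypothetical; Dublin Core is used as an example of an independent vocabulary) could look like this:

```html
<!-- The about attribute names the resource being described; each
     property attribute turns the element's text content into an RDF
     triple, e.g. <…/sparql-comparison> dc:title "Semantic Web: …". -->
<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://example.org/thesis/sparql-comparison">
  <span property="dc:title">Semantic Web: Comparison of SPARQL implementations</span>
  was written by
  <span property="dc:creator">Rafal Malanij</span>.
</div>
```

A browser renders only the visible text, while an RDFa-aware agent can extract the dc:title and dc:creator triples from the same markup.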
It consists of the meta and link attributes that already exist in XHTML version 1 and a number of new ones introduced by XHTML version 2. RDFa attributes can extend any HTML element, placed in the document header or body, creating a mapping between the element and the desired ontology and making it accessible as an RDF triple. The attributes do not affect the browser’s display of the page, as HTML and RDF are separated. The most important advantage of RDFa is that there is no need to duplicate data by publishing it both in a human-readable format and as machine-readable metadata. There are no standards for publishing RDFa attributes, so every publisher can create their own. Another benefit is the simplicity of reusing the attributes and extending the already existing ones with new semantics. RDFa in some cases is very similar to microformats. However, whereas each microformat has a defined syntax and vocabulary, RDFa specifies only the syntax and relies on vocabularies created by publishers or independent ones like FOAF or Dublin Core. A microformat is an approach to publishing metadata about content using HTML or XHTML with some additional attributes specific to each format. Every application that is aware of these attributes can extract semantics from the document they were embedded in. They do not affect other software, e.g. web browsers. There are a number of different microformats, most of them developed by the community gathered around Microformats.org. A very popular one is XFN, which is a way to express social relationships with the usage of hyperlinks. Other common microformats are hCard and hCalendar, which are ways to embed information based on the vCard24 and iCalendar25 standards in documents. Figure 2.7: The process of transforming calendar data from XHTML extended by the hCalendar microformat into RDF triples. Source: GRDDL Primer (2007).
SPARQL is also able to query documents that have some semantic information embedded in their content using e.g. microformats. To process a query over such a document, the SPARQL engine needs to 24 The vCard electronic business card is a common standard, defined by RFC 2426 (http://www.ietf.org/rfc/rfc2426.txt), for representing people, organizations and places. 25 iCalendar is a common format for exchanging information about events, tasks, etc., defined by RFC 2445 (http://tools.ietf.org/html/rfc2445).
know the “dialect” that was used for encoding the metadata. Being aware of that barrier, W3C started to work on a universal mechanism for accessing semantics written in non-standard formats. At the end of 2006, they introduced the mechanism for Gleaning Resource Descriptions from Dialects of Languages (GRDDL). GRDDL introduced markup that indicates whether the document includes data that complies with the RDF data model, in particular documents written in XHTML and, generally speaking, in XML. The appropriate information is written in the header of the document. Another markup element links to the transformation algorithm for extracting semantics from the document. The algorithm is usually available as an XSLT stylesheet. The SPARQL engine extracts the metadata from the document, applying transformations fetched from the relevant file, and presents the data as in the RDF data model. The process of transforming metadata encoded in a specific “dialect” into RDF is depicted in Figure 2.7. SPARQL, together with some related technologies, was designed to be a unifying point for all semantic queries. SPARQL engines will be able to serve dedicated applications and other SPARQL endpoints, providing information that they can extract from the documents that are directly accessible to them. Some implementations of this mechanism already exist. One of them is the public SPARQL endpoint to DBpedia26 that is able to return data from other semantic datastores that are linked to its dataset. 2.5. SPARQL’s syntax SPARQL is a pattern-matching RDF query language. In most cases, the query consists of a set of triple patterns called a basic graph pattern. The patterns are similar to RDF triples. The difference is that each of the elements can be set as a variable. That pattern is matched against an RDF dataset. The result is a subgraph of the original dataset where all the constant elements of the patterns are matched and the variables are substituted by data from matched triples.
The pair of a variable and the RDF data matched to it is called a “binding”. A set of related bindings that forms a row in the result set is known as a “solution”. The basic SPARQL syntax is very similar to SQL – it starts with a SELECT clause, called the projection, which identifies the set of returned variables, and ends with a WHERE clause providing a basic graph pattern. Variables in SPARQL are indicated by the $ or ? prefixes. Similarly to the Turtle syntax, URIs 26 The DBpedia public SPARQL endpoint is available at: http://dbpedia.org/sparql, [02.05.2008].
  • 38. SPARQL can be abbreviated using PREFIX keyword and prefix label with a definition of the namespace. If the namespace occurs in multiple places, it can be set as a base URI. Then relative URIs, like <property/>, are resolved using base URI. Triple patterns can be abbreviated in the same way as in Turtle syntax – a common subject can be omitted using “;” notation and a list of objects sharing the same subject and predicate can be written in the same line separated by “,”. The query results can contain blank nodes, which are unique in the subgraph and indicated by “ :” prefix. The simple query to find a name of the university in Paisley from the dataset presented in Figure 2.4 is shown in Figure 2.8 BASE <http://dbpedia.org/> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> PREFIX dbpedia: <property/> SELECT DISTINCT ?city ?uniname WHERE { ?city rdfs:label "Paisley (Szkocja)"@pl . ?uni dbpedia:city ?city . ?uni dbpedia:established "1897"ˆˆxsd:integer . ?uni dbpedia:name ?uniname . } city uniname http://dbpedia.org/resource/Paisley University of the West of Scotland Figure 2.8: Simple SPARQL query with the result. Source: DBpedia (http://www.dbpedia.org), [12.04.2008] SPARQL has a number of different query result forms. SELECT is used for obtaining variable bindings. Another form is CONSTRUCT that returns an RDF dataset build on the graph pattern that is applied to the subgraph returned by the query. This feature can be used to create RDF subgraphs that become a base for the further processing, e.g. Relational.OWL is using it to map automatically created ontology based on relational schema into desired ontology. Figure 2.9 presents the usage of CONSTRUCT clause to build a subgraph according to required pattern. Another two forms are ASK and DESCRIBE. First of them returns a boolean value that indicates if the query pattern matches the RDF graph or not. 
The usage of the ASK clause is similar to that of the SELECT clause; the only difference is that there is no specification of returned variables. DESCRIBE is used to obtain all triples from the RDF dataset that describe the stated URI.
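Against the dataset of Figure 2.4, the two remaining result forms could be used as follows (a sketch; the exact set of triples returned by DESCRIBE is implementation-dependent):

```sparql
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbpedia_prop: <http://dbpedia.org/property/>

# ASK: does the dataset state that William Wallace was born in Paisley?
# For the data in Figure 2.4 the answer is the boolean value true.
ASK { dbpedia:William_Wallace dbpedia_prop:birthPlace dbpedia:Paisley }

# DESCRIBE: return all triples that describe the stated URI.
DESCRIBE dbpedia:Paisley
```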
  • 39. SPARQL PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dbpedia: <http://dbpedia.org/property/> CONSTRUCT {?uni <http://dbpedia.org/property/located_in> ?city. ?uni <http://dbpedia.org/property/has_name> ?uniname } WHERE { ?city rdfs:label "Paisley (Szkocja)"@pl . ?uni dbpedia:city ?city . ?uni dbpedia:established "1897"ˆˆxsd:integer . ?uni dbpedia:name ?uniname . } Returned RDF subgraph serialized in Turtle: <http://dbpedia.org/resource/University_of_the_West_of_Scotland> <http://dbpedia.org/property/located_in> <http://dbpedia.org/resource/Paisley>; <http://dbpedia.org/property/has_name> "University of the West of Scotland"@en. Figure 2.9: Application of CONSTRUCT query result form with the results of the query serialized in Turtle syntax. Source: DBpedia (http://www.dbpedia.org), [12.04.2008] Every query language should provide possibilities to filter the results returned by the generic query. SPARQL uses FILTER clause to restrict the result by adding filtering conditions. Using condi- tions SPARQL can filter the values of the strings with regular expressions defined in XQuery 1.0 and XPath 2.0 Functions and Operators (2007) W3C specification. Also a subset of functions and operators used in XPath27 is available – all the arithmetic and logical functions comes from that language. However SPARQL introduces a number of new operators, like bound(), isIRI() or lang(). All of them are described in detail in the SPARQL Query Language for RDF (2008). There is also a possibility to use external functions defined by an URI. That feature may be used to perform transformations not supported by SPARQL or for testing specific datatypes. After applying filters, SPARQL returns the result of graph pattern matching. However, the list of query solutions is in random order. Similarly to SQL, SPARQL provides a means to modify the set of results. The most basic modifier is ORDER BY clause that orders the solutions according to the chosen binding. 
The solutions can be ordered ascending, using the ASC() modifier, or descending, indicated by the DESC() modifier. It is common that solutions in the result set are duplicated. The keyword DISTINCT ensures that only unique solutions are returned. The REDUCED modifier has similar functionality. However 27 XML Path Language (XPath) is a language for addressing parts of an XML document. It provides the possibility to perform operations on strings, numbers or boolean values. XPath is now available in version 2.0, which has been a W3C Recommendation since January 2007. Source: XML Path Language (XPath) 2.0 (2007)
while DISTINCT ensures that duplicate solutions are eliminated, REDUCED merely allows them to be eliminated. In that case each solution occurs at least once, but no more often than without the modifier. Another two modifiers affect the number of returned solutions. The keyword LIMIT defines the maximum number of solutions to be returned. The OFFSET clause determines the number of solutions to skip before the required data is returned. The combination of these two modifiers returns a particular number of solutions starting at a defined point.

BASE <http://dbpedia.org/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <property/>
SELECT DISTINCT ?uniname ?countryname ?no_students ?no_staff ?headname
WHERE {
  {
    ?uni dbpedia:type <http://dbpedia.org/resource/Public_university> .
    ?uni dbpedia:country ?country .
    ?country rdfs:label ?countryname .
    ?uni dbpedia:undergrad ?no_students .
    ?uni dbpedia:staff ?no_staff .
    ?uni rdfs:label ?uniname .
    FILTER (xsd:integer(?no_staff) < 2000) .
    FILTER (regex(str(?country), "Scotland") || regex(str(?country), "England")) .
    FILTER (lang(?uniname) = "en")
    FILTER (lang(?countryname) = "en")
  }
  OPTIONAL { ?uni dbpedia:head ?headname }
}
ORDER BY DESC(?no_students)
LIMIT 5

uniname                              countryname  no_students  no_staff  headname
Napier University                    Scotland     11685        1648
University of the West of Scotland   Scotland     11395        1300      Professor Bob Beaty
University of Stirling               Scotland     6905         1872      Alan Simpson
Aston University                     England      6505         1,000+
Heriot-Watt University               Scotland     5605         717       Gavin J Gemmell

Figure 2.10: SPARQL query presenting universities with their number of students, number of staff and, optionally, the name of the head, with some filtering applied. Below the query are its results. Source: DBpedia (http://www.dbpedia.org), [20.04.2008]

Supporting only basic graph patterns would in some cases be a very serious limitation.
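Returning to the solution modifiers described above, REDUCED, LIMIT and OFFSET can be sketched in one query. This is an illustrative query in the same DBpedia vocabulary; under REDUCED the exact number of duplicates returned is implementation-dependent:

```sparql
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/property/>

SELECT REDUCED ?countryname
WHERE {
  ?uni dbpedia:country ?country .
  ?country rdfs:label ?countryname .
}
ORDER BY ?countryname
LIMIT 10    # return at most ten solutions...
OFFSET 20   # ...skipping the first twenty
```

With DISTINCT in place of REDUCED the result would be fully deduplicated; REDUCED gives the query engine freedom to deduplicate only when it is cheap to do so.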
SPARQL provides mechanisms for combining a number of small patterns into a more complex set of triples. The simplest one is the group graph pattern, in which all stated triple patterns have to match against the given RDF dataset. A group graph pattern is presented in Figure 2.8. The result of a graph pattern match can be modified using the OPTIONAL clause. An RDF data model is subject to constant change, so
the assumption that all desired information is fully available is too strict. In contrast to group graph pattern matching, the OPTIONAL clause allows the result set to be extended with additional information without eliminating the whole solution if that particular information is inaccessible. When the optional graph pattern does not match, no value is returned and the binding remains empty. If there is a need for a result set containing alternative subgraphs, SPARQL provides a way to match more than one independent graph pattern in one query. This is done by employing the UNION keyword in the WHERE clause to join alternative graph patterns. The result consists of the sequence of solutions that match at least one of the graph patterns. Finally, SPARQL can restrict the source of the data being processed. An RDF dataset always consists of at least one RDF graph, the default graph, which does not have a name. The optional graphs are called named graphs and are identified by URIs. SPARQL usually queries the whole RDF dataset, but the scope can be limited to a number of named graphs. The RDF dataset is specified by URI using the FROM clause, which indicates the active dataset. The representation of the resource identified by the URI should contain the required graph – this can be e.g. a file with an RDF dataset or another SPARQL endpoint. If several datasets are referred to by the FROM keyword, the graphs are merged to form the default RDF graph. To query a graph without adding it to the default dataset, the graph should be referred to by the FROM NAMED clause. In that case the relation between the RDF dataset and the named graph is indirect; the named graph remains independent of the default graph. To switch between the active graphs SPARQL uses the GRAPH clause. Only triple patterns stated inside the clause are matched against the active graph. Outside the clause, triple patterns are matched against the default graph.
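The UNION, FROM, FROM NAMED and GRAPH clauses described above can be combined in one sketch. The graph IRIs and the dbpedia:county property below are hypothetical, chosen only to illustrate the clauses:

```sparql
PREFIX dbpedia: <http://dbpedia.org/property/>

SELECT ?uni ?place
FROM <http://example.org/graphs/default>          # merged into the default graph
FROM NAMED <http://example.org/graphs/scotland>   # kept as an independent named graph
FROM NAMED <http://example.org/graphs/england>
WHERE {
  { GRAPH <http://example.org/graphs/scotland> { ?uni dbpedia:city ?place } }
  UNION
  { GRAPH <http://example.org/graphs/england> { ?uni dbpedia:county ?place } }
}
```

Each GRAPH block is matched only against its named graph, while any pattern written outside the GRAPH blocks would be matched against the default graph; UNION then returns solutions satisfying either alternative.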
The GRAPH clause is very powerful. It can be used not only to obtain solutions from specific graphs, but also to find the graph containing the desired solution. SPARQL is a technology that the whole community has been waiting for. Its official specification regulates access to RDF datastores, which should increase the popularity of the whole concept and cause SPARQL to be regarded not just as a technology for academia, but as a stable solution worth implementing in common data access tools. However, the current specification of SPARQL does not fully meet the requirements. The community has pointed out the lack of data modification functions as one of the most serious issues. Another problem is the inability to use cursors, caused by the stateless character of the protocol. SPARQL does not allow computing or aggregating results; this has to be done by external modules. What is more, querying collections and containers may be complicated, which may be especially inconvenient while processing OWL ontologies. Finally, the lack of support for full-text searching is quite problematic. Apart from these shortcomings, SPARQL is a significant step on the way to the Semantic Web, and also a starting point for research on the higher layers of the Semantic Web “layer cake” diagram. There is still room for improvement and further research, and W3C should consider starting to work on the next version of the SPARQL Query Language.

2.6. Review of Literature about SPARQL

SPARQL Query Language for RDF is a relatively new technology. Indisputably it is gaining popularity within the Semantic Web community, but there has been little research so far on the language itself and its implementability. Google Scholar returns only 2030 search results for the word “sparql”. This is almost nothing compared to the number of search results for the word “rdf” – 237000, or for documents related to “semantic web” – 344000 [28]. Google Scholar is not an objective source of knowledge – the number of results may vary depending on the date and on whether a local version of the search engine is used. However, it shows how big the difference in popularity is between the stable RDF and the brand-new SPARQL. What is more, the number of publications in which the SPARQL query language and its implementation issues are the subject of research is very small. Usually SPARQL appears in the context of a complex architecture implemented to solve a particular problem with the means provided by the Semantic Web. The first complete study of the requirements that a semantic query language has to meet was done in “Foundations of Semantic Web Databases” (Gutiérrez et al. 2004).
According to the paper, the new features of RDF – blank nodes, reification, redundancy, and RDFS with its vocabulary – require a different approach to queries than relational databases. The authors begin by proposing a notion of normal form for RDF graphs, a combination of core and closed graphs. A core graph is one that cannot be mapped onto a proper subgraph of itself. An RDFS vocabulary together with all the triples it applies to is called a closed graph. A remaining problem is the redundancy of triples; the authors describe an algorithm that allows the graph to be reduced. [Footnote 28: The test was performed using http://scholar.google.pl on 6.05.2008.] Even so, computing the normal
and reduced forms of the graph is still very difficult. On this theoretical background a formal definition of an RDF query language is given. A query is a set of graphs, considered within a set of premises, with some of the elements replaced by variables limited by a number of constraints. The answer to a query is a separate and unique graph. A very important property that every query language should have is compositionality – the possibility to compose complex queries from the results of simpler ones. This can be achieved by a union or a merge of single answers. In the first case, the existing blank nodes keep unique names, while when merging result sets the names of the blank nodes have to be changed. The union operation is more straightforward and can create data-independent queries; the merge operator is more useful for querying several sources. Finally, the authors discuss the complexity of answering queries. Similar theoretical deliberations on a semantic query language can be found in “Semantics and Complexity of SPARQL” (Perez, Arenas & Gutierrez 2006a). This time, however, the authors start from the RDF formalization done in Gutiérrez et al. (2004) to examine the graph pattern facility provided by SPARQL. Although the features of SPARQL seem straightforward individually, in combination they create increased complexity. According to the authors, SPARQL shares a number of constructs with other semantic query languages; however, there was still a need to formalize the semantics and syntax of SPARQL. The authors consider the graph pattern matching facility limited to one RDF dataset. They start by defining the syntax of a graph pattern expression as a set of graph patterns related to each other by the AND, UNION and OPTIONAL operators and restricted by FILTER expressions. Then they define the semantics of the query language. It turns out that the UNION and OPTIONAL operators make the evaluation of the query more complex.
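The recursive syntax of graph pattern expressions used in Perez et al. (2006a) can be summarized by the following grammar sketch, where t stands for a triple pattern and R for a built-in filter condition (the notation here is a paraphrase, not a verbatim quotation of the paper):

```
P ::= t                   -- a triple pattern is a graph pattern
    | (P1 AND P2)         -- conjunction (juxtaposition in SPARQL syntax)
    | (P1 OPTIONAL P2)    -- left-optional matching
    | (P1 UNION P2)       -- alternative patterns
    | (P FILTER R)        -- restriction by a condition
```

The complexity results of the paper are stated in terms of this algebra: patterns built from t and AND alone are cheap to evaluate, while adding UNION and OPTIONAL raises the complexity of query evaluation.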
There are two approaches to computing answers to graph patterns. The first one uses operational semantics, which means that the graphs are matched one after another, using intermediate results from the preceding matchings to decrease the overall cost. The second approach is based on bottom-up evaluation of the parse tree, minimizing the cost of the operation using relational algebra. Relational algebra can be applied to SPARQL quite easily; however, there are some discrepancies. The lack of constraints in SPARQL makes the OPTIONAL operator not fully equivalent to its relational counterpart, the left outer join. Further issues are null-rejecting conditions, which are impossible in SPARQL, and the Cartesian product, which is often used in SPARQL. Finally, the authors state a normal form for optional triple patterns that should be followed to design cost-effective queries. It assumes that all patterns outside OPTIONAL should be evaluated before matching the optional patterns.
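In practice, the normal form the authors propose – all non-optional patterns evaluated before any optional ones – corresponds to writing queries in the following shape (an illustrative query reusing the DBpedia vocabulary from the earlier examples):

```sparql
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia: <http://dbpedia.org/property/>

SELECT ?uniname ?headname
WHERE {
  # non-optional patterns first: they bind ?uni and ?uniname
  ?uni dbpedia:type <http://dbpedia.org/resource/Public_university> .
  ?uni rdfs:label ?uniname .
  # the optional pattern last, reusing only variables already bound above
  OPTIONAL { ?uni dbpedia:head ?headname }
}
```

A query in which an OPTIONAL block mentions a variable that is bound only by a later pattern falls outside this normal form and is, by the authors' argument, harder to evaluate efficiently.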
Similar conclusions are drawn when evaluating graph patterns with relational algebra in Cyganiak (2005b). The authors of Perez et al. (2006a) continue their studies on the semantics of SPARQL in “Semantics of SPARQL” (Perez, Arenas & Gutierrez 2006b). The goal of this technical report was to update the original publication with the changes introduced by the W3C Working Draft published in October 2006. The authors extend the definitions of graph patterns stated in the previous paper and discuss support for blank nodes in graph patterns and bag/multiset semantics for solutions. At the beginning, the authors state the basic definitions of RDF and basic graph patterns. Then they define the syntax and semantics of general graph patterns. They also include the GRAPH operator, which selects the graph that is matched against the query. Another extension to Perez et al. (2006a) is the semantics of query result forms; the SELECT and CONSTRUCT clauses are also discussed. Finally, the definition of graph patterns is extended with support for blank nodes and bags. The main problem the authors indicate is the increased cardinality of the solutions. They finish the report with two remarks about query entailment, which was not fully defined at the time of writing. The author of “A relational algebra for SPARQL” (Cyganiak 2005b) does not focus on a generic definition of SPARQL queries. He transforms SPARQL into relational algebra, an intermediate language for the evaluation of queries that is widely used for analysing queries on the relational model. Such an approach has significant advantages – it provides knowledge about query optimization for SPARQL implementers, makes SPARQL support in relational databases more straightforward, and simplifies further analysis of queries over distributed data sources. The author presents only queries over basic graph patterns.
Some special cases are also considered; however, the filtering operator still has to be put under research. At the beginning the author assumes that an RDF graph can be represented as a relational table with three columns corresponding to ?subject, ?predicate and ?object, in which each triple is stored as a separate record. A new term is also introduced: an RDF tuple, an example of which is presented in Figure 2.11, is a container that maps a number of variables to RDF terms and is also known as an RDF solution. Tuple is a universal term used in relational algebra. Every variable present in a tuple is said to be bound. A set of tuples forms an RDF relation. The relations can be transformed into triples and form a dataset.