This document provides a summary of a talk given by Tope Omitola on using linked data for world sense-making. The talk discussed EnAKTing, a project focused on building ontologies from large-scale user participation and querying linked data. It also covered publishing and consuming public sector datasets as linked data, including challenges around data integration, normalization and alignment. The talk concluded with a discussion of linked data services and applications developed by the project to enhance findability, search, and visualization of linked data.
1. World Sense-Making using
Linked Data
Tope Omitola
(joint work with Prof. Nigel Shadbolt)
Faculty Research Seminar Talk,
Birmingham City University, UK
Thurs 8 Dec. 2011
4. Talk Outline
EnAKTing: Its story
From the Web to Semantic Web to Linked Data
Public Sector Datasets: Publication and Consumption
Findability of Appropriate Data Sources – Service Descriptions
Provenance and Trust in Linked Data
5. What is EnAKTing?
EPSRC-funded project.
Addressing 3 key research problems: (1) how to build
ontologies quickly that are capable of exploiting the
potential of large-scale user participation, (2) how to
query an unbounded web of linked data, (3) how to
visualise, explore, browse and navigate this mass of
data.
Project Leaders: Prof. Sir Tim Berners-Lee, Prof. Dame
Wendy Hall, and Prof. Nigel Shadbolt.
6. From the Web to Semantic Web to Linked Data
The Web of Data
Problems with the Web of Documents
RDF
Linked Data
7. The Web of Data
(a.k.a Semantic Web/Linked Data)
Traditional Web of Documents
Internet, Documents, Links
Documents in HTML
Links using URLs
HTTP for document access and transfer
9. Some more problems with Web of Documents
Difficult to Integrate Data
Example Use Case: Making a Travel Plan
Data Integration by looking and typing
Slow Unproductive Workflow
Difficult for apps to make “sense” of HTML text
10. Solutions
Use RDF to give some structure to the data
RDF <-> subject predicate object
RDF links things, not just documents, and the
links are typed
11. RDF is a language (for data)
Words: URIs and literal text
Nouns and Verbs: Classes and Properties
Sentence structure: RDF Statements (triples)
Paragraphs: RDF Graphs
Footnotes: URIs [Domain Name Service]
Dictionaries: RDF Schemas
• Generic grammar for languages of description
• Functions as native language, second language, or pidgin.
12. RDF and Ontology
The AAA Slogan: “Anyone can say Anything
about Any topic.”
s p o . (subject predicate object .)
<http://en.wikipedia.org/wiki/Tony_Benn> <http://purl.org/dc/elements/1.1/title> "Tony Benn" .
RDF is used to build ontologies: formal
representations of shared knowledge as a set
of concepts within a domain and the
relationships between them
Examples: Finance ontology; MusicBrainz,
music ontology; GO, gene ontology, etc
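The subject-predicate-object pattern above can be sketched in a few lines of Python. This is only an illustration, not part of the EnAKTing tooling: the `Triple` type and `to_ntriples` helper are names made up here, and the output follows the N-Triples line format for the Tony Benn statement shown on the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement. Hypothetical helper type for illustration."""
    subject: str                  # a URI
    predicate: str                # a URI
    obj: str                      # a URI or a literal value
    obj_is_literal: bool = False


def to_ntriples(t: Triple) -> str:
    """Serialise one triple as a single N-Triples line."""
    if t.obj_is_literal:
        # Escape backslashes and double quotes inside the literal.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


tony = Triple(
    "http://en.wikipedia.org/wiki/Tony_Benn",
    "http://purl.org/dc/elements/1.1/title",
    "Tony Benn",
    obj_is_literal=True,
)
print(to_ntriples(tony))
```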
13. What is Linked Data?
Data, data, everywhere: We are surrounded
by data: School performance, car fuel
efficiency, etc
Data help us to make better decisions
You can discern the shape and structure of an
entity by looking at the data it generates
Data shapes conversations and markets
14. What is Linked Data?
Linked Data: Framework where data is a first class
citizen on the Web
Evolving the current Web into a Global Data
Space
TimBL: 4 principles of Linked Data
– Use URIs as names for things.
– Use HTTP URIs.
– When someone looks up a URI, provide useful information, using the standards (RDF, etc).
– Include links to other URIs, so that they can discover more things.
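"Looking up a URI" in practice usually means an HTTP GET with an Accept header asking for an RDF serialisation. The sketch below only builds such a request with Python's standard library and does not send it. The helper name `linked_data_request` and the choice of media types are assumptions for illustration; the DBpedia URI is one that appears later in the talk.

```python
import urllib.request


def linked_data_request(uri: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that asks for RDF rather than HTML."""
    return urllib.request.Request(
        uri,
        headers={"Accept": "text/turtle, application/rdf+xml;q=0.9"},
    )


req = linked_data_request("http://dbpedia.org/resource/Cardiff")
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
# To dereference the URI for real (network access needed):
#   with urllib.request.urlopen(req) as response:
#       rdf_text = response.read().decode("utf-8")
```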
15. The Web of Linked Data
Link everything. No silos.
Thing Thing Thing
Thing Thing
Thing
16. The Web of Linked Data
Linked Data (Semantic Web) is a graph
database.
17. Linked Data
Advantage comes from linking the RDF(s)
together.
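A small sketch of why linking matters: two graphs that are disconnected on their own become traversable as one once a link triple joins them. The two Cumbria URIs and owl:sameAs come from later slides in this talk; the `ex:` predicates and objects are placeholders invented for the example, and `reachable` is a hypothetical helper.

```python
from collections import deque

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
STATS_CUMBRIA = "http://enakting.ecs.soton.ac.uk/statistics/data/Cumbria"
OS_CUMBRIA = "http://data.ordnancesurvey.co.uk/id/7000000000024876"

# Two separate graphs as sets of (subject, predicate, object) tuples.
# The ex: terms are placeholders, not real vocabulary.
stats_graph = {(STATS_CUMBRIA, "ex:inDataset", "ex:CrimeStatistics")}
os_graph = {(OS_CUMBRIA, "ex:type", "ex:County")}
links = {(STATS_CUMBRIA, OWL_SAME_AS, OS_CUMBRIA)}


def reachable(graph, start):
    """Return every node reachable from start by following subject -> object."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _p, o in graph:
            if s == node and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


# Merging RDF graphs is a set union of their triples.
merged = stats_graph | os_graph | links
print(sorted(reachable(stats_graph, STATS_CUMBRIA)))
print(sorted(reachable(merged, STATS_CUMBRIA)))
```

On the unlinked graph, the traversal never reaches the Ordnance Survey description; on the merged graph it does.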
19. Linking (Linked) Open Data cloud
linkeddata.org
Many of the datastores are being linked
together to form a network/graph.
20. Linked Data
In summary:
Linked Data provides:
A common data model, RDF
A standardized data access mechanism, HTTP
Hyperlink-based data discovery, using URIs
Self-descriptive data, through using shared
vocabularies
21. Government Linked Data
Explosion of Government (Linked) Open
Data efforts and projects.
data.gov, data.gov.uk, data.gov.au
23. Public Sector Datasets
Inherent value in opening up public government
data
Systems and Services can be tailored to citizens’
priorities.
Likely questions citizens may need answers to
are:
– “Where can I find a good school, a good investment
advisor, a good employer?”
24. Public Sector Datasets (contd.)
Integration of datasets enables more complex
questions to be asked and answered
Some examples:
– http://www.planningalerts.com/
– http://ishortman.com/projects/expendituremap/
Governments freeing up their data.
Holy grail is information integration: Meshing.
25. Issues we focus on
Findability of appropriate data sources
SEARCH: Look at the data sources
EXTRACT: Slicing of data sources
INTEGRATE: Unifying the views
EXPLORE: Answering the questions.
28. Workflow
Identify Dataset
Design/ Select Vocabularies
Extract and convert data into RDF
Publish as Linked Data
Consume Linked Data
(Application)
29. Publishing your data as Linked Data: Some Things to
Consider
How do you choose a good URI to name things? There are
guidelines for this. Examples:
http://dbpedia.org/resource/Wildlife_photography
Tope Omitola @ Univ of Southampton: http://id.ecs.soton.ac.uk/person/24123
Describing a Data Set using: voiD (the Vocabulary of Interlinked
Datasets)
Choosing and Using Vocabularies to Describe Data (SKOS, RDFS,
OWL, scovo)
Sourcing datasets: Where do you get the datasets from (e.g. Semantic
Web search engines, manual search, etc)
Choice of join points: When you have different datasets, where do you
join them together
Data normalization: using RDF makes things easier.
Alignment of datasets
30. Architecture
[Architecture diagram. Labels: Data Sources; Gatherers and Extractors; RDF; RDF Triplestore (4store); SPARQL; Services; Data Integration; Infer new concepts and relationships]
31. Data Publication – Challenges and Solutions
Research Questions:
– In our case, dealing with data that are centred around
the United Kingdom’s democratic system,
– Using geography data from the UK’s Ordnance Survey
as the “join-point” with data for criminal statistics,
Members of Parliament, mortality rates, etc.
Sourcing the datasets
– Many government data sets are in pdf, html, or xls
files, so automatic discovery methods are not possible
(yet),
– Went through manual discovery process, searching for
them,
– We found some in pdf, html, and in xls,
– We decided against pdf and html
32. Data Publication – Challenges and Solutions (contd.)
– We went for data in xls format. Why?
• Ability to source from a wider range of public sector
domains.
Data Source            | Format      | Dataset
Publicwhip.org.uk      | HTML        | MP votes records, etc
Theyworkforyou.com     | XML dump    | Parliament, Parliament expenses
Homeoffice.gov.uk      | Excel       | Recorded crime (England, 2008/09)
Statistics.gov.uk      | Excel       | Hospital Waiting List (England 2008/09)
Performance.doh.gov.uk | Excel       | Mortality rates (England 2008/09)
Ordnancesurvey.co.uk   | Linked Data | UK’s mapping agency
33. Data Publication – Challenges and Solutions (contd.)
Data normalisation.
RDF as our standard model.
Data conversion to RDF. Python + Java.
Modelling the datasets: Multi-dimensional,
used SCOVO.
34. Data Publication – Challenges and Solutions (contd.)
Crime dataset:
Table 7.03 Recorded crime by offence group by police force area, English region and Wales, 2008/09 (numbers of recorded crimes)
Police force area, English region and Wales | Total | Violence against the person | Sexual offences | Robbery | Burglary | Offences against vehicles | Other theft offences | Fraud and forgery | Criminal damage | Drug offences | Other offences
Cleveland | 55,094 | 10,662 | 566 | 404 | 6,175 | 5,224 | 13,697 | 905 | 13,746 | 2,636 | 1,079
Durham | 45,074 | 7,435 | 476 | 170 | 6,226 | 4,940 | 9,674 | 835 | 13,027 | 1,327 | 964
Northumbria | 105,234 | 19,147 | 989 | 732 | 11,418 | 11,620 | 24,042 | 2,909 | 27,178 | 5,166 | 2,033
North East Region | 205,402 | 37,244 | 2,031 | 1,306 | 23,819 | 21,784 | 47,413 | 4,649 | 53,951 | 9,129 | 4,076
:TimePeriod rdf:type owl:Class ; rdfs:subClassOf scovo:Dimension .
:TP2008_09 rdf:type :TimePeriod .
:GeographicalRegion rdfs:subClassOf scovo:Dimension ;
    dc:title "Police force area, English region and Wales" .
:CriminalOffenceType rdf:type owl:Class ; rdfs:subClassOf scovo:Dimension .
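To make the spreadsheet-to-SCOVO step concrete, here is a minimal Python sketch that turns part of one row of the crime table into `scovo:Item` statements in Turtle. It is not the project's actual conversion script. The resource names it generates (for example `:Cleveland_Robbery`) are hypothetical, and the figures are the Cleveland values from the table above. Only `:TP2008_09` and the SCOVO terms (`scovo:Item`, `scovo:dimension`, `rdf:value`) come from the slide and the vocabulary.

```python
# First four value columns of the table, and the Cleveland row.
HEADERS = ["Total", "Violence against the person", "Sexual offences", "Robbery"]
ROW = ("Cleveland", ["55,094", "10,662", "566", "404"])


def slug(label: str) -> str:
    """Turn a column or row label into a CamelCase local name (assumed naming scheme)."""
    return "".join(word.capitalize() for word in label.split())


def row_to_scovo(area, cells, headers, period=":TP2008_09"):
    """Emit one scovo:Item per cell, tagged with time, region and offence dimensions."""
    items = []
    for header, cell in zip(headers, cells):
        value = int(cell.replace(",", ""))  # "55,094" -> 55094
        items.append(
            f":{slug(area)}_{slug(header)} a scovo:Item ;\n"
            f"    rdf:value {value} ;\n"
            f"    scovo:dimension {period} , :{slug(area)} , :{slug(header)} ."
        )
    return items


items = row_to_scovo(ROW[0], ROW[1], HEADERS)
print("\n\n".join(items))
```

Each cell becomes its own item, which is what lets a multi-dimensional table be queried by any combination of period, region and offence type.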
35. Some Issues in Linked Data
Co-referencing, i.e. different sources referring to
the same entities by different names.
Cardiff in DBpedia: http://dbpedia.org/resource/Cardiff or
http://dbpedia.org/resource/Cardiff_City
Cardiff in Geonames: http://sws.geonames.org/2172349/
Which Cardiff shall we use?
Solution: sameas service from Southampton
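One way to think about co-reference resolution is as grouping URIs into equivalence bundles. The sketch below does this with a small union-find structure in Python. It illustrates the idea only and is not how the Southampton service is implemented; the function name `coreference_bundles` is made up, and the equivalence pairs are assumed purely for the example, using the three Cardiff URIs from the slide.

```python
def coreference_bundles(pairs):
    """Group URIs into bundles, given pairs asserted to denote the same thing."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    bundles = {}
    for uri in list(parent):
        bundles.setdefault(find(uri), set()).add(uri)
    return list(bundles.values())


# Assumed equivalences, for illustration only.
pairs = [
    ("http://dbpedia.org/resource/Cardiff", "http://dbpedia.org/resource/Cardiff_City"),
    ("http://dbpedia.org/resource/Cardiff", "http://sws.geonames.org/2172349/"),
]
bundles = coreference_bundles(pairs)
print(bundles)
```

With a bundle in hand, an application can pick any member as the canonical name and rewrite the others to it.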
38. Alignment of Datasets (contd.)
Asserted owl:sameAs relations between dataset geo
and O.S. (using string matching)
For example, the English county of Cumbria was
aligned as the following:
<http://enakting.ecs.soton.ac.uk/statistics/data/Cumbria>
<http://www.w3.org/2002/07/owl#sameAs>
<http://data.ordnancesurvey.co.uk/id/7000000000024876> .
A few special cases, e.g. “Yorkshire and the Humber
Region” vs “Yorkshire & the Humber”.
NHS Trusts were labelled differently: e.g. South Tyneside NHS
Trust had no equivalent in the OS, so we used Google Maps.
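The special cases above show why plain string equality is not enough for alignment. A hedged sketch of label normalisation before matching is given below. The rules (lower-casing, treating "&" as "and", dropping a trailing word "region") are assumptions chosen to cover the Yorkshire example, not the rules the project actually used, and `normalise` and `align` are hypothetical helper names.

```python
import re


def normalise(label: str) -> str:
    """Reduce a region label to a comparable form (assumed rules, see lead-in)."""
    text = label.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]", " ", text)           # drop punctuation
    words = [w for w in text.split() if w != "region"]
    return " ".join(words)


def align(source_labels, target_labels):
    """Map each source label to the target label with the same normal form, or None."""
    index = {normalise(t): t for t in target_labels}
    return {s: index.get(normalise(s)) for s in source_labels}


matches = align(
    ["Yorkshire and the Humber Region", "Cumbria", "South Tyneside NHS Trust"],
    ["Yorkshire & the Humber", "Cumbria"],
)
for source, target in matches.items():
    print(source, "->", target)
```

Labels that still map to `None`, such as the NHS Trust here, are the ones that need a different route, for example a geocoding lookup.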
41. Recap: Data Publication
Sourcing: Many not in RDF yet. Some in html,
pdf, and xls. We chose xls.
Selection of RDF as the normal form.
Used scovo to model multidimensional data.
We used owl:sameAs to assert equivalences
between geo regions.
We used string matching. Some did not work,
e.g. Yorkshire and the Humber. Some have no
equivalent OS entities, so we had to go via
Google Maps API
42. Consuming Linked Data
How do you visualize linked data sets?
Linked Data browsers, e.g. Disco, Tabulator.
Linked Data Search Engines, e.g. Sig.ma,
Falcons, Sindice.
Domain-specific Applications and Mashups,
e.g. dayta.me (from Southampton), US Global
Foreign Aid Mashup.
43. Data Consumption
Application acts as an aggregator of
information based on user’s postal (zip) code.
Generates data views based on geographical
region of postal code.
Shows political representatives (MPs) for
constituencies, their voting records, and their
expenses.
45. Data Consumption (contd.)
Challenges:
– The lack of UIs to quickly browse, search or visualise
views on a wide range of differently modelled data,
– Lack of suitable tools which allow efficient
aggregation and presentation of data to the UI from
multiple datasets,
– Data consumers having partial knowledge of the domain
and finding it difficult to understand the domain and
the data being modelled. This points to the need for a
toolset that helps developers give a better description of
the domain being modelled.
46. Recap: Publish and Consume
Information Integration; one of the holy grails
Problems with data sources. Different formats, etc,
RDF can act as a standard model.
Publication to RDF. Challenges. Solutions.
– scovo for multi-dimensional data
– string matching and its complexities
Consuming the data. Challenges. Solutions.
– Aggregating data based on zip code
– Complexities of geo boundaries
We have re-published the data we generated into the
linked data cloud: EnAKTing datasets
www.enakting.org/enakting/datasets
47. Some of our Outputs
http://geoservice.psi.enakting.org: service to discover
geographical resources,
http://map.psi.enakting.org/: integrate different PSI Linked
Data sources by querying Backlinking service,
http://backlinks.psi.enakting.org: service to discover back-
links in PSI,
http://void.rkbexplorer.com/: describes the contents of
data sets, enabling discovery and reuse of resources,
http://bagatelles.ecs.soton.ac.uk/psi/: platform for
integrating several PSI catalogues from the Web
http://4sreasoner.ecs.soton.ac.uk/ Scalable Reasoning in
4store; 4sr is a branch of 4store where backward chained
reasoning is implemented
http://apps.seme4.com/see-uk/ : Visualization tool for
some UK data
50. Findability of Appropriate Data Sources – Service
Descriptions
How do you tell the world about your new
linked data sets?
Provide good service descriptions of your data
sets
Use vocabulary of Interlinked Datasets
51. Vocabulary of Interlinked Datasets (VoID)
allows description of datasets and their
interlinking, e.g. "there are 200k links of type
gr: predicates between dataset X and dataset Y;
and dataset Y mainly offers data about homes
and X about mortgages".
A dataset: a set of RDF triples published,
maintained or aggregated by a single provider,
and accessible on the Web, e.g.
:DBpedia a void:Dataset .
allows the description of RDF links between
datasets (using void:Linkset).
52. Three Areas of voiD
General Metadata
Access Metadata
Structural Metadata
53. voiD (contd.)
General metadata: the dataset's title,
description, date of creation, the creator,
publisher, licence, subject(s), etc;
:DBpedia a void:Dataset ;
    dcterms:title "DBPedia" ;
    dcterms:description "RDF data extracted from Wikipedia" ;
    dcterms:contributor :FU_Berlin ;
    dcterms:modified "2008-11-17"^^xsd:date ;
    dcterms:contributor :OpenLink_Software .
54. Access metadata: describes how the RDF data(set) can be
accessed
using SPARQL, e.g.
:DBpedia a void:Dataset ;
    void:sparqlEndpoint <http://dbpedia.org/sparql> .
using URI lookup,
:Sindice a void:Dataset ;
    void:uriLookupEndpoint <http://api.sindice.com/v2/search?qt=term&q=> .
using RDF dumps,
:NYTimes a void:Dataset ;
    void:dataDump <http://data.nytimes.com/people.rdf> .
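Once a consumer has read a `void:sparqlEndpoint` value, a query can be sent to it as an HTTP GET with the query text in a `query` parameter, as defined by the SPARQL protocol. The Python sketch below only constructs such a URL and checks that it round-trips; it does not contact the endpoint. The helper name `sparql_get_url` and the example query are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the GET URL for a SPARQL query against the given endpoint."""
    return endpoint + "?" + urlencode({"query": query})


# Example query (assumed): list a few statements about one resource.
QUERY = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Cardiff> ?p ?o } LIMIT 10"

url = sparql_get_url("http://dbpedia.org/sparql", QUERY)
print(url)

# Decoding the URL gives back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == QUERY)
```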
55. Structural metadata describes the structure and schema of
datasets
Naming some representative example entities for
a dataset
Stating if a dataset's entities share a common URI namespace
:DBpedia a void:Dataset ;
    void:uriSpace "http://dbpedia.org/resource/" .
Stating the vocabularies used in a dataset
:LiveJournal a void:Dataset ;
    void:vocabulary <http://xmlns.com/foaf/0.1/> .
Providing statistics about datasets, e.g.
expressing the number of RDF triples or the
number of entities of a dataset.
:DBpedia a void:Dataset ;
    void:triples 1000000000 ; void:entities 3400000 .
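The three kinds of voiD metadata can be assembled programmatically. Below is a minimal Python sketch that builds a Turtle description from a few optional fields, using the DBpedia values shown on these slides. The function `void_description` is hypothetical and deliberately simple: it omits prefix declarations and does not escape quotes inside the title.

```python
def void_description(name, title, sparql_endpoint=None, uri_space=None, triples=None):
    """Return a Turtle snippet describing one void:Dataset (prefixes not included)."""
    lines = [f":{name} a void:Dataset", f'    dcterms:title "{title}"']
    if sparql_endpoint:
        lines.append(f"    void:sparqlEndpoint <{sparql_endpoint}>")
    if uri_space:
        lines.append(f'    void:uriSpace "{uri_space}"')
    if triples is not None:
        lines.append(f"    void:triples {triples}")
    # Predicate-object pairs are separated by ';' and the block ends with '.'.
    return " ;\n".join(lines) + " ."


description = void_description(
    "DBpedia",
    "DBPedia",
    sparql_endpoint="http://dbpedia.org/sparql",
    uri_space="http://dbpedia.org/resource/",
    triples=1000000000,
)
print(description)
```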
56. Publishing voiD files
as void.ttl in the root directory of the site, with a
local “hash URI” for the dataset, e.g.
http://example.com/void.ttl#MyDataset.
Using the root URI of the site, such as
http://example.com/,
as the dataset URI, and serving
both HTML and an RDF format via content
negotiation from that URI.
Embedding the VoID description as HTML+RDFa
into the homepage of the dataset, with a local “hash URI”
for the dataset, yielding a URI such as
http://example.com/#MyDataset.
57. Why is voiD useful -- voiD Discovery
By enabling the discovery and usage of linked
datasets.
A sitemap such as http://www.yoursite.com/sitemap.xml
references void.ttl, and sitemap.xml is added to robots.txt.
A search engine crawls the website, indexing
void.ttl plus a cache of the RDF triples referenced in
this void file.
Through backlinks:
<document.rdf> void:inDataset <void.ttl#MyDataset> .
Through a well-known URI: void.ttl can be placed
in /.well-known/void on any Web server, e.g.
http://www.example.com/.well-known/void .
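The well-known URI route means a client can derive the voiD location from any URL on the same site. A short Python sketch follows, with `well_known_void` as a hypothetical helper name and the example page path made up for illustration.

```python
from urllib.parse import urljoin


def well_known_void(site_url: str) -> str:
    """Return the /.well-known/void URL for the server hosting site_url."""
    # An absolute path in urljoin replaces the whole path of the base URL.
    return urljoin(site_url, "/.well-known/void")


location = well_known_void("http://www.example.com/data/page.html")
print(location)
```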
58.
@prefix void: <http://rdfs.org/ns/void#> .
@prefix scovo: <http://purl.org/NET/scovo#> .
<http://crime.psi.enakting.org/id/void>
    a void:Dataset ;
    foaf:homepage <http://crime.psi.enakting.org/> ;
    rdfs:label "crime.psi.enakting.org Linked Data Repository" ;
    dcterms:date "2010-09-13T11:30:29"^^xsd:dateTime ;
    dcterms:title "crime.psi.enakting.org Linked Data Repository" ;
    foaf:nick "crime" ;
    dcterms:description "United Kingdom's crime statistics per region for the year 2008/09, provided by the United Kingdom Home Office. Dataset provenance: http://www.homeoffice.gov.uk/rds/pdfs09/hosb1109chap7.xls" ;
    dcterms:publisher <http://crime.psi.enakting.org> ;
    void:statItem [
        scovo:dimension void:numberOfTriples ; rdf:value 4988 ; rdfs:label "4,988 triples" ;
    ] ;
    void:subset [
        a void:Linkset ; rdfs:label "crime.psi.enakting.org CRS -> http://data.ordnancesurvey.co.uk/" ;
        void:subjectsTarget <http://crime.psi.enakting.org/id/void> ;
        void:objectsTarget <http://void.rkbexplorer.com/id/dataset/d1d473f29a9091069644824242e9ae07> ;
        void:linkPredicate coref:duplicate ;
        void:statItem [
            rdfs:label "133 URI equivalences" ; rdf:value 133 ; scovo:dimension void:numberOfTriples ;
        ] ] .
60. Provenance and Trust
Mash-ups, aggregation, integration, data re-use.
How do you elicit Reliability and Accuracy?
Generate trust by revealing as much information about
yourself as possible.
Enables consumers to decide the quality and
trustworthiness of your data.
Useful for Data Discovery/Mining + Query
Planning.
61. Different kinds of Provenance
When was x derived (when-provenance).
How was x derived (how-provenance).
What data was used to derive x (what-
provenance).
Who carried out the transformation(s) from
whence x came (who-provenance).
64. Provenance Models for Linked Datasets (contd)
Provenance for Datasets (voidp)
http://www.enakting.org/provenance/voidp/
65. voiD Provenance Extension voidp
Designed to be simple and lightweight.
Mainly for (RDF) data publishers.
Includes necessary information of the
process, its inputs, and outputs.
Basis is simple: An agent runs a process on a
data item (or dataset) to get another data item (or
dataset).
Agent → Process → Data → Data’.
@prefix voidp: <http://purl.org/void/provenance/ns> .
66. voidp Classes and Predicates
voidp:ProvenanceEvent: items under provenance
control.
voidp:actor: actor, person, group, software or physical
artifact, involved in this provenance event.
voidp:certification: used to contain a dataset’s signature
elements.
voidp:contact: contact details of whom to contact should
people have queries about this dataset.
voidp:item: the provenance characteristics of a data item
under provenance control.
voidp:processType: the type of transformation or conversion
procedure carried out on the item’s source.
voidp:resultingDataset: dataset that is the result of this
provenance event.
voidp:sourceDataset: source dataset for the data item under
provenance control.
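The "agent runs a process on a dataset to get another dataset" pattern can be sketched as a small Python record that emits triples using the voidp terms listed above. This is an illustration under assumptions: attaching every property directly to the event is a simplification, and the real voidp schema may structure things differently (for example through voidp:item). The `ex:` identifiers are hypothetical; the two dataset URLs are the Home Office source file and the crime dataset URI from the voiD example earlier.

```python
from dataclasses import dataclass


@dataclass
class ProvenanceEvent:
    """Agent -> Process -> Data -> Data', as one record (assumed structure)."""
    uri: str
    actor: str
    process_type: str
    source_dataset: str
    resulting_dataset: str

    def triples(self):
        """Return the event as (subject, predicate, object) tuples with prefixed names."""
        return [
            (self.uri, "rdf:type", "voidp:ProvenanceEvent"),
            (self.uri, "voidp:actor", self.actor),
            (self.uri, "voidp:processType", self.process_type),
            (self.uri, "voidp:sourceDataset", self.source_dataset),
            (self.uri, "voidp:resultingDataset", self.resulting_dataset),
        ]


event = ProvenanceEvent(
    uri="ex:event1",                      # hypothetical identifier
    actor="ex:conversionScript",          # hypothetical agent
    process_type="ex:SpreadsheetToRDF",   # hypothetical process type
    source_dataset="http://www.homeoffice.gov.uk/rds/pdfs09/hosb1109chap7.xls",
    resulting_dataset="http://crime.psi.enakting.org/id/void",
)
for s, p, o in event.triples():
    print(s, p, o)
```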
http://www.planningalerts.com/: Email alerts of planning applications near a location. Data from screen scraping some local UK councils’ websites.
http://ishortman.com/projects/expendituremap/: Map of public expenditure data by UK region, covering services such as defence, public order, science and technology, agriculture, and transport. Data based on normalised spreadsheet data from the UK’s Office for National Statistics Annual Abstract of Statistics.
Common tasks involved in publishing linked data; the following slides give a brief overview of each stage.
Linked data work mainly consists of Publication, i.e. making your linked data available to the public, and Consumption, i.e. others consuming and using it.
Uses standard SW technologies (RDF, OWL, SPARQL). Uses Garlik JXT triplestore.
Be clear about the questions you are asking:
Data normalisation: Data sources came in different formats. We chose RDF/Turtle for its compactness and clarity. Data conversion to RDF: We used Python scripts and Java (Jena) to convert the files to RDF. Modelling the datasets: Much of the data was multi-dimensional, so we used SCOVO to model it.
Modelling the Home Office datasets: Each row consists of Police Force data. The columns of each row contain crime values for offences such as “Violence against the person”, “Robbery”, and “Offences against vehicles”. We modelled the time period (2008/09), the geo regions, and the different crime types as “scovo:Dimension”.
Very difficult to integrate data from disparate sources
Asserted owl:sameAs relations between the geographic concepts of the datasets and the corresponding relevant entities in the O.S. Admin Geography (using string matching).