2. About Me
• Bachelor's and PhD from the Politehnica University of Bucharest
• Master's in Data Mining from the University Lumière of Lyon
• research on competence management, the Semantic Web, and e-learning
• business experience in career management and recruiting
• currently a fellow of the Romanian-American Foundation at the University of
Rochester (Fulbright scholarship starting in 2017), working on developing
entrepreneurship in Romania
3. Linked Data and Open Data
• linked data = a way to connect data on the web using URIs and RDF; the
most successful result of the Semantic Web initiative
• open data = data that can be freely used, re-used, and redistributed by
anyone, subject at most to the requirement to attribute and share alike
• open government data = data about public institutions, published on
government sites
4. Very Short Intro on RDF
• data is represented as statements
• each statement contains
• a subject
• a predicate
• an object
• subjects and predicates are always URIs; objects are sometimes URIs
• URIs are used to uniquely identify entities or properties
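The statement model above can be sketched in a few lines of Python, keeping triples as plain tuples (a real application would use a library such as rdflib). The example URIs and the population figure are hypothetical illustrations, not data from our portal.

```python
# RDF statements as (subject, predicate, object) tuples.
statements = set()

subject = "http://example.org/resource/Bucharest"     # URI: identifies an entity
predicate = "http://example.org/property/population"  # URI: identifies a property
obj = 1800000                                         # a literal value, not a URI

statements.add((subject, predicate, obj))

# Objects may also be URIs, linking one entity to another:
statements.add((subject,
                "http://example.org/property/country",
                "http://example.org/resource/Romania"))
print(len(statements))
```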
6. Why Do We Need Open Data?
• transparency
• how the government spends money
• fuel innovation and entrepreneurship
• https://www.youtube.com/watch?v=sUqY5ySylXg (Todd Park discussing
benefits of Open Government Data)
• opening weather data and GPS data allowed people to build businesses
• “last year alone civilian and commercial access to GPS created 90 billion $
worth of value” (2013)
• participatory governance
• citizens are enabled to take part in decision making
• “making a full read/write society” (http://opengovernmentdata.org/why/)
7. Open Data Quality
• five stars of open data proposed by Tim Berners-Lee
• (1) be available on the Web under an open licence,
• (2) be in the form of structured data,
• (3) be in a non-proprietary file format,
• (4) use URIs as its identifiers (see also RDF),
• (5) include links to other data sources (see linked data).
http://opendatahandbook.org/glossary/en/terms/five-stars-of-open-data/
8. Open Data in the World
• the Global Open Data Index measures how well governments implement
Open Data
• evaluates if a country posted data on
• national statistics
• government budget
• government spending
• legislation
• election results
• national map
• pollution
• also evaluates the quality of the posted data
• companies
• location datasets
• government procurement
• water quality
• weather forecast
• land ownership
• transport timetables
• health performance
9. Global Open Data Quality
• relevant progress has been made in opening data
• scores would be much lower if 5-star data had a bigger weight
• http://index.okfn.org/place/
• http://index.okfn.org/methodology/
10. Open Data in the US
• data.gov – 190k datasets
• mostly html (70k)
• RDF below 5% of the total number of datasets
• more than a quarter are either pdf, jpg, tiff
• relevant steps
• data.gov launched in 2009
• Open Government Partnership 2011 (http://www.opengovpartnership.org/)
• Digital Accountability and Transparency Act (2014)
• creating publishing standards for public spending data
https://max.gov/maxportal/assets/public/offm/DataStandardsFinal.htm
11. Open Data in Saint Louis
• https://www.stlouis-mo.gov/data/ - list of data sets
• most of them html or pdf
• some confuse open data with reports
12. Open Data in Romania
• Data.gov.ro
• the national portal where public institutions publish their data
• Types of resources published: CSV (***), PDF (*), XLS (**)
• There is no connection between files (no files with 4 or 5 stars)
• September 2016:
• 72 public institutions
• 8185 files
• Each file can have its own structure
• uses CKAN (http://ckan.org/)
13. Why Do We Need Linked Open Data?
• classic workflow when working with open data:
• analyze CSV files
• define own data model
• import data from CSV files into data model
• solve import problems (naming differences, character encoding issues)
• identify entities and link them to other entities existing in the model
• link data from different CSV files in a common model
• extract relevant information
• write the program logic to exploit the data
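The pain of this workflow can be sketched stdlib-only. The two CSV files, their column names, and the counts below are invented for illustration; they stand in for real open-data exports.

```python
# Classic workflow: import two CSVs, fix naming/encoding differences by
# hand, then link the rows in a common model.
import csv
import io
import unicodedata

def normalize(name):
    # Solve an import problem: strip diacritics and case so that
    # "Brașov" and "BRASOV" refer to the same entity.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c)).lower()

# Hypothetical CSV files with inconsistent naming.
hospitals_csv = io.StringIO("city,hospitals\nBrașov,4\nCluj-Napoca,9\n")
schools_csv = io.StringIO("locality,schools\nBRASOV,31\nCluj-Napoca,58\n")

hospitals = {normalize(r["city"]): int(r["hospitals"])
             for r in csv.DictReader(hospitals_csv)}
schools = {normalize(r["locality"]): int(r["schools"])
           for r in csv.DictReader(schools_csv)}

# Link data from the two files in a common model.
merged = {c: (hospitals[c], schools.get(c)) for c in hospitals}
print(merged["brasov"])  # (4, 31)
```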
14. Why Do We Need Linked Open Data
• classic workflow when working with linked open data:
• analyze models
• write query to extract relevant information
• write the program logic to exploit the data
• can directly use more than one dataset by performing “joins” in the
queries
• much faster to develop an application
• much easier to reuse data
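What those “joins” amount to can be sketched over plain (subject, predicate, object) tuples; in practice a SPARQL engine does this against the triple store, and the abbreviated URIs here are hypothetical.

```python
# Two datasets used directly together - no import/cleaning step needed.
places = {
    ("ex:Brasov", "rdf:type", "dbo:Settlement"),
    ("ex:Cluj", "rdf:type", "dbo:Settlement"),
}
schools = {
    ("ex:School1", "op:located_in", "ex:Brasov"),
}
graph = places | schools

# Roughly: SELECT ?loc WHERE { ?loc a dbo:Settlement .
#          FILTER NOT EXISTS { ?x op:located_in ?loc } }
settlements = {s for s, p, o in graph
               if p == "rdf:type" and o == "dbo:Settlement"}
with_schools = {o for s, p, o in graph if p == "op:located_in"}
no_schools = settlements - with_schools
print(no_schools)  # {'ex:Cluj'}
```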
15. Linked Open Data in Romania
• Our goal is to transform open data from Romania into Linked Open
Data.
• Transform data into RDF triples (Subject, Predicate, Object)
• Link entities with existing online resources, especially from dbpedia.org and
Geonames
• Create a platform where each published file is transformed into RDF
• Create rich applications using SPARQL queries
16. Vision
• create tools and workflows to allow non-technical users to add Linked
Data to the government website
• offer an API for developers who want to create apps based on open
government data
• integrate the software into CKAN (the open data portal used by most
governments) to allow every government to create linked data
17. Stages
1. modeling data
2. massively transforming data
3. linking data to external data sets
4. embed into CKAN
18. First Stage – Modeling Data
• Identify the most commonly used ontologies
• Create naming rules so that the same resource always receives the same
URI
• Identify the most common properties of the open data and the
ontological properties associated with them
• Identify the most common naming problems
• different encodings
• different spelling
• different lexicalization of the same concepts
and write hacks to solve them
19. Open Data Types
• Numerical data:
• Different budgets or revenues
• Different statistical data: number of cars per type per year, number of beds per hospital
• Etc.
• “Plain” data:
• Information about entities:
• Lawyers, Schools, Pharmacies, Museums, Archeological sites
• Etc.
• Found in tabular files, such as CSV or XLS
21. Most Common Vocabularies for Open Data
• Dublin Core (DCTerms, DCE) – describes metadata terms
(http://dublincore.org/schemas/)
• SKOS – Simple Knowledge Organization System – representing
taxonomies
• FOAF – Friend of a Friend – representing people and the relations
between them
• CC – Creative Commons – copyright information
• GEO – Geonames – data about locations
• VANN – data about vocabularies
• DBPedia – structured data extracted from Wikipedia
22. Ontologies used
• We mainly used OWL classes defined by dbpedia.org, such as:
• http://dbpedia.org/ontology/Location
• http://dbpedia.org/ontology/Place
• http://dbpedia.org/ontology/PopulatedPlace
• http://dbpedia.org/ontology/Museum
• http://dbpedia.org/ontology/Hospital
• http://dbpedia.org/ontology/EducationalInstitution
• http://dbpedia.org/class/yago/CitiesInRomania
• Other used OWL classes:
• http://umbel.org/umbel/rc/Village
• https://schema.org/PostalAddress
23. Naming rules for creating URIs
• Each URI has the prefix http://opendata.cs.pub.ro/resource
• Our goal is to make URIs for each resource as easy to understand as
possible for humans
• Our statement is: Once you read the URI, you know what it is about
• For example:
• Locality: http://opendata.cs.pub.ro/resource/<localityName>_judet_<localityCounty>
• Hospital: http://opendata.cs.pub.ro/resource/<hospitalName>_hospital_<localityCounty>
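The naming rule can be sketched as follows. The helper functions are ours, invented for illustration; the URI patterns are the ones above.

```python
# Build human-readable URIs following the deck's naming patterns.
PREFIX = "http://opendata.cs.pub.ro/resource/"

def locality_uri(locality_name, county):
    # Pattern: <localityName>_judet_<localityCounty>
    return f"{PREFIX}{locality_name}_judet_{county}"

def hospital_uri(hospital_name, county):
    # Pattern: <hospitalName>_hospital_<localityCounty>
    return f"{PREFIX}{hospital_name}_hospital_{county}"

print(locality_uri("Sinaia", "Prahova"))
# http://opendata.cs.pub.ro/resource/Sinaia_judet_Prahova
```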
24. Most common properties
• Most used properties were taken from well-known vocabularies such
as:
• VCARD: vcard:region, vcard:locality
• FOAF: foaf:mbox, foaf:fax
• GEO: geo:lat, geo:long
• Other properties were taken from those defined by dbpedia.org:
• http://dbpedia.org/property/postcode
• http://dbpedia.org/property/phonenumber
• We also defined properties in our own namespace:
• http://opendata.cs.pub.ro/property
25. Most Common Naming Problems and How We Solve Them
• The resource’s name was written with diacritics
• Replace diacritics with the plain letter
• The resource’s name contained non-alphanumeric characters, such as
spaces, hyphens, or commas
• Replace them with underscores
• After initially choosing the naming convention, we found that conflicts can
appear
• For example, we initially built the URI for a museum from the museum’s name
alone, but museums in different towns can share a name, so we added the
museum’s town to the URI
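A minimal sketch of these cleanup hacks; the function names are ours, chosen for illustration.

```python
# Normalize entity names into URI-safe form, then disambiguate by town.
import re
import unicodedata

def clean(name):
    # Replace diacritics with the plain letter (ș -> s, ă -> a, ...).
    decomposed = unicodedata.normalize("NFKD", name)
    no_diacritics = "".join(c for c in decomposed
                            if not unicodedata.combining(c))
    # Replace spaces, hyphens, commas, etc. with underscores.
    return re.sub(r"[^A-Za-z0-9]+", "_", no_diacritics).strip("_")

def museum_uri(museum_name, town):
    # The town is part of the URI so that same-named museums do not clash.
    return f"http://opendata.cs.pub.ro/resource/{clean(museum_name)}_{clean(town)}"

print(museum_uri("Muzeul de Istorie", "Brașov"))
# http://opendata.cs.pub.ro/resource/Muzeul_de_Istorie_Brasov
```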
28. Stage 2: Massively Transforming Data
• experiment with 20 students from a master’s class
• groups of 2 were asked to choose 2 datasets and transform them
• + greatly increased the amount of transformed data
• − a large amount of work to correct the errors introduced by students
• we need to involve a larger number of volunteers
• students will be asked to offer expert support
• 2016 result: 10 new datasets, more than 500,000 triples added
29. Biggest Problem: PDF Files
• Unfortunately, tabular data is often hidden in scanned PDF files
• We created an algorithm to extract only the tables from these scanned files
• This way, we transform the unmanageable scanned files into tabular ones
• We want to improve existing tools using contextual information about the
type of document
30. Stage 3 – Linking to External Data Sets
• most important datasets:
• dbpedia
• people
• events
• places
• geonames
• all the places
31. How to link?
• query and disambiguate
• sometimes really difficult
• disambiguation
• by type
• by context
• not always possible automatically
32. Relevant tool: SILK
• http://silkframework.org/
• Generating links between related data items within different Linked Data
sources.
• Linked Data publishers can use Silk to set RDF links from their data sources to
other data sources on the Web.
• Applying data transformations to structured data sources.
33. or write some code
import rdflib
from rdflib.namespace import OWL

gn = rdflib.Namespace("http://www.geonames.org/ontology#")

# loc, strip_accents, fullGraph and locURI come from the surrounding script
geoloc = loc.decode("utf8")[:-1]
query = strip_accents("http://api.geonames.org/search?q=%s&maxRows=1&type=rdf&username=vladposea" % geoloc)
g = rdflib.Graph()
g.parse(query.encode("ascii", "ignore"), format="xml", encoding="utf-8")

# link our locality URI to the matched GeoNames feature
for s, p, o in g.triples((None, rdflib.RDF.type, gn.Feature)):
    fullGraph.add((locURI, OWL.sameAs, s))

example response:
<rdf:RDF>
  <gn:Feature rdf:about="http://sws.geonames.org/686254/">
  ...
34. Stage 4 – Embed into CKAN
• CKAN - http://ckan.org/
• tool for publishing data
• aimed at governments and other public organizations
• specially designed for open data
• used internationally
• not built for linked data
• we envision developing a plugin to semi-automatically construct
linked data from the published open data
35. What do we have so far?
• Our focus was on “plain” data:
• Cities dataset published in RDF and linked with geonames.org
• Each created resource has a <owl:sameAs> property that links to geonames.org
• Schools dataset published in RDF
• Pharmacies dataset published in RDF
• Museums dataset published in RDF and linked with dbpedia.org
• Churches dataset published in RDF and linked with dbpedia.org
• 207382 URIs with overall 2683968 RDF triples
36. How Did We Transform the Data?
• For each dataset:
• We identified which vocabulary should be used for each property
• We identified which additional properties should be created for each resource
• Each physical entity has an address; using the Google Geocode service we obtained the
geographical coordinates for that address
• We created one unique URI for each resource
• We generated the URIs by putting a lot of information inside them; for example, the URI
for one school is: http://opendata.cs.pub.ro/resource/<school_name>_<city>
• We opted for this encoding scheme to create more verbose URIs, not just hashes
• We linked each resource, where possible, using online semantic repositories such as
dbpedia.org and geonames.org
• The linking is done by searching for entities with the same type and name
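The geocoding step can be sketched as follows. The endpoint and parameters are those of the public Google Geocoding API; the address and the key are placeholders, not values from our pipeline.

```python
# Build a Google Geocoding API request URL for an entity's address.
from urllib.parse import urlencode

def geocode_url(address, api_key):
    base = "https://maps.googleapis.com/maps/api/geocode/json"
    return base + "?" + urlencode({"address": address, "key": api_key})

url = geocode_url("Piața Sfatului 30, Brașov, Romania", "YOUR_API_KEY")
print(url)
# The JSON response exposes results[0]["geometry"]["location"] -> lat/lng,
# which can then be stored as geo:lat / geo:long on the resource.
```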
37. How can someone access the resources?
• We have published all RDF triples in a semantic repository:
• http://opendata.cs.pub.ro/repo
• It supports SPARQL queries
• http://opendata.cs.pub.ro/repo/sparql
• We document all published datasets in:
• Blog: http://opendata.cs.pub.ro/blog
• Wiki: http://opendata.cs.pub.ro/wiki
38. SPARQL queries
Towns where there are no schools:

SELECT ?loc
WHERE {
  ?loc rdf:type <http://dbpedia.org/ontology/Settlement> .
  FILTER NOT EXISTS {
    ?x <http://opendata.cs.pub.ro/property/institutie_in_localitate> ?loc .
  }
}

Find the museums linked with dbpedia.org:

SELECT ?MusRO ?MusDB
WHERE {
  ?MusRO rdf:type <http://dbpedia.org/ontology/Museum> .
  ?MusRO owl:sameAs ?MusDB .
}
ORDER BY ?MusRO
39. Example application
• All physical entities have an address, and we obtained the
geographical coordinates of this address
• We put all these entities on a map, so anyone can see the museums,
hospitals, or pharmacies nearest to their location
• The app is online:
• http://opendata.cs.pub.ro:3000
41. Technologies used in this project
• Storage layer:
• Apache Marmotta HEAD version
• Processing layer:
• JAVA using Apache POI for reading tabular data and Apache Jena for
converting data to RDF
• C with OpenCV and Tesseract for extracting tabular data from scanned PDF
files
• Visualization layer:
• Backend: node.js using sparql-client module for SPARQL queries
• Frontend: angular.js
42. Alternative Technologies
• Open Refine
• http://openrefine.org/
• formerly Google Refine
• allows users to
• explore data in various formats
• clean and transform data (clustering, easy or scripted transformations)
• reconcile and match data
• supports external web services
43. Karma
• semantic mapping tool http://labs.europeana.eu/apps/karma
• imports data in various formats
• transforms it to semantic data
• links it to DBPedia or GeoNames
• no features for statistical data integration
• no features for parsing pdf files
44. Named Entity Recognition
• Named Entity Recognition – identify entities in texts, apply tags, link
to permanent entities
• Open Calais – up to 5k free requests/day
• http://www.opencalais.com/
• Alchemy – made by IBM
• http://www.alchemyapi.com/
• 1k/day free
45. Apache Marmotta
• http://marmotta.apache.org/
• read–write linked data server
• open implementation of the W3C Linked Data Platform
Recommendation (https://www.w3.org/TR/ldp/)
• repository
• SPARQL 1.1 engine
46. RDF Data Cube Vocabulary
• statistical data can’t be expressed using just subject, predicate, and
object
• RDF is a graph; statistical data is a hypergraph
• RDF Data Cube (https://www.w3.org/TR/vocab-data-cube/) – a W3C
recommendation for a vocabulary to describe multi-dimensional data
• compatible with the Statistical Data and Metadata eXchange (SDMX)
standard
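One Data Cube observation can be sketched as plain triples. The qb: namespace is the real W3C vocabulary; the dataset, dimension, and measure URIs are hypothetical ones in our namespace, and the bed count is invented.

```python
# A single qb:Observation: one cell of a statistical "cube",
# identified by its dimensions (year, county) and carrying one measure.
QB = "http://purl.org/linked-data/cube#"
OP = "http://opendata.cs.pub.ro/property/"
EX = "http://opendata.cs.pub.ro/resource/"

obs = EX + "obs_hospital_beds_Brasov_2016"
triples = {
    (obs, "rdf:type", QB + "Observation"),
    (obs, QB + "dataSet", EX + "dataset_hospital_beds"),
    (obs, OP + "year", "2016"),           # dimension
    (obs, OP + "county", EX + "Brasov"),  # dimension
    (obs, OP + "beds", "1200"),           # measure
}
print(len(triples))  # 5
```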
47. Plan for the future
• Develop an automated way to choose the vocabulary for one dataset
• Focus on statistical data and publish them using RDF Data Cube
vocabulary
• Develop a more accurate method of linking resources
• Create more applications that use the published data
48. Papers
• LODRo: Using cultural Romanian open data to build new learning
applications
Octavian Rinciog, Vlad Posea, The International Scientific Conference eLearning and
Software for Education, Bucharest, 2016
• Publishing Romanian public health data as Linked Open Data
Octavian Rinciog, Vlad Posea, E-Health and Bioengineering Conference (EHB), Iasi,
2015
• The Semantic Representation of Open Data Regarding the Romanian
Companies
Marian Spoiala, Octavian Rinciog, Vlad Posea, RoEDU Conference, Bucharest, 2016
• GovLOD: Towards a Linked Open Data Portal
Octavian Rinciog, Vlad Posea, Poster in ISWC Conference, Tokyo, 2016