Graph-based Ontology Analysis in the Linked Open Data
1. Graph-based Ontology Analysis in the Linked Open Data
Lihua Zhao, Ryutaro Ichise
September 5, 2012, I-Semantics2012, Graz, Austria
2. Outline
Introduction
Related Work
Our Approach
Graph Pattern Extraction
<Predicate, Object> Collection
Related Classes and Predicates Grouping
Integration for All Graph Patterns
Manual Revision
Experiments
Experimental Data
Graph Patterns of Linked Instances
Class-level Analysis
Predicate-level Analysis
Comparison with Previous Work
Conclusion and Future Work
Lihua Zhao, Ryutaro Ichise | Graph-based Ontology Analysis in the Linked Open Data | 2
3. Introduction
Linked Open Data (LOD)
295 data sets, 31 billion RDF triples (as of Sep. 2011).
Interlinked instances (owl:sameAs).
[Figure: LOD cloud diagram, as of September 2011. Legend: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
4. Challenging Problems
It is infeasible to understand the ontology schemas of all linked data sets.
Ontology heterogeneity problem
Heterogeneous ontology classes
DBpedia: http://dbpedia.org/ontology/Country.
Geonames: http://www.geonames.org/ontology#A.PCLI.
LinkedMDB: http://data.linkedmdb.org/resource/movie/country.
Heterogeneous ontology predicates
http://dbpedia.org/property/populationTotal.
http://dbpedia.org/property/population.
Time-consuming and infeasible to inspect large ontologies
Misuse of classes and predicates
DBpedia: 320 classes and thousands of predicates.
5. Solution for the Problems
Automatically or semi-automatically integrate different ontologies
by analyzing interlinked instances.
Semi-automatic ontology integration
Reduce the ontology heterogeneity.
Identify important ontology classes and predicates that link instances.
Produce a simple integrated ontology that is easy to understand.
Simplify the queries on various data sets.
6. Related Work
Find useful attributes from frequent graph patterns. [Le, et al.,
2010]
Only for geographic data.
Analysis of basic predicates of SameAs network, Pay-Level-Domain
network and Class-Level Similarity network. [Ding, et al., 2010]
Only frequent types are considered to analyze how data are connected.
A debugging method for mapping lightweight ontologies. [Meilicke,
et al., 2008]
Limited to the expressive lightweight ontologies.
Construct intermediate-layer ontology from geospatial, zoology, and
genetics data resources. [Parundekar, et al., 2010]
Only for specific domains, and only considers the class level.
Construct an integrated mid-ontology from DBpedia, Geonames,
and NYTimes. [Zhao, et al., 2011]
Needs a hub data set, and only considers the predicate level.
7. Our Approach
8. Step 1: Graph Pattern Extraction
9. Graph Pattern Extraction
Extract graph patterns from interlinked instances to discover
related ontology classes and predicates.
SameAs Graph SG = (V, E, I), where V is a set of data set labels, E ⊆ V × V is the set of edges between data sets, and I is the set of URIs of the interlinked instances.
Example: SG_Austria = (V, E, I)
V = {D, G, N, M}
E = {(D,G), (D,N), (G,N), (G,M)}
I = {db:Austria, geo:2782113, nyt:66221058161318373601, mdb-country:AT}
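The extraction of a SameAs Graph such as the Austria example can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the owl:sameAs links are available as pairs of prefixed URIs, that the link directions match the edge set E shown above, and that a hypothetical PREFIX_TO_LABEL table maps each URI prefix to a data set label.

```python
from collections import defaultdict

# Hypothetical mapping from URI prefix to data set label (an assumption).
PREFIX_TO_LABEL = {"db": "D", "geo": "G", "nyt": "N", "mdb-country": "M"}


def label_of(uri):
    """Return the data set label for a prefixed URI such as 'db:Austria'."""
    prefix = uri.split(":", 1)[0]
    return PREFIX_TO_LABEL[prefix]


def sameas_graph(seed, links):
    """Build SG = (V, E, I) for all instances reachable from `seed`."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Collect every instance connected to the seed through sameAs links.
    instances, stack = {seed}, [seed]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in instances:
                instances.add(nxt)
                stack.append(nxt)
    vertices = {label_of(u) for u in instances}
    edges = {(label_of(a), label_of(b)) for a, b in links
             if a in instances and b in instances}
    return vertices, edges, instances


# sameAs links of the Austria example (directions assumed to follow E).
links = [
    ("db:Austria", "geo:2782113"),
    ("db:Austria", "nyt:66221058161318373601"),
    ("geo:2782113", "nyt:66221058161318373601"),
    ("geo:2782113", "mdb-country:AT"),
]
V, E, I = sameas_graph("db:Austria", links)
```

Two SameAs Graphs with the same V and E would then count as instances of the same graph pattern.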
10. Step 2: <Predicate, Object> Collection
11. <Predicate, Object> Collection
An instance is described by a collection of <subject, predicate, object> triples.
(instance URI → subject, property → predicate, class → object)
Collect the <predicate, object> (PO) pairs as the content of a SameAs Graph.
Classify PO pairs into five types
Class: rdf:type and skos:inScheme.
Date: XMLSchema:date, gYear, gMonthDay, etc.
Number: XMLSchema:integer, int, float, double, etc.
URI: starts with “http://” and XMLSchema:anyURI.
String: XMLSchema:string and Others.
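The five-way classification above can be sketched as a small function. This is a simplified illustration: the datatype sets contain only the types named on the slide (the "etc." entries are not filled in), and only rdf:type and skos:inScheme are treated as class predicates. The function name po_type and its signature are hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Only the datatypes named on the slide; a real list would be longer.
DATE_TYPES = {XSD + t for t in ("date", "gYear", "gMonthDay")}
NUMBER_TYPES = {XSD + t for t in ("integer", "int", "float", "double")}
CLASS_PREDICATES = {"rdf:type", "skos:inScheme"}


def po_type(predicate, obj, datatype=None):
    """Classify a <predicate, object> pair as Class, Date, Number, URI, or String."""
    if predicate in CLASS_PREDICATES:
        return "Class"
    if datatype in DATE_TYPES:
        return "Date"
    if datatype in NUMBER_TYPES:
        return "Number"
    if datatype == XSD + "anyURI" or obj.startswith("http://"):
        return "URI"
    # xsd:string and everything else fall back to String.
    return "String"


examples = [
    po_type("rdf:type", "db-onto:Country"),
    po_type("db-prop:populationEstimate", "8356707", XSD + "integer"),
    po_type("db-onto:wikiPageExternalLink", "http://www.austria.mu/"),
    po_type("rdfs:label", "Austria"),
]
```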
12. An Example of Collected PO pairs
Table: PO pairs and types for SG_Austria
Predicate                       Object                      Type
rdf:type                        owl:Thing                   Class
rdf:type                        db-onto:Place               Class
rdf:type                        db-onto:PopulatedPlace      Class
rdf:type                        db-onto:Country             Class
rdfs:label                      “Austria”@en                String
db-onto:wikiPageExternalLink    http://www.austria.mu/      URI
db-prop:populationEstimate      8356707                     Number
......                          ......                      ......
geo-onto:name                   Austria                     String
geo-onto:alternateName          “Austria”@en                String
geo-onto:alternateName          “Republic of Austria”@en    String
geo-onto:featureClass           geo-onto:A                  Class
geo-onto:featureCode            geo-onto:A.PCLI             Class
geo-onto:population             8205000                     Number
......                          ......                      ......
rdf:type                        mdb:country                 Class
mdb:country_name                Austria                     String
......                          ......                      ......
skos:inScheme                   nyt:nytd_geo                Class
skos:prefLabel                  “Austria”@en                String
nyt-prop:first_use              2004-10-04                  Date
13. Step 3: Related Classes and Predicates Grouping
14. Related Classes Grouping
Group related classes from each SameAs Graph by tracking the
subsumption relations rdfs:subClassOf and skos:inScheme.
< C1 rdfs:subClassOf C2 > or < C1 skos:inScheme C2 > means that the
concept of class C1 is more specific than the concept of class C2.
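One possible reading of this grouping step is to put classes into the same group whenever they are connected through subsumption relations. The sketch below uses a union-find structure for that; the function name group_classes is hypothetical, the example pairs are illustrative, and the authors' actual grouping procedure may differ.

```python
def group_classes(classes, subsumptions):
    """Group classes connected by (specific, general) subsumption pairs."""
    parent = {c: c for c in classes}

    def find(c):
        # Follow parent links to the group representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for specific, general in subsumptions:
        if specific in parent and general in parent:
            parent[find(specific)] = find(general)

    groups = {}
    for c in classes:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())


# Classes observed in one SameAs Graph (illustrative subset).
classes = [
    "owl:Thing", "db-onto:Place", "db-onto:PopulatedPlace",
    "db-onto:Country", "geo-onto:A", "geo-onto:A.PCLI",
]
# (C1, C2) pairs meaning "C1 is more specific than C2".
subsumptions = [
    ("db-onto:Country", "db-onto:PopulatedPlace"),
    ("db-onto:PopulatedPlace", "db-onto:Place"),
    ("db-onto:Place", "owl:Thing"),
    ("geo-onto:A.PCLI", "geo-onto:A"),
]
groups = group_classes(classes, subsumptions)
```

With these inputs the DBpedia classes form one group and the Geonames classes another.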
15. Related Predicates Grouping
Perform pairwise comparison on <predicate, object> (PO) pairs to
find out related predicates (properties).
Discover related predicates using different methods for the
types of Date, URI, Number, and String.
Date, URI: exact matching.
Number, String: exact matching + similarity matching.
Exact matching on PO pairs to create initial sets of PO pairs.
If $O_{PO_i} = O_{PO_j}$ or $P_{PO_i} = P_{PO_j}$, then $S_k \leftarrow \{PO_i, PO_j\}$
$O_{PO}$: the object of PO.
$P_{PO}$: the predicate of PO.
$S$: initial set of PO pairs.
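The exact-matching rule can be sketched as follows: a PO pair joins every existing set that already contains a pair with the same predicate or the same object, and the sets it touches are merged. This is a minimal illustration with hypothetical names; language tags such as @en are dropped for simplicity.

```python
def initial_sets(po_pairs):
    """Group PO pairs that share a predicate or an object (exact matching)."""
    sets = []
    for po in po_pairs:
        predicate, obj = po
        # Existing sets that contain a pair with the same predicate or object.
        hits = [s for s in sets
                if any(p == predicate or o == obj for p, o in s)]
        merged = {po}
        for s in hits:
            merged |= s
            sets.remove(s)
        sets.append(merged)
    return sets


po_pairs = [
    ("rdfs:label", "Austria"),
    ("geo-onto:name", "Austria"),
    ("skos:prefLabel", "Austria"),
    ("geo-onto:alternateName", "Austria"),
    ("geo-onto:alternateName", "Republic of Austria"),
    ("nyt-prop:first_use", "2004-10-04"),
]
sets = initial_sets(po_pairs)
```

Here the five label-like pairs end up in one initial set (the shared object "Austria" and the shared predicate geo-onto:alternateName chain them together), while the date pair stays on its own.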
16. Related Predicates Grouping
Similarity matching on PO pairs of type Number and String.
Similarity between $PO_i$ and $PO_j$:
$$Sim(PO_i, PO_j) = \frac{ObjSim(PO_i, PO_j) + PreSim(PO_i, PO_j)}{2}$$
Merge similar initial sets $S_i$ and $S_j$
if $Sim(PO_i, PO_j) \geq \theta$, where $PO_i \in S_i$ and $PO_j \in S_j$.
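The merge rule can be sketched as below. The ObjSim and PreSim scores are supplied as a lookup table of made-up values, the third PO pair is hypothetical, and the threshold 0.8 is an assumption, since the slides do not state the value of θ.

```python
def average_sim(obj_score, pre_score):
    """Sim = (ObjSim + PreSim) / 2."""
    return (obj_score + pre_score) / 2


def merge_similar_sets(sets, pair_sim, theta):
    """Merge sets S_i, S_j while some PO_i in S_i, PO_j in S_j has Sim >= theta."""
    merged = [set(s) for s in sets]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if any(pair_sim(a, b) >= theta
                       for a in merged[i] for b in merged[j]):
                    merged[i] |= merged.pop(j)
                    changed = True
                    break
            if changed:
                break
    return merged


a = ("db-prop:populationEstimate", 8356707)
b = ("geo-onto:population", 8205000)
c = ("ex:elevation", 3798)  # hypothetical pair, not from the slides

# Made-up (ObjSim, PreSim) scores for each unordered pair.
scores = {
    frozenset((a, b)): (0.99, 0.90),
    frozenset((a, c)): (0.00, 0.10),
    frozenset((b, c)): (0.00, 0.10),
}


def pair_sim(x, y):
    return average_sim(*scores[frozenset((x, y))])


result = merge_similar_sets([{a}, {b}, {c}], pair_sim, theta=0.8)
```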
17. Related Predicates Grouping
Similarity of objects between two PO pairs.
$$ObjSim(PO_i, PO_j) = \begin{cases} 1 - \dfrac{|O_{PO_i} - O_{PO_j}|}{O_{PO_i} + O_{PO_j}} & \text{if } O_{PO} \text{ is Number} \\ StrSim(O_{PO_i}, O_{PO_j}) & \text{if } O_{PO} \text{ is String} \end{cases}$$
$O_{PO}$: the object of PO.
$StrSim(O_{PO_i}, O_{PO_j})$: the average of the three string-based similarity
values Jaro-Winkler, Levenshtein distance, and n-gram.
Similarity of predicates between $PO_i$ and $PO_j$
$$PreSim(PO_i, PO_j) = WNSim(T_{PO_i}, T_{PO_j})$$
$T_{PO}$: the pre-processed terms of the predicates in PO.
$WNSim(T_{PO_i}, T_{PO_j})$: the average of the nine applied WordNet-based
similarity values.
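A rough sketch of ObjSim follows. The number branch implements the formula above. The string branch is deliberately simplified: it averages only a normalized Levenshtein similarity and a bigram Dice coefficient, Jaro-Winkler is omitted for brevity, and both normalizations are assumptions rather than the measures the authors used. PreSim is not sketched, since the nine WordNet-based measures are not named on the slides.

```python
def number_sim(a, b):
    """1 - |a - b| / (a + b), intended for positive numbers."""
    if a + b == 0:
        return 1.0 if a == b else 0.0
    return 1 - abs(a - b) / (a + b)


def levenshtein(s, t):
    """Edit distance between s and t (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_sim(s, t):
    """Edit distance normalized to [0, 1] (normalization is an assumption)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest


def bigram_sim(s, t):
    """Dice coefficient over character bigrams (one possible n-gram measure)."""
    a = {s[i:i + 2] for i in range(len(s) - 1)}
    b = {t[i:i + 2] for i in range(len(t) - 1)}
    if not a and not b:
        return 1.0 if s == t else 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def str_sim(s, t):
    """Simplified StrSim: average of two measures instead of the slide's three."""
    return (levenshtein_sim(s, t) + bigram_sim(s, t)) / 2


def obj_sim(o_i, o_j):
    """ObjSim: number formula for numeric objects, StrSim otherwise."""
    if isinstance(o_i, (int, float)) and isinstance(o_j, (int, float)):
        return number_sim(o_i, o_j)
    return str_sim(str(o_i), str(o_j))


population_sim = obj_sim(8356707, 8205000)
name_sim = obj_sim("Austria", "Republic of Austria")
```

For the two Austrian population values from the PO table, the number branch gives a similarity of about 0.99.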
18. Step 4: Integration for All Graph Patterns
19. Integration for All Graph Patterns
Groups of related classes and predicates are independent for each
graph pattern. Hence, we integrate them for all the graph patterns
to construct an integrated ontology.
Select terms for integrated ontology.
ex-onto:ClassTerm: select one concept from a set of classes.
ex-prop:propTerm: select one concept from a set of predicates.
Construct relations.
ex-prop:hasMemberClasses: link sets of classes with
ex-onto:ClassTerm.
ex-prop:hasMemberDataTypes: link sets of predicates with
ex-prop:propTerm.
Construct an integrated ontology.
Sets of related classes and predicates.
Selected terms: ClassTerm and propTerm.
Constructed relations.
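The term selection and relation construction can be sketched as below. The slides do not say how a term is chosen from a set, so the rule used here (the most common local name, ignoring case) is purely an assumption, and the helper names are hypothetical.

```python
from collections import Counter


def local_name(term):
    """Return the part after the prefix, e.g. 'Country' for 'db-onto:Country'."""
    return term.split(":", 1)[1]


def select_class_term(members):
    """Pick a term for a class set (assumed rule: most common local name)."""
    counts = Counter(local_name(m).lower() for m in members)
    name = counts.most_common(1)[0][0]
    return "ex-onto:" + name[0].upper() + name[1:]


def member_triples(term, members, relation):
    """Link the selected term to every member of its set."""
    return [(term, relation, m) for m in sorted(members)]


country_set = {
    "db-onto:Country", "geo-onto:A.PCLI", "mdb:country", "nyt:nytd_geo",
}
term = select_class_term(country_set)
triples = member_triples(term, country_set, "ex-prop:hasMemberClasses")
```

The same pattern would apply to predicate sets, with ex-prop:hasMemberDataTypes as the relation.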
20. Step 5: Manual Revision
21. Manual Revision
Minor revision process on the automatically constructed ontology.
Modify incorrect terms
Not all the terms of classes and predicates are properly selected.
Add domain information
About 40% of the predicate sets lack rdfs:domain information.
Modify incorrectly grouped classes and predicates
We cannot guarantee 100% accuracy.
22. Experiments
Analyze the characteristics of linked instances with the integrated
ontology constructed with our approach.
Experimental Data
Graph Patterns of Linked Instances
Class-level Analysis
Predicate-level Analysis
23. Experimental Data
DBpedia: cross-domain, 3.5 million things, 8.9 million URIs.
Geonames: geographical domain, 7 million URIs.
NYTimes: media domain, 10,467 subject news.
LinkedMDB: media domain, 0.5 million entities.
24. Graph Patterns of Linked Instances
13 graph patterns (D = DBpedia, G = Geonames, N = NYTimes, M = LinkedMDB)
Frequent graph patterns: GP1, GP2, GP3
N,G,D: GP4, GP5, GP7, GP8
N,M,D: GP6
M,G,D: GP9
M,D,N,G: GP10, GP11, GP12, GP13
25. Class-level Analysis
Successfully integrated related classes from the extracted graph patterns.
Characteristics of graph patterns
Class Type                           Graph Pattern
Actor                                GP2, GP6
Person (Athlete, Politician, etc.)   GP3
Organization/Agent                   GP1, GP3, GP8
Film                                 GP2
City/Settlement                      GP1, GP4, GP5, GP7, GP8
Country                              GP9, GP10, GP11, GP12, GP13
Place (Mountain, River, etc.)        GP1, GP3, GP7
Integrated 97 classes into 48 groups
Example: ex-onto:Country groups db-onto:Country, geo-onto:A.PCLI,
mdb:country, and nyt:nytd_geo.
26. Class-level Analysis
Discover missing class information
Example: db:Shingo_Katori
db:Shingo_Katori rdf:type db-onto:MusicalArtist.
mdb-actor:27092 owl:sameAs db:Shingo_Katori.
Therefore, db:Shingo_Katori rdf:type db-onto:Actor.
Main classes of each data set.
NYTimes: person, organization, and place.
LinkedMDB: movie, actor, and country.
Geonames: A (country, administrative region), P (city, settlement), T
(mountain), S (building, school), and H (lake, river).
DBpedia: person (artist, politician, athlete), organization (company,
educational institute, sports team), work (film), and place (populated
place, natural place, architectural structure).
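The missing-class discovery illustrated with db:Shingo_Katori above can be sketched as follows. The sketch assumes that mdb-actor:27092 is typed as mdb:actor and that the integrated ontology places mdb:actor and db-onto:Actor in the same class group; neither is stated explicitly on the slide, and all names in the code are hypothetical.

```python
def infer_missing_classes(instance, types, same_as, equivalents):
    """Suggest classes for `instance` based on the classes of linked instances."""
    inferred = set()
    for other in same_as.get(instance, ()):
        for cls in types.get(other, ()):
            # Map the linked instance's class into this data set's vocabulary.
            inferred |= equivalents.get(cls, set())
    return inferred - types.get(instance, set())


types = {
    "db:Shingo_Katori": {"db-onto:MusicalArtist"},
    "mdb-actor:27092": {"mdb:actor"},  # assumed type
}
# owl:sameAs treated as symmetric.
same_as = {
    "db:Shingo_Katori": {"mdb-actor:27092"},
    "mdb-actor:27092": {"db:Shingo_Katori"},
}
# Assumed class group from the integrated ontology.
equivalents = {"mdb:actor": {"db-onto:Actor"}}

missing = infer_missing_classes("db:Shingo_Katori", types, same_as, equivalents)
```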
27. Predicate-level Analysis
Integrated 367 predicates into 38 groups
Example: ex-prop:birthDate
Predicate              Number of Instances
db-onto:birthDate      287,327
db-prop:datebirth      1,675
db-prop:dateofbirth    87,364
db-prop:dateOfBirth    163,876
db-prop:born           34,832
db-prop:birthdate      70,630
db-prop:birthDate      101,121
Recommend standard predicates
<db-onto:birthDate, rdfs:domain, db-onto:Person>
“db-onto:birthDate” has the highest frequency of usage
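Recommending the most frequently used predicate of a group amounts to a maximum over the usage counts. A small sketch using the counts from the ex-prop:birthDate example (variable names are illustrative):

```python
# Number of instances per predicate in the ex-prop:birthDate group.
usage = {
    "db-onto:birthDate": 287327,
    "db-prop:datebirth": 1675,
    "db-prop:dateofbirth": 87364,
    "db-prop:dateOfBirth": 163876,
    "db-prop:born": 34832,
    "db-prop:birthdate": 70630,
    "db-prop:birthDate": 101121,
}

# The predicate with the highest frequency of usage is recommended.
recommended = max(usage, key=usage.get)
```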
28. Comparison with Previous Work
Compare our ontology integration approach with the mid-ontology
approach [Zhao, et al., JIST2011].
Mid-Ontology approach                   Our approach
A hub data set for data collection.     No hub data set.
String-based similarity measures        Different similarity measures
for all types of objects.               for different types of objects.
105 predicates into 22 groups.          367 predicates into 38 groups.
No classes.                             97 classes into 48 groups.
29. Conclusion and Future Work
Conclusion
Integrate heterogeneous ontologies from various data sets.
Identify the characteristics of graph patterns using the integrated
ontology classes.
Recommend standard predicates using the integrated ontology
predicates.
Reduce the heterogeneity of ontologies.
Construct an integrated ontology without learning the entire ontology
schema.
Future Work
Use more data sets in the LOD cloud.
Apply MapReduce to address the scalability and ontology
heterogeneity problems.
30. Questions?
Lihua Zhao, lihua@nii.ac.jp
Ryutaro Ichise, ichise@nii.ac.jp