A method for identifying incorrect sameAs links on the Linked Open Data cloud
Details published in:
John Cuzzola, Ebrahim Bagheri, Jelena Jovanovic:
Filtering Inaccurate Entity Co-references on the Linked Open Data. DEXA (1) 2015: 128-143
1. Filtering Inaccurate Entity Co-references on the Linked Open Data
John Cuzzola, Jelena Jovanovic, Ebrahim Bagheri
bagheri@ryerson.ca
DEXA 2015
NEXT: Background
SCID: Semantic Co-reference Inaccuracy Detection - [ INTRODUCTION ]
2. The Linked Open Data (LOD)
cloud comprises hundreds of
datasets available throughout
the Web.
❖ 570 datasets and 2909 linkage
relationships between the datasets.1
1. http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/
NEXT: How are datasets linked?
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
3. To utilize the data from
multiple ontologies within
the LOD, “equivalence”
relationships between
concepts are necessary (i.e.,
the “edges” or linkages of
the LOD must be defined).
[Figure: the LOD cloud of 570 datasets and 2909 linkage
relationships, asking: what are the equivalence relationships
between DBpedia and Freebase?]
NEXT: The sameAs predicate
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
4. The equivalency relationship is often accomplished via the predicate owl:sameAs:
http://rdf.freebase.com/ns/en.dog <owl:sameAs> http://dbpedia.org/resource/Dog
NEXT: sameAs linkage mistakes
SCID: Semantic Co-reference Inaccuracy Detection - [ BACKGROUND ]
5. http://rdf.freebase.com/ns/en.bitch <owl:sameAs> http://dbpedia.org/resource/Dog
NOT the same!
ns:common.topic.description:
"Bitch, literally meaning a female
dog, is a common slang term in the
English language, especially used
as a denigrating term applied to a
person, commonly a woman”
dbo:abstract:
The domestic dog (Canis lupus
familiaris) is a usually furry,
carnivorous member of the canidae
family.
The Problem: There are many incorrect
LOD linkages using owl:sameAs.
The Effect: Incorrect (embarrassing)
assertions by reasoners that use the LOD.
Example:
(from http://www.sameas.org)
NEXT: SCID
SCID: Semantic Co-reference Inaccuracy Detection - [ PROBLEM / MOTIVATION ]
6. SCID: Semantic Co-reference Inaccuracy Detection
❖ A method of natural language
analysis for detecting incorrect
owl:sameAs assertions.
1. Construct a baseline comparison vector vb(x,Sx).
2. For each resource (1,2,...) claiming to be the “same”,
construct vectors v1(x1,Sx), v2(x2,Sx) …
3. Compare individual distances from v1(x1,Sx),
v2(x2,Sx) … to baseline vb(x,Sx)
4. Disregard those v1(x1,Sx), v2(x2,Sx) … that are
outside some threshold distance δ.
NEXT: The core functions of SCID.
UPCOMING: How are vb(x,Sx) and v1(x1,Sx), v2(x2,Sx) … constructed?
SCID: Semantic Co-reference Inaccuracy Detection - [ CONTRIBUTION ]
7. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
SCID depends on two key functions:
1. A category distribution function ρ(t,S).
Given some natural language text (t) and a set of “suitable” subject categories (S) for t,
compute a distribution vector of how t relates to each subject category of S.
2. A category selection function S(uri).
Given a resource (uri), return a “suitable” set of subject categories (S) that can be used
in ρ(t,S).
NEXT: The category distribution function.
UPCOMING: The category selection function.
8. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
The category distribution function:
ρ(t,S) .
Ex: Given input text (t) as shown and three
DBpedia subject categories of S=[Fruit, Oranges,
Color] ρ(t,S) produces output:
ρ(t,[Fruit, Oranges, Color]) = v1(x1,S)
= [ 0.27Fruit, 0.50Oranges, 0.22Color ]
NEXT: The category selection function
UPCOMING: How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
● Computes Rx,k, defined as the importance of word x to category k, for every word in t.
○ uses 5 features: (1) count of x in k, (2) count of x across all k, (3) count of concepts where
word x appears, (4) ratio of x in k to vocabulary of all k, (5) average word frequency of x per
resource in k.
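A much-simplified sketch of ρ(t,S): the real function scores each word with the five Rx,k features listed above, while here a per-category word-count lookup stands in for them; the text, categories, and counts are all invented for illustration.

```python
from collections import Counter

def rho(text, category_vocab):
    # Simplified category distribution function rho(t, S):
    # score each category by how strongly the words of t appear in its
    # vocabulary counts, then normalise the scores into a distribution.
    words = text.lower().split()
    scores = {cat: sum(vocab.get(w, 0) for w in words)
              for cat, vocab in category_vocab.items()}
    total = sum(scores.values()) or 1
    return {cat: s / total for cat, s in scores.items()}

# Toy per-category word counts standing in for the five R_{x,k} features.
S = {
    "Fruit":   Counter({"fruit": 3, "sweet": 2, "juice": 2}),
    "Oranges": Counter({"orange": 5, "citrus": 3, "juice": 1}),
    "Color":   Counter({"orange": 2, "hue": 2, "color": 3}),
}
t = "the orange is a citrus fruit prized for its juice"
dist = rho(t, S)  # Oranges receives the largest share of the distribution
```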
9. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
The category selection function: S(uri).
⇶ DBpedia contains 656,000+ category:subjects.
How do we select a few suitable for ρ(t,S)?
1. Begin with a candidate resource (uri):
http://dbpedia.org/resource/Orange_(fruit)
2. Find a DBpedia disambiguation page:
http://dbpedia.org/resource/Orange_(disambiguation)
3. Take the union of the subject categories for each of these resources.
Suri = [ category: { Optical_Spectrum, Oranges, Citrus_hybrids, Tropical_agriculture,
American_punk_rock, Rock_music, Hellcat_Records } ]
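The three steps above can be sketched as follows. A real implementation would query DBpedia (e.g., via SPARQL) for the dbo:wikiPageDisambiguates and dcterms:subject links; here small dictionaries stand in for those lookups, and all data values are illustrative.

```python
def select_categories(uri, subject_categories, disambiguation):
    # Category selection function S(uri): the union of the subject
    # categories of the candidate resource and of every sense listed
    # on its DBpedia disambiguation page.
    senses = {uri} | set(disambiguation.get(uri, []))
    S = set()
    for resource in senses:
        S |= set(subject_categories.get(resource, []))
    return S

# Toy stand-ins for DBpedia lookups (a real system would use SPARQL).
subject_categories = {
    "dbr:Orange_(fruit)":  ["category:Oranges", "category:Citrus_hybrids"],
    "dbr:Orange_(colour)": ["category:Optical_Spectrum"],
    "dbr:Orange_(band)":   ["category:Rock_music", "category:Hellcat_Records"],
}
disambiguation = {
    "dbr:Orange_(fruit)": ["dbr:Orange_(colour)", "dbr:Orange_(band)"],
}
S_uri = select_categories("dbr:Orange_(fruit)", subject_categories, disambiguation)
```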
NEXT: How do we compute v1(x1,S) for sameAs inaccuracy filtering?
UPCOMING: How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
[Figure: disambiguation graph. dbr:Orange_(colour) links to category:Optical_Spectrum; dbr:Orange_(fruit) to category:Oranges, category:Citrus_hybrids, category:Tropical_agriculture, …; dbr:Orange_(band) to category:American_punk_rock, category:Rock_music, category:Hellcat_Records.]
10. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
NEXT: How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
UPCOMING: Experimental results
http://www.sameas.org
● dbr:Port
● www.w3.org:synset-seaport-noun-1
● rdf.freebase.com:en.port
● sw.opencyc.org:Seaport
● rdf.freebase.com:River_port
● dbr:Bad_conduct
● rdf.freebase.com:en.military_discharge
● dbr:IVDP
● rdf.freebase.com:en.port_wine
How do we compute v1(t1,S), v2(t2,S), .. for sameAs inaccuracy filtering?
1. Start with a group of resources that are identified as sameAs:
Ex: http://dbpedia.org/resource/Port (dbr:Port)
2. Collect subject categories Sdbr:Port using category selection function.
3. For each of the sameAs resources, collect natural language text (t)
describing the resource. Collect (t) using DBpedia rdfs:comment,
Freebase ns:common.topic.description, and www.w3.org wn20schema:gloss.
4. Compute vectors v1(t1,Sdbr:Port), v2(t2,Sdbr:Port)..., t1 = rdfs:comment of
dbr:Port, t2 = ns:common.topic.description of rdf.freebase:River_port, …
using the category distribution function ρ(t,Sdbr:Port).
We now have the individual v1,2..(t1,2..,Sdbr:Port) vectors; we only need the base
vector vb(tb,S) for comparison.
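Step 3 depends on knowing which predicate carries the description text in each source. A minimal lookup helper, assuming the predicate names above; the function itself and its host-matching rule are ours:

```python
def description_predicate(uri):
    # Each LOD source exposes its natural-language description under a
    # different predicate; pick the right one from the resource's host.
    predicates = {
        "dbpedia.org":      "rdfs:comment",
        "rdf.freebase.com": "ns:common.topic.description",
        "www.w3.org":       "wn20schema:gloss",
    }
    for host, predicate in predicates.items():
        if host in uri:
            return predicate
    raise ValueError("no known description predicate for " + uri)

print(description_predicate("http://dbpedia.org/resource/Port"))  # -> rdfs:comment
```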
11. SCID: Semantic Co-reference Inaccuracy Detection - [ METHOD ]
NEXT: Experimental results
UPCOMING: Conclusion
How is baseline vector vb(x,Sx) computed and compared to v1(x1,S)?
1. Retrieve subject:categories of candidate resource from DBpedia
Ex: http://dbpedia.org/resource/Port (dbr:Port)
2. Find (all) other resources that use the categories of the candidate resource. Concatenate
rdfs:comment from all these resources (t).
3. Compute vb(t,Sdbr:Port) using category distribution function ρ(t,Sdbr:port).
● We now have the base vector vb(t,Sdbr:Port), which can be compared to the
individual sameAs vectors v1,2(x1,2,Sdbr:Port).
● We use the Pearson Correlation Coefficient (PCC) to compare vectors.
● Remove vectors whose PCC is less than threshold δ.
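Step 2 of the baseline construction can be sketched as below: gather every resource sharing at least one subject category with the candidate and concatenate their comments. The helper and all toy data are ours; a real system would retrieve this from DBpedia.

```python
def baseline_text(candidate_uri, categories_of, comments_of):
    # Baseline text t_b: concatenated rdfs:comment of every resource that
    # shares at least one subject category with the candidate resource.
    cats = set(categories_of[candidate_uri])
    parts = [comments_of[r] for r in sorted(categories_of)
             if set(categories_of[r]) & cats]
    return " ".join(parts)

# Toy stand-ins for DBpedia data (illustrative only).
categories_of = {
    "dbr:Port":   ["category:Ports_and_harbours", "category:Nautical_terms"],
    "dbr:Harbor": ["category:Ports_and_harbours"],
    "dbr:Guitar": ["category:String_instruments"],
}
comments_of = {
    "dbr:Port":   "A port is a maritime facility for loading ships.",
    "dbr:Harbor": "A harbor is a sheltered body of water for ships.",
    "dbr:Guitar": "A guitar is a fretted musical instrument.",
}
tb = baseline_text("dbr:Port", categories_of, comments_of)
```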
[From http://www.dbpedia.org: dbr:Port belongs to category:Nautical_terms and category:Ports_and_harbours.]
12. SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
NEXT: Experimental results continued.
UPCOMING: Conclusion
● We examined 7,690 resources obtained from the www.sameas.org database, covering five topics:
○ Animal, City, Person, Color, and Miscellaneous.
● We performed some data cleansing on these resources:
○ removal of duplicate resources (i.e., aliases/redirects), broken links, and redundant resources
(e.g., dbpedialite is a subset of DBpedia).
● After cleansing, 411 unique resources remained, with 251 errors identified by a human oracle
○ e.g., http://dbpedia.org/resource/Dog is not the same as http://rdf.freebase.com/ns/en.bitch
13. SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
● We computed individual vectors v(t,S) for all 411 resources, along with the
associated baseline comparison vectors.
● We computed the Pearson Correlation
between each v(t,S) and its baseline.
● Removed identity links based on thresholds
ranging from 0.0 to 0.90. F-score
calculated for each threshold used.
○ The original 411 resources contained
160 correct / 251 incorrect sameAs
links (0.560 F-score).
○ Thresholds (δ) of 0.50 and 0.60 gave
the best F-score.
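The 0.560 starting F-score follows from accepting every link as-is: 160 true positives, 251 false positives, and no false negatives. A quick check (the helper name is ours):

```python
def f_score(tp, fp, fn):
    # F1: harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Accepting all 411 links: 160 correct, 251 incorrect, none rejected.
print(round(f_score(160, 251, 0), 3))  # -> 0.56
```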
NEXT: Experimental results continued.
UPCOMING: Conclusion
14. SCID: Semantic Co-reference Inaccuracy Detection - [ EXPERIMENTATION ]
Scatter plot of F-score versus Pearson Correlation Coefficient for oracle-identified right (blue) and
wrong (red) identity links, with the threshold δ marked.
NEXT: Conclusion
15. SCID: Semantic Co-reference Inaccuracy Detection - [ CONCLUSION ]
-- END --
● In this presentation:
○ we introduced SCID: a technique for discovering inaccuracies in identity link assertions
(owl:sameAs).
○ Experimental results indicate SCID can identify incorrect identity link assertions and improve
the precision of an identity database (http://www.sameas.org).
● In the future:
○ Experimentation with identity links other than owl:sameAs (e.g., skos:closeMatch,
skos:exactMatch, owl:equivalentClass).
○ Experimentation with vector comparison methods other than Pearson Correlation (e.g., cosine
similarity, Euclidean distance, Spearman's rank correlation coefficient).