Links between datasets are an essential ingredient of Linked Open Data. Since creating links manually is expensive at large scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a higher-dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54 and an area under the ROC curve of up to 0.86.
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
1. 05/26/14 Heiko Paulheim 1
Identifying Wrong Links between Datasets
by Multi-dimensional Outlier Detection
Heiko Paulheim
Motivation
• Dataset interlinks can be wrong for many reasons
– Oversimplified heuristic generation (e.g., label equality)
– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)
– Concept drift of link targets
• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1
• now it's a disambiguation page
<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .
Overall Idea
• Links between datasets follow certain patterns
– e.g., linking a mo:MusicArtist to a dbo:Artist,
and a mo:MusicalWork to a dbo:Album or a dbo:Song
• Wrong links violate those patterns
• Hence, outlier detection should find wrong links
– Definition: “finding patterns in data that do not conform to the expected
normal behavior” (Chandola et al., 2009)
• Differences from related approaches
– does not require both datasets to use the same schema
– does not require schema mappings
– does not require external or human knowledge
Projection of Links into Vector Space
• Represent each link as a point in an n-dimensional vector space
– e.g., using their direct types
• Outliers are found in sparse areas
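
To make "sparse areas" concrete, the following is a minimal sketch that scores each link by its mean distance to its k nearest neighbours. This simple score is chosen here only for illustration and is not claimed to be one of the evaluated algorithms; the function name `knn_outlier_scores` and the toy vectors are hypothetical.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical toy data: four links share one feature pattern, the last deviates.
links = [
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (0, 1, 0, 1),
]
scores = knn_outlier_scores(links, k=2)
# Index of the link lying in the sparsest region of the space.
most_suspicious = max(range(len(links)), key=scores.__getitem__)
```

In this toy setting the four identical links have a score of zero, while the deviating link lies far from all others and receives the highest score.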
Projection of Links into Vector Space
• Types
– each type of LHS and RHS resource becomes a binary (0/1) feature
– types on both sides are treated separately
• i.e., LHS_foaf:person and RHS_foaf:person
are distinct features
• Properties
– each ingoing/outgoing property of LHS and RHS resource
becomes a binary (0/1) feature
– properties on both sides are treated separately
– ingoing and outgoing properties are treated separately
• i.e., LHS_foaf:based_near, RHS_foaf:based_near,
foaf:based_near_LHS and foaf:based_near_RHS
are all distinct features
• Joint feature set of types and properties
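
The projection described above could be sketched roughly as follows. The input layout (dicts with `types`, `out`, and `in` keys) and the function names are assumptions made for this example, as is the reading that a side prefix marks an outgoing property and a side suffix marks an ingoing one; the slides only state that the four variants are distinct features.

```python
def link_features(lhs, rhs):
    """Return the names of all binary features that are 1 for a single link.

    lhs / rhs are dicts with optional keys 'types', 'out' (outgoing
    properties) and 'in' (ingoing properties). This layout is assumed.
    """
    feats = set()
    for side, res in (("LHS", lhs), ("RHS", rhs)):
        feats |= {f"{side}_{t}" for t in res.get("types", ())}
        # Assumption: prefix = outgoing property, suffix = ingoing property.
        feats |= {f"{side}_{p}" for p in res.get("out", ())}
        feats |= {f"{p}_{side}" for p in res.get("in", ())}
    return feats

def vectorize(links):
    """Turn (lhs, rhs) pairs into a 0/1 matrix over the joint feature set."""
    per_link = [link_features(lhs, rhs) for lhs, rhs in links]
    columns = sorted(set().union(*per_link))
    matrix = [[1 if c in feats else 0 for c in columns] for feats in per_link]
    return columns, matrix

# Hypothetical example links.
example_links = [
    ({"types": ["mo:MusicArtist"], "out": ["foaf:name"]},
     {"types": ["dbo:Artist"], "in": ["dbo:artist"]}),
    ({"types": ["mo:MusicArtist"], "out": ["foaf:name"]},
     {"types": ["dbo:Album"]}),
]
columns, matrix = vectorize(example_links)
```

Each row of `matrix` is one link, and each column is one feature from the joint set of types and properties.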
Experiments
• Datasets: link sets between
– BBC Peel Sessions and DBpedia (2,087 links)
– DBTropes and DBpedia (4,229 links)
• Gold standard
– 100 randomly sampled links from each set, manually evaluated
– Peel: 90 out of 100 are correct
– Tropes: 76 out of 100 are correct
• We run outlier detection on the whole link set
– and validate the output only on the gold standard
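
As a reference point for reading F-measure values, the gold-standard counts above imply what a trivial detector that flags every link as wrong would achieve. The short calculation below is a sketch of that arithmetic only; the function name is hypothetical.

```python
def f1_flag_everything(n_wrong, n_total):
    """F1 of a detector that flags every link as wrong."""
    precision = n_wrong / n_total  # only the truly wrong links are hits
    recall = 1.0                   # every wrong link is flagged
    return 2 * precision * recall / (precision + recall)

# Wrong links in each 100-link sample: 100 - 90 for Peel, 100 - 76 for DBTropes.
peel_baseline = f1_flag_everything(100 - 90, 100)
tropes_baseline = f1_flag_everything(100 - 76, 100)
```

This gives roughly 0.18 for Peel and 0.39 for DBTropes, so the two datasets start from quite different baselines.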
Experiments
• Outlier Detection Approaches
– assign a score (or label) to each data point
– the higher the score, the likelier it is an outlier
• Evaluation
– Ordering descending by outlier score
– Ideally, all outliers are above all non-outliers
– Plot a ROC curve and measure quality by the area under it (AUC)
– Report the F-measure at the best possible threshold
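
The two evaluation measures could be computed along the following lines. This is a plain-Python sketch and not the RapidMiner setup used for the experiments; the function names and the toy scores are illustrative assumptions.

```python
def roc_auc(scores, is_wrong):
    """Chance that a random wrong link outscores a random correct one (ties count 0.5)."""
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1(scores, is_wrong):
    """Maximum F1 over all thresholds t, flagging every link with score >= t."""
    total_wrong = sum(is_wrong)
    best = 0.0
    for t in set(scores):
        flagged = [w for s, w in zip(scores, is_wrong) if s >= t]
        tp = sum(flagged)  # flagged links that are truly wrong
        if tp:
            precision, recall = tp / len(flagged), tp / total_wrong
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Hypothetical outlier scores for six gold-standard links, two of them wrong.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
is_wrong = [True, False, True, False, False, False]
auc = roc_auc(scores, is_wrong)
f1 = best_f1(scores, is_wrong)
```

An AUC of 1.0 would mean that all wrong links are ranked above all correct ones, which is the ideal ordering mentioned above.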
Results
• Type features work better than property features
• LoOP delivers reliably good results
– though not the best
• Best performance on Peel dataset
– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
• Best performance on DBTropes dataset
– LOF (F1 = 0.5, AUC = 0.619)
Results
• ROC curves for Peel dataset
[Figure: ROC curves, both axes ranging from 0 to 1. Curves: GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, 1-class SVM]
Note: GAS k=10,25,50 identical, LoOP k=25,50 identical
Results
• ROC curves for DBTropes dataset
[Figure: ROC curves, both axes ranging from 0 to 1. Curves: GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, 1-class SVM]
Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical,
CBLOF and LDCOF mostly identical
Runtimes
• Most outlier detection algorithms are reasonably fast
– both linksets processed in less than 10 seconds on a normal laptop
• Exceptions:
– clustering (for CBLOF/LDCOF) takes up to 30 seconds
– 1-class SVM takes up to 15 minutes
• ...but creating the feature vector representation
takes much more time
– some hours against public SPARQL endpoint(s)
– reasonably fast with downloaded dumps
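
One plausible reason for the long feature-creation time is the number of requests involved. The sketch below builds the kind of SPARQL queries that could retrieve the types and the ingoing and outgoing properties of one resource. The query shapes and the function name are assumptions for illustration; the slides do not show the queries actually used.

```python
def feature_queries(uri):
    """Build SPARQL queries for the types, outgoing and ingoing properties of a resource."""
    return {
        "types": f"SELECT DISTINCT ?t WHERE {{ <{uri}> a ?t }}",
        "out": f"SELECT DISTINCT ?p WHERE {{ <{uri}> ?p ?o }}",
        "in": f"SELECT DISTINCT ?p WHERE {{ ?s ?p <{uri}> }}",
    }

# Example with the link target from the motivation slide.
queries = feature_queries("http://dbpedia.org/resource/Prong")
```

Under this scheme each link needs up to six requests (three per side), which adds up quickly against a public endpoint and is much cheaper against a local dump.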
Discussion of Results
• Results on Peel dataset better than on DBTropes dataset
• Projection based on types better than on properties
– most likely due to lower dimensionality of vector space
– Peel: #types = 34, #properties = 60
– DBTropes: #types = 81, #properties = 142
• Variation of outlier detection algorithms across datasets
– also observed in other experiments
– general rules of thumb are hard to come up with
Possible Improvements & Future Work
• Other projection methods
– e.g., using numeric counts of relations
• Other outlier detection algorithms
– e.g., Replicating Neural Networks and their generalizations
• Preprocessing
– e.g., Feature Subset Selection
– caveat: the valuable features are often sparse
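
A count-based projection, as suggested above, might replace each 0/1 property feature with the number of times the property occurs. The following is only a sketch of that idea for outgoing properties; the function name and the `(property, object)` input format are hypothetical.

```python
from collections import Counter

def count_features(side, outgoing_triples):
    """Count how often each outgoing property occurs for one resource.

    outgoing_triples is a list of (property, object) pairs; side is 'LHS' or 'RHS'.
    """
    counts = Counter(prop for prop, _ in outgoing_triples)
    return {f"{side}_{prop}": n for prop, n in counts.items()}

# Hypothetical resource with two genre statements and one name.
features = count_features("RHS", [
    ("dbo:genre", "dbr:Heavy_metal"),
    ("dbo:genre", "dbr:Groove_metal"),
    ("foaf:name", "Prong"),
])
```

Numeric features of this kind would probably need scaling before distance-based outlier detection, since raw counts can dominate the binary type features.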
Possible Improvements & Future Work
• So far, we have looked at owl:sameAs links
• The approach is not limited to that
– should work for other link predicates as well
– e.g., a dataset of persons and a dataset of places
– linked by foaf:based_near
• It is not even limited to linksets
– also for debugging statements inside a knowledge base
– e.g., dbpedia-owl:deathPlace