Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Heiko Paulheim
05/26/14

Motivation
• Dataset interlinks can be wrong for many reasons
– Oversimplified heuristic generation (e.g., label equality)
– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)
– Concept drift of link targets
• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1
• now it's a disambiguation page
<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .

Overall Idea
• Links between datasets follow certain patterns
– e.g., linking a mo:MusicArtist to a dbo:Artist,
and a mo:MusicalWork to a dbo:Album or a dbo:Song
• Wrong links violate those patterns
• Hence, outlier detection should find wrong links
– Definition: “finding patterns in data that do not conform to the expected
normal behavior” (Chandola et al., 2009)
• Differences from related approaches
– does not require the same schema used in both datasets
– nor schema mappings
– no external/human knowledge required

Projection of Links into Vector Space
• Represent each link as a point in an n-dimensional vector space
– e.g., using their direct types
• Outliers are found in sparse areas
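
To illustrate "outliers are found in sparse areas", the following is a minimal sketch of a distance-based outlier score on binary link vectors. It is not one of the algorithms evaluated in this talk; the function names (`hamming`, `knn_outlier_scores`), the choice of Hamming distance, and the toy vectors are assumptions made for illustration only.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_outlier_scores(points: Sequence[Sequence[int]], k: int = 2) -> List[float]:
    """Score each point by its mean distance to its k nearest neighbours.

    Points in sparse areas have distant neighbours and thus high scores.
    Assumes at least two points.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances to all other points, smallest first.
        dists = sorted(hamming(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


# Hypothetical toy data: three links sharing one pattern, one deviating link.
points = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],  # violates the common pattern
]
scores = knn_outlier_scores(points, k=2)
most_suspicious = max(range(len(points)), key=lambda i: scores[i])
```

In this toy setting, the link whose vector differs from all others receives the highest score and would be inspected first.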

Projection of Links into Vector Space
• Types
– each type of LHS and RHS resource becomes a binary (0/1) feature
– types on both sides are treated separately
• i.e., LHS_foaf:Person and RHS_foaf:Person
are distinct features
• Properties
– each ingoing/outgoing property of LHS and RHS resource
becomes a binary (0/1) feature
– properties on both sides are treated separately
– ingoing and outgoing properties are treated separately
• i.e., LHS_foaf:based_near, RHS_foaf:based_near,
foaf:based_near_LHS and foaf:based_near_RHS
are all distinct features
• Joint feature set of types and properties
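
The projection described above can be sketched roughly as follows. This is an illustrative sketch, not the original implementation: the dictionary layout of a resource, the helper names (`link_features`, `to_matrix`), and the toy resources are hypothetical. It also assumes that a side prefix (e.g., LHS_foaf:based_near) marks outgoing properties and a side suffix (e.g., foaf:based_near_LHS) marks ingoing ones; the slides do not state which is which.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical resource description with keys "types", "out", "in".
Resource = Dict[str, Set[str]]


def link_features(lhs: Resource, rhs: Resource) -> Set[str]:
    """Return the names of all binary features that are 1 for one link."""
    feats: Set[str] = set()
    for side, res in (("LHS", lhs), ("RHS", rhs)):
        # Types: the side prefix keeps LHS and RHS types distinct.
        feats |= {f"{side}_{t}" for t in res.get("types", set())}
        # Outgoing properties (assumed naming: side as prefix).
        feats |= {f"{side}_{p}" for p in res.get("out", set())}
        # Ingoing properties (assumed naming: side as suffix).
        feats |= {f"{p}_{side}" for p in res.get("in", set())}
    return feats


def to_matrix(links: List[Tuple[Resource, Resource]]) -> Tuple[List[str], List[List[int]]]:
    """Build the joint 0/1 feature matrix, one row per link."""
    feature_sets = [link_features(lhs, rhs) for lhs, rhs in links]
    columns = sorted(set().union(*feature_sets))
    matrix = [[1 if c in fs else 0 for c in columns] for fs in feature_sets]
    return columns, matrix


# Toy data, invented for illustration.
artist = {"types": {"mo:MusicArtist"}, "out": {"foaf:name"}, "in": set()}
dbp_artist = {"types": {"dbo:Artist"}, "out": {"dbo:genre"}, "in": {"dbo:artist"}}
dbp_disambiguation = {"types": set(), "out": {"dbo:wikiPageDisambiguates"}, "in": set()}

columns, matrix = to_matrix([(artist, dbp_artist), (artist, dbp_disambiguation)])
```

Restricting `link_features` to the type lines only (or the property lines only) would give the separate type and property feature sets compared in the experiments.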

Experiments
• Datasets: link sets between
– BBC Peel Sessions and DBpedia (2,087 links)
– DBTropes and DBpedia (4,229 links)
• Gold standard
– 100 randomly sampled links from each set, manually evaluated
– Peel: 90 out of 100 are correct
– Tropes: 76 out of 100 are correct
• We run outlier detection on the whole link set
– and validate the output only on the gold standard

Experiments
• Outlier Detection Approaches
– assign a score (or label) to each data point
– the higher the score, the more likely the point is an outlier
• Evaluation
– Ordering descending by outlier score
– Ideally, all outliers are above all non-outliers
– Plot a ROC curve and measure quality by the area under the curve (AUC)
– F-Measure
• with best possible threshold
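
The two measures above can be computed from outlier scores and gold-standard labels along the following lines. This is a minimal sketch under assumptions: wrong links are treated as the positive class, the function names `roc_auc` and `best_f1` are hypothetical, and the scores and labels are made-up toy values, not results from the experiments.

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], is_outlier: Sequence[bool]) -> float:
    """AUC as the probability that a random outlier is scored higher
    than a random non-outlier (ties count as one half)."""
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one non-outlier")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1(scores: Sequence[float], is_outlier: Sequence[bool]) -> float:
    """F1 at the best possible threshold (points with score >= threshold
    are predicted to be outliers)."""
    best = 0.0
    for threshold in set(scores):
        pairs = list(zip(scores, is_outlier))
        tp = sum(1 for s, y in pairs if s >= threshold and y)
        fp = sum(1 for s, y in pairs if s >= threshold and not y)
        fn = sum(1 for s, y in pairs if s < threshold and y)
        if tp == 0:
            continue  # precision and recall are both 0 here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: five links from a gold standard, two of them wrong.
scores: List[float] = [0.9, 0.8, 0.4, 0.3, 0.1]
labels: List[bool] = [True, False, True, False, False]
auc = roc_auc(scores, labels)
f1 = best_f1(scores, labels)
```

Because the threshold is chosen with knowledge of the labels, the resulting F1 is an upper bound on what a fixed threshold would achieve.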

Results
• Type features work better than property features
• LoOP delivers reliably good results
– though not the best
• Best performance on Peel dataset
– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
• Best performance on DBTropes dataset
– LOF (F1 = 0.5, AUC = 0.619)

Results
• ROC curves for Peel dataset
[Figure: ROC curves, both axes from 0 to 1, for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM]
Note: GAS k=10,25,50 identical, LoOP k=25,50 identical

Results
• ROC curves for DBTropes dataset
[Figure: ROC curves, both axes from 0 to 1, for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM]
Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical;
CBLOF and LDCOF mostly identical

Runtimes
• Most outlier detection algorithms are reasonably fast
– both linksets processed in less than 10 seconds on a normal laptop
• Exceptions:
– clustering (for CBLOF/LDCOF) takes up to 30 seconds
– 1-class SVM takes up to 15 minutes
• ...but creating the feature vector representation
takes much more time
– some hours against public SPARQL endpoint(s)
– reasonably fast with downloaded dumps

Discussion of Results
• Results on Peel dataset better than on DBTropes dataset
• Projection based on types better than on properties
– most likely due to lower dimensionality of vector space
• Peel: #types = 34, #properties = 60
• DBTropes: #types = 81, #properties = 142
• Variation of outlier detection algorithms across datasets
– also observed in other experiments
– general rules of thumb are hard to come up with

Possible Improvements & Future Work
• Other projection methods
– e.g., using numeric counts of relations
• Other outlier detection algorithms
– e.g., Replicating Neural Networks and their generalizations
• Preprocessing
– e.g., Feature Subset Selection
– caveat: the valuable features are often sparse

Possible Improvements & Future Work
• So far, we have looked at owl:sameAs links
• The approach is not limited to that
– should work for other link predicates as well
– e.g., a dataset of persons and a dataset of places
– linked by foaf:based_near
• It is not even limited to linksets
– also for debugging statements inside a knowledge base
– e.g., dbpedia-owl:deathPlace
