Type Inference on Noisy RDF Data

Heiko Paulheim, Christian Bizer
10/31/13

The Problem
• One promise of the Semantic Web:
– You can issue structured queries
– e.g., "List all presidents that graduated from Harvard Law School"
– SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }


The Problem
• SELECT ?x WHERE { ... }
• ...if we run this against DBpedia, we get one result
– i.e., Elwell Stephen Otis
• But...



The Problem
• So what is going wrong?
• SELECT ?x WHERE { ... }
• In DBpedia, Barack Obama is not of type President!
• How can we add missing types?


Is It a Big Problem?
• DBpedia has at least 2.7 million missing type statements
– w.r.t. the DBpedia ontology
– found using co-occurrence analysis of matching classes in YAGO and DBpedia
– a very optimistic lower bound
• Highly incomplete classes:
– Species: >870,000 missing statements
– Person: >510,000 missing statements
– Event: >150,000 missing statements


A Naive Approach
• Idea: exploit properties with domain and range
• Pseudo RDFS reasoning:
– CONSTRUCT { ?x a ?t }
  WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
          UNION
          { ?y ?r ?x . ?r rdfs:range ?t } }
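The domain/range rule above can be sketched in a few lines of Python. This is only an illustration on toy data: the `DOMAIN` and `RANGE` tables, the shortened property names, and the `naive_types` helper are assumptions made for the example, not part of the DBpedia ontology files or of the authors' code. The second triple mirrors the noisy statement about Kurt H. Debus that the deck discusses later.

```python
# Hypothetical schema fragment (illustrative only, not read from DBpedia).
DOMAIN = {"almaMater": "Person", "award": "Person"}
RANGE = {"almaMater": "EducationalInstitution", "award": "Award"}


def naive_types(triples, domain=DOMAIN, range_=RANGE):
    """Apply the rdfs:domain and rdfs:range rules once to every triple."""
    inferred = {}
    for s, p, o in triples:
        if p in domain:
            # Subject of p gets the domain class of p.
            inferred.setdefault(s, set()).add(domain[p])
        if p in range_:
            # Object of p gets the range class of p.
            inferred.setdefault(o, set()).add(range_[p])
    return inferred


triples = [
    ("Barack_Obama", "almaMater", "Harvard_Law_School"),
    ("Kurt_H._Debus", "award", "Germany"),  # one noisy statement
]
types = naive_types(triples)
for resource, classes in sorted(types.items()):
    print(resource, sorted(classes))
```

With this toy input, the single noisy `award` statement is enough to type `Germany` as `Award`, because the rule treats every statement as equally trustworthy.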

10/31/13


7

A Naive Approach
• Experiment with Barack Obama:
– Person, PersonFunction, Actor, Organisation
• Experiment with Germany:
– Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation


A Naive Approach
• What is going on here?
– DBpedia data is noisy
– One wrong statement is enough for a wrong conclusion
– e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
• Germany example: 69,000 statements
– 20 wrong types can come from 20 wrong statements
– i.e., an error rate of 0.03% is enough for a completely wrong result
– ...but that would be excellent data quality for a LOD source!
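The error rate quoted for the Germany example follows directly from the two numbers on the slide. A quick check, using only those two figures (the variable names are chosen for this example):

```python
wrong_statements = 20        # one wrong statement per wrong type
total_statements = 69_000    # statements involving Germany

error_rate = wrong_statements / total_statements
print(f"error rate: {error_rate:.4%}")
```

The ratio is roughly 0.029%, which rounds to the 0.03% stated above.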


SDType Approach
•

Idea: outgoing/incoming properties are indicators
for a resource's type
– e.g.: starring → Movie
– e.g.: author-1 → Writer

•

Basic compiled statistics
– P(C|p) := probability of class C in presence of property p
– e.g.: P(dbpedia:Film|starring) = 0.79
– e.g.: P(dbpedia:Writer|author-1) = 0.44
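One way such statistics could be compiled is sketched below. The slides do not say how the probabilities are counted, so this sketch assumes P(C|p) is the share of resources carrying property p that have type C, and it writes incoming properties with a `^-1` suffix. The function name, the toy triples, and the toy type assignments are all hypothetical.

```python
from collections import Counter, defaultdict


def conditional_type_probabilities(triples, types):
    """Estimate P(C|p) as the share of resources with property p that have type C.

    Assumption: untyped resources still count in the denominator.
    """
    resources_with = defaultdict(set)
    for s, p, o in triples:
        resources_with[p].add(s)           # p is outgoing for the subject
        resources_with[p + "^-1"].add(o)   # p is incoming for the object
    probs = {}
    for p, resources in resources_with.items():
        counts = Counter(c for r in resources for c in types.get(r, ()))
        probs[p] = {c: n / len(resources) for c, n in counts.items()}
    return probs


# Toy data, invented for illustration.
triples = [
    ("Film_A", "starring", "Actor_X"),
    ("Film_B", "starring", "Actor_Y"),
    ("Book_A", "author", "Writer_X"),
    ("Paper_A", "author", "Scientist_Y"),
]
types = {
    "Film_A": {"Film"},
    "Film_B": {"Film"},
    "Writer_X": {"Writer"},
    "Scientist_Y": {"Scientist"},
}
probs = conditional_type_probabilities(triples, types)
print(probs["starring"])
print(probs["author^-1"])
```

In this toy set both resources with an outgoing `starring` are films, while only one of the two resources with an incoming `author` is a writer, so the second property is a weaker indicator.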


SDType Approach
• Based on precompiled statistics
– Find types of instances
– Using voting
• score(C) = avg(all properties p of the resource) P(C|p)
• Refinement:
– Weight for properties: discriminative power
– weight(p) = sum(all classes C) (P(C) - P(C|p))²
– i.e., how strongly this property's class distribution deviates from the overall class distribution
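The voting step with property weights could look roughly like the following sketch. The weight formula is the one given above; combining the votes as a weighted average normalised by the sum of the weights is an assumption of this example, since the slides only state the unweighted average. The function names and the toy distributions are hypothetical.

```python
def property_weight(prior, conditional):
    """weight(p) = sum over classes C of (P(C) - P(C|p))^2."""
    classes = set(prior) | set(conditional)
    return sum(
        (prior.get(c, 0.0) - conditional.get(c, 0.0)) ** 2 for c in classes
    )


def type_scores(properties, prior, cond):
    """Weighted average of P(C|p) over the properties seen on one resource.

    Assumption: votes are normalised by the sum of the property weights.
    """
    weights = {p: property_weight(prior, cond[p]) for p in properties}
    total = sum(weights.values())
    if total == 0:
        return {}  # no property is discriminative, so no vote is cast
    classes = {c for p in properties for c in cond[p]}
    return {
        c: sum(weights[p] * cond[p].get(c, 0.0) for p in properties) / total
        for c in classes
    }


# Toy distributions, invented for illustration.
prior = {"Film": 0.5, "Writer": 0.5}
cond = {
    "starring": {"Film": 1.0},               # highly discriminative
    "label": {"Film": 0.5, "Writer": 0.5},   # same as the prior, weight 0
}
scores = type_scores(["starring", "label"], prior, cond)
print(scores)
```

Here `label` has the same class distribution as the prior and therefore gets weight 0, so only `starring` contributes to the vote.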


Evaluation
• Two-fold evaluation
– On DBpedia and OpenCyc as "silver standard" (automatic, 10,000 random instances)
– On untyped DBpedia resources (manual, 100 instances)
• Using only incoming properties
– Using outgoing properties is trivial!
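An automatic comparison against a silver standard could be computed along these lines. The slides do not describe the exact protocol, so this sketch assumes micro-averaging over (resource, type) pairs and a simple score threshold; the helper names, scores, and gold pairs are invented for illustration.

```python
def predict(scores, threshold):
    """Keep every (resource, type) pair whose score reaches the threshold."""
    return {
        (r, c)
        for r, class_scores in scores.items()
        for c, s in class_scores.items()
        if s >= threshold
    }


def precision_recall_f1(predicted, gold):
    """Micro-averaged precision, recall and F-measure over (resource, type) pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy scores and silver standard, invented for illustration.
scores = {
    "r1": {"Person": 0.9, "Actor": 0.3},
    "r2": {"Place": 0.6, "Award": 0.5},
}
gold = {("r1", "Person"), ("r1", "Actor"), ("r2", "Place")}

for threshold in (0.4, 0.7):
    p, r, f = precision_recall_f1(predict(scores, threshold), gold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Raising the threshold in this toy example trades recall for precision, which is the kind of trade-off a precision/recall curve makes visible.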


Evaluation Results
• On DBpedia
[Figure: precision (y-axis) over recall (x-axis), both from 0 to 1, with curves for min. 1 link, min. 10 links, and min. 25 links]


Evaluation Results
• On OpenCyc
[Figure: precision (y-axis) over recall (x-axis), both from 0 to 1, with curves for min. 1 link, min. 10 links, and min. 25 links]


Evaluation Results
• Evaluation on untyped resources
– Random sample of 100 untyped resources
– Manual checking of precision
[Figure: precision (left axis, 0 to 1) and # found types (right axis, 0 to 12) over the lower bound for the threshold (x-axis, 0.9 down to 0)]


Evaluation Results
• DBpedia:
– works reasonably well (F-measure 0.89)
• OpenCyc:
– harder because of deeper class hierarchy (F-measure 0.60)
• General:
– having more links increases precision (in contrast to RDFS reasoning)
– more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)


Deployment
• Heuristic types have been included in DBpedia 3.9
– for previously untyped instances
– 3.4 million type statements at precision ~0.95
• Also includes many resources without a Wikipedia page
– i.e., generated from a red link
• Runtime
– Complexity O(PT), where P is the number of property assertions and T is the number of type assertions
– ~24h for processing DBpedia


Conclusion and Outlook
• SDType approach works at high quality
– outperforms most state-of-the-art approaches on DBpedia
– deployed for DBpedia 3.9
• Same approach can be used for validating links
– within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
– across datasets: to be done
