Type Inference on Noisy RDF Data
1. Type Inference on Noisy RDF Data
10/31/13
Heiko Paulheim, Christian Bizer
2. The Problem
• One promise of the Semantic Web:
  – You can issue structured queries
  – e.g., "List all presidents that graduated from Harvard Law School"
  – SELECT ?x WHERE {
      ?x a dbpedia-owl:President .
      ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
3. The Problem
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• ...if we run this against DBpedia, we get one result
  – i.e., Elwell Stephen Otis
• But...
5. The Problem
• So what is going wrong?
• SELECT ?x WHERE {
    ?x a dbpedia-owl:President .
    ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School }
• In DBpedia, Barack Obama is not of type President!
• How can we add missing types?
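
The effect of a missing type statement can be reproduced on a toy graph. The sketch below is only an illustration: the three triples are a hand-made stand-in for DBpedia (not actual query results), and `presidents_from` is a hypothetical helper that evaluates the two triple patterns of the query above.

```python
RDF_TYPE = "rdf:type"

# Hand-made toy triples standing in for DBpedia (illustrative only).
triples = {
    ("dbpedia:Elwell_Stephen_Otis", RDF_TYPE, "dbpedia-owl:President"),
    ("dbpedia:Elwell_Stephen_Otis", "dbpedia-owl:almaMater", "dbpedia:Harvard_Law_School"),
    # Barack Obama has the almaMater link but no President type statement.
    ("dbpedia:Barack_Obama", "dbpedia-owl:almaMater", "dbpedia:Harvard_Law_School"),
}


def presidents_from(school, graph):
    """Return subjects typed as President that also have almaMater = school."""
    typed = {s for s, p, o in graph if p == RDF_TYPE and o == "dbpedia-owl:President"}
    alumni = {s for s, p, o in graph if p == "dbpedia-owl:almaMater" and o == school}
    return sorted(typed & alumni)


before = presidents_from("dbpedia:Harvard_Law_School", triples)

# Adding the missing type statement makes the second resource match as well.
fixed = triples | {("dbpedia:Barack_Obama", RDF_TYPE, "dbpedia-owl:President")}
after = presidents_from("dbpedia:Harvard_Law_School", fixed)

print(before)
print(after)
```

Both patterns must match for a resource to be returned, so a single absent type statement is enough to drop an otherwise correct answer.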
6. Is It a Big Problem?
• DBpedia has at least 2.7 million missing type statements
  – w.r.t. the DBpedia ontology
  – found using co-occurrence analysis of matching classes in YAGO and DBpedia
  – a very optimistic lower bound
• Highly incomplete classes:
  – Species: >870,000 missing statements
  – Person: >510,000 missing statements
  – Event: >150,000 missing statements
7. A Naive Approach
• Idea: exploit properties with domain and range
• Pseudo RDFS Reasoning:
  – CONSTRUCT { ?x a ?t }
    WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
            UNION
            { ?y ?r ?x . ?r rdfs:range ?t } }
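
The two branches of the CONSTRUCT query can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the schema and data triples are an assumed toy graph, and `infer_types` is a hypothetical helper that applies the domain rule to subjects and the range rule to objects exactly once.

```python
from collections import defaultdict

RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"


def infer_types(graph):
    """Apply the domain/range rules once; return a set of (resource, class) pairs."""
    domains, ranges = defaultdict(set), defaultdict(set)
    for s, p, o in graph:
        if p == RDFS_DOMAIN:
            domains[s].add(o)
        elif p == RDFS_RANGE:
            ranges[s].add(o)

    inferred = set()
    for s, p, o in graph:
        # ?x ?r ?y . ?r rdfs:domain ?t  =>  ?x a ?t
        inferred.update((s, t) for t in domains.get(p, ()))
        # ?y ?r ?x . ?r rdfs:range ?t   =>  ?x a ?t
        inferred.update((o, t) for t in ranges.get(p, ()))
    return inferred


# Assumed toy schema and one data statement (illustrative only).
graph = {
    ("dbpedia-owl:birthPlace", RDFS_DOMAIN, "dbpedia-owl:Person"),
    ("dbpedia-owl:birthPlace", RDFS_RANGE, "dbpedia-owl:Place"),
    ("dbpedia:Barack_Obama", "dbpedia-owl:birthPlace", "dbpedia:Honolulu"),
}

inferred = infer_types(graph)
print(sorted(inferred))
```

Every statement contributes its types unconditionally; nothing in the rules weighs one statement against another.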
8. A Naive Approach
• Experiment with Barack Obama:
  – Person, PersonFunction, Actor, Organisation
• Experiment with Germany:
  – Place, Award, PopulatedPlace, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation
9. A Naive Approach
• What is going on here?
  – DBpedia data is noisy
  – One wrong statement is enough for a wrong conclusion
  – e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany
• Germany example: 69,000 statements
  – 20 wrong types can come from 20 wrong statements
  – i.e., an error rate of 0.03% is enough for a completely wrong result
  – ...but that would be excellent data quality for a LOD source!
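
The arithmetic behind the 0.03% figure, and the way a single noisy statement produces a wrong type, can be checked with a short sketch. The statement counts and the example triple come from the slide; the range declaration for `award` is an assumption made for illustration.

```python
# Figures taken from the slide above.
total_statements = 69_000
wrong_statements = 20

error_rate = wrong_statements / total_statements
print(f"error rate: {error_rate:.2%}")

# Assumed range declaration (award -> Award), used only for illustration.
property_range = {"dbpedia-owl:award": "dbpedia-owl:Award"}

# The single noisy statement quoted on the slide.
subject, prop, obj = ("dbpedia:Kurt_H._Debus", "dbpedia-owl:award", "dbpedia:Germany")

# The range rule types the object of the statement, so Germany becomes an Award.
wrong_type = (obj, property_range[prop])
print(wrong_type)
```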
10. SDType Approach
• Idea: outgoing/incoming properties are indicators for a resource's type
  – e.g.: starring → Movie
  – e.g.: author⁻¹ → Writer (⁻¹ marks an incoming property)
• Basic compiled statistics
  – P(C|p) := probability of class C in presence of property p
  – e.g.: P(dbpedia:Film|starring) = 0.79
  – e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
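
One plausible way to compile such statistics is to count, for each property, how many of the resources using it carry each class. The sketch below assumes exactly that estimator, which the slides do not spell out; the resource names, the toy data and the helper `conditional_type_probabilities` are hypothetical. Incoming properties could be handled by adding them under a separate key such as `"author^-1"`.

```python
from collections import Counter, defaultdict


def conditional_type_probabilities(property_uses, types):
    """Estimate P(C|p) as the share of resources with property p that have class C.

    property_uses: iterable of (resource, property) pairs
    types: dict mapping a resource to the set of its classes
    """
    resources_with = defaultdict(set)
    for resource, prop in property_uses:
        resources_with[prop].add(resource)

    probs = {}
    for prop, resources in resources_with.items():
        counts = Counter(c for r in resources for c in types.get(r, ()))
        probs[prop] = {c: n / len(resources) for c, n in counts.items()}
    return probs


# Toy data: five resources use `starring`; four are films, one is a TV show.
property_uses = [(r, "starring") for r in ("m1", "m2", "m3", "m4", "t1")]
types = {
    "m1": {"Film"}, "m2": {"Film"}, "m3": {"Film"}, "m4": {"Film"},
    "t1": {"TelevisionShow"},
}

probs = conditional_type_probabilities(property_uses, types)
print(probs["starring"])
```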
11. SDType Approach
• Based on precompiled statistics
  – Find types of instances
  – Using voting
• score(C) = avg over all properties p of P(C|p)
• Refinement:
  – Weight for properties: discriminative power
  – weight(p) = Σ over all classes C of (P(C) − P(C|p))²
  – i.e., how strongly this property's class distribution deviates from the overall class distribution
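
The voting step and the weight formula can be combined as follows. This is a sketch under an assumption: the slides give the unweighted average and the weight formula separately, and the code assumes the weights enter as a normalized weighted average. The probabilities, the property `label` and both helper functions are hypothetical.

```python
def property_weight(prior, conditional):
    """weight(p) = sum over classes C of (P(C) - P(C|p))^2."""
    classes = set(prior) | set(conditional)
    return sum((prior.get(c, 0.0) - conditional.get(c, 0.0)) ** 2 for c in classes)


def type_scores(properties, prior, cond):
    """Weighted average of P(C|p) over the properties observed on a resource."""
    weights = {p: property_weight(prior, cond[p]) for p in properties}
    total = sum(weights.values())
    if total == 0:
        return {}  # no property is discriminative; nothing to vote with
    classes = {c for p in properties for c in cond[p]}
    return {
        c: sum(weights[p] * cond[p].get(c, 0.0) for p in properties) / total
        for c in classes
    }


# Toy statistics: `starring` is discriminative, `label` mirrors the prior.
prior = {"Film": 0.2, "Person": 0.5}
cond = {
    "starring": {"Film": 0.8, "Person": 0.1},
    "label": {"Film": 0.2, "Person": 0.5},
}

scores = type_scores(["starring", "label"], prior, cond)
plain_average = (cond["starring"]["Film"] + cond["label"]["Film"]) / 2
print(scores, plain_average)
```

Because `label` has the same class distribution as the data set overall, its weight is zero and it does not dilute the vote of `starring`, which a plain average would allow.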
12. Evaluation
• Two-fold evaluation
  – On DBpedia and OpenCyc as "Silver Standard" (automatic, 10,000 random instances)
  – On untyped DBpedia resources (manual, 100 instances)
• Using only incoming properties
  – Using outgoing properties is trivial!
15. Evaluation Results
• Evaluation on untyped resources
  – Random sample of 100 untyped resources
  – Manual checking of precision
[Figure: precision and number of found types, plotted against the lower bound for the threshold]
16. Evaluation Results
• DBpedia:
  – works reasonably well (F-measure 0.89)
• OpenCyc:
  – harder because of deeper class hierarchy (F-measure 0.60)
• General:
  – having more links increases precision (in contrast to RDFS reasoning)
  – more general types (e.g., Band) are easier than specific ones (e.g., PunkRockBand)
17. Deployment
• Heuristic types have been included in DBpedia 3.9
  – for previously untyped instances
  – 3.4 million type statements at precision ~0.95
• Also includes many resources without a Wikipedia page
  – i.e., generated from a red link
• Runtime
  – Complexity O(PT)
    P: number of property assertions
    T: number of type assertions
  – ~24h for processing DBpedia
18. Conclusion and Outlook
• SDType approach works at high quality
  – outperforms most state-of-the-art approaches on DBpedia
  – deployed for DBpedia 3.9
• Same approach can be used for
  – validating links
    – within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)
    – across datasets: to be done
19. Type Inference on Noisy RDF Data