But what do we actually know - On knowledge base recall

But what do we actually know?
On knowledge base recall
Simon Razniewski
Free University of Bozen-Bolzano, Italy

Background
• TU Dresden, Germany: Diplom (Master) 2010
• Free University of Bozen-Bolzano, Italy
• 2011 - 2014: PhD (on reasoning about data completeness)
• 2014 - now: Fixed-term assistant professor
• Research visits at UCSD (2012), AT&T Labs-Research (2013), UQ (2015)
Bolzano: trilingual, Ötzi, 1/8th of EU apples

How complete are knowledge bases? (= recall)

KBs are pretty incomplete
DBpedia: contains 6 out of 35 Dijkstra Prize winners
YAGO: the average number of children per person is 0.02
Google Knowledge Graph: “Points of Interest” – Completeness?

KBs are pretty complete
Wikidata: 2 out of 2 children of Obama
Google Knowledge Graph: 36 out of 48 Tarantino movies
DBpedia: 167 out of 199 Nobel laureates in Physics

[Dong et al., KDD 2014]
KB engineers have only tried to make KBs bigger. The point, however, is to understand what they are actually trying to approximate.
There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.

Knowledge Bases as seen by [Rumsfeld, 2002]
Known knowns: The plain facts in a KB
• Trump’s birth date
• Hillary’s nationality
• …
Known unknowns: The easy stuff
• NULL values/blank nodes
• Missing functional/mandatory values
Unknown unknowns: The interesting rest
• Are all children of John there?
• Does Mary play a musical instrument?
• Does Bob have other nationalities?
• …
This is not KB completion, which would ask instead:
• What other children does John have?
• Which instruments does Mary play?
• Which nationalities does Bob have?

Outline
1. Assessing completeness from inside the KB
a) Rule mining
b) Classification
2. Assessing completeness using text
c) Cardinalities
d) Recall-aware information extraction
3. Presenting the completeness of KBs
4. The meaning of it all
e) When is an entity complete?
f) When is an entity more complete than another?
g) Are interesting facts complete?

1. Assessing completeness from inside the KB

1a) Rule Mining [Galarraga et al., WSDM 2017]
hockeyPlayer(x) ⇒ Incomplete(x, hasChild)
scientist(x), hasWonNobelPrize(x) ⇒ Complete(x, graduatedFrom)
Challenge: No proper theory for consensus across multiple rules
human(x) ⇒ Complete(x, graduatedFrom)
teacher(x) ⇒ Incomplete(x, graduatedFrom)
professor(x) ⇒ Complete(x, graduatedFrom)
John ∈ (human, teacher, professor) ⇒ Complete(John, graduatedFrom)?
⇒ Maybe the wrong approach?
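The consensus problem above can be made concrete with a small sketch. It encodes the slide's example rules as plain data and collects every verdict that fires for John. The names `Rule`, `verdicts`, and `has_conflict` are illustrative assumptions, not part of the cited rule-mining system, and the sketch deliberately makes no attempt to resolve the conflict.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Rule(NamedTuple):
    body: FrozenSet[str]  # classes/conditions the entity must satisfy
    predicate: str        # predicate the rule talks about
    complete: bool        # True: Complete(x, p); False: Incomplete(x, p)


# The example rules from the slide, written as data (hypothetical encoding).
RULES = [
    Rule(frozenset({"human"}), "graduatedFrom", True),
    Rule(frozenset({"teacher"}), "graduatedFrom", False),
    Rule(frozenset({"professor"}), "graduatedFrom", True),
    Rule(frozenset({"scientist", "hasWonNobelPrize"}), "graduatedFrom", True),
    Rule(frozenset({"hockeyPlayer"}), "hasChild", False),
]


def verdicts(classes: Set[str], predicate: str) -> List[bool]:
    """Return the verdict of every rule whose body is satisfied by the entity."""
    return [
        rule.complete
        for rule in RULES
        if rule.predicate == predicate and rule.body <= classes
    ]


def has_conflict(classes: Set[str], predicate: str) -> bool:
    """True if the firing rules disagree about completeness."""
    return len(set(verdicts(classes, predicate))) > 1


john = {"human", "teacher", "professor"}
print(verdicts(john, "graduatedFrom"))      # [True, False, True]
print(has_conflict(john, "graduatedFrom"))  # True
```

Two rules say complete and one says incomplete; whether to take a majority vote, the most specific rule, or the most confident rule is exactly the open question on the slide.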

1b) A classification problem
Input:
Entity e
Predicate p
Question:
Are all triples (e, p, _) in the KB?
Output:
Yes/No
Features: Facts, popularity measures, textual context, …
Training data: Crowdsourcing under constraints, deletion, popularity, …
Obama
hasChild
(Obama, hasChild , _)
Yes (Wikidata)
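One of the training-data sources listed above is deletion. A minimal sketch of one possible reading: start from (entity, predicate) pairs that are assumed to be complete, keep them as positive examples, and remove one object to obtain an artificially incomplete negative example. The function name `make_training_examples` and the data layout are assumptions made for illustration; the slide does not specify the procedure.

```python
import random
from typing import Dict, List, Set, Tuple

Key = Tuple[str, str]             # (entity, predicate)
Example = Tuple[Key, Set[str], bool]  # (key, visible objects, is_complete)


def make_training_examples(
    complete_facts: Dict[Key, Set[str]], rng: random.Random
) -> List[Example]:
    """Build labelled examples by deleting one object from complete object sets."""
    examples: List[Example] = []
    for key, objects in sorted(complete_facts.items()):
        # Positive example: all objects are present, so the answer is "Yes".
        examples.append((key, set(objects), True))
        if objects:
            # Negative example: hide one object, so the answer is "No".
            dropped = rng.choice(sorted(objects))
            examples.append((key, objects - {dropped}, False))
    return examples


# Assumed-complete seed data (Wikidata lists 2 out of 2 children of Obama).
seed = {("Obama", "hasChild"): {"Malia", "Sasha"}}
for example in make_training_examples(seed, random.Random(0)):
    print(example)
```

A classifier trained on such pairs would then consume features like those named on the slide (facts, popularity measures, textual context) rather than the raw object sets shown here.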

2c) Cardinality extraction [Mirza et al., Poster@ISWC 2016]
Text: “Barack and Michelle have two children, and […]”
Manually created patterns to extract children cardinalities from Wikipedia
⇒ Found that about 2k entities have complete children, 84k have incomplete children
⇒ Found evidence for 178% more children than currently in Wikidata
• Especially intriguing for long-tail entities
Open: Automation, other relations
Children in KB: 0 ⇒ Recall: 0%
Children in KB: 1 ⇒ Recall: 50%
Children in KB: 2 ⇒ Recall: 100%
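The idea can be sketched in a few lines: extract the stated number of children from text with a hand-written pattern, then divide the number of children in the KB by it. The single regular expression and the names `extract_child_cardinality` and `recall` are illustrative assumptions; the patterns actually used in the cited work are not given on the slide.

```python
import re
from typing import Optional

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Hypothetical pattern: "<has|have|had> <number> children".
PATTERN = re.compile(
    r"\b(?:has|have|had)\s+(\d+|" + "|".join(NUMBER_WORDS) + r")\s+children\b",
    re.IGNORECASE,
)


def extract_child_cardinality(text: str) -> Optional[int]:
    """Return the number of children stated in the text, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


def recall(kb_count: int, expected: int) -> float:
    """Share of the expected objects that the KB contains, capped at 1.0."""
    if expected == 0:
        return 1.0
    return min(kb_count / expected, 1.0)


expected = extract_child_cardinality("Barack and Michelle have two children, and")
for kb_count in (0, 1, 2):
    print(kb_count, f"{recall(kb_count, expected):.0%}")
```

For the example sentence the pattern yields 2, so KB counts of 0, 1, and 2 give recall 0%, 50%, and 100%, matching the slide.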

2d) Recall-aware Information Extraction
Textual information extraction is usually precision-aware
“John was born in Malmö, Sweden, on […].” ⇒ citizenship(John, Sweden) – precision 95%
“John grew up in Malmö, Sweden and […]” ⇒ citizenship(John, Sweden) – precision 80%
What about making it recall-aware?
“John has a son, Tom, and a daughter, Susan.” ⇒ hc(John, Tom), hc(John, Susan) – recall?
“John brought his children Susan and Tom to school.” ⇒ hc(John, Tom), hc(John, Susan) – recall?

How complete is Wikidata for children?

Select attribute to analyse: hasChild, date of birth, party membership, …
Facets
Occupation
• Politician 7.5%
• Soccer player 3.3%
• Lawyer 8.1%
• Other 2.2%
Nationality
• USA 3.8%
• India 2.7%
• China 2.2%
• England 5.5%
• …
Century of birth
• <15th century 1.1%
• 16th century 1.4%
• …
Gender
• Male 4.3%
• Female 3.9%
Extrapolated completeness: 30.8%
Known completeness: 2.7%
Based on:
• There are 5371 people of this kind
• Of these, 231 have children
• For these, Wikipedia says there should be 750 children
• Average number of children of complete entities is 2.3
• Average number of children of unknown people is 0.01
• …

4e) When is data about an entity complete?
Complete(Justin Bieber)?
• Musician: birth date, musical instrument played, band
• Scientist: alma mater, field, advisor, awards
• Politician: party, public positions held
⇒ What about musicians playing in an orchestra?
⇒ What about scientists who are also engaged in politics?
…
• Interestingness is relative (“birth date more interesting than handedness”)
• Long tail of rare properties
⇒ Some work on ranking predicates by relevance
Shortcoming: Mostly descriptive (see e.g. Wikidata Property Suggester)

4f) When is an entity more complete than another?
Is data about Obama more complete than about Trump?
Goal: A notion of relative completeness
Is data about Ronaldo more complete than about Justin Bieber?
…
Crowd studies: relative completeness = fact count?
Available as user script for Wikidata (Recoin - Relative Completeness Indicator)

https://www.wikidata.org/wiki/User:Ls1g/Recoin
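The slides leave open how relative completeness would be computed. One possible reading, offered purely as an assumption and not as a description of how Recoin works internally, is to compare an entity's properties with the properties that are frequent among similar entities. The function `missing_frequent_properties`, the threshold `min_share`, and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def missing_frequent_properties(
    entity_props: Set[str],
    peer_props: Iterable[Set[str]],
    min_share: float = 0.5,
) -> List[str]:
    """List properties that at least `min_share` of peers have but the entity lacks."""
    peers = [set(props) for props in peer_props]
    if not peers:
        return []
    # Count in how many peers each property occurs.
    counts = Counter(prop for props in peers for prop in props)
    frequent = {p for p, c in counts.items() if c / len(peers) >= min_share}
    return sorted(frequent - entity_props)


# Toy peer group (hypothetical data, not taken from Wikidata).
peers = [
    {"occupation", "date of birth", "member of sports team"},
    {"occupation", "date of birth"},
    {"date of birth", "handedness"},
]
print(missing_frequent_properties({"occupation"}, peers))  # ['date of birth']
```

Under this reading, an entity with fewer missing frequent properties than another would count as relatively more complete, which differs from a plain fact count.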

4g) Are interesting facts complete?
LIGO:
Proved the existence of gravitational waves, which Einstein had predicted about 100 years earlier
Galileo Galilei:
Contrary to the dogma of the time, postulated that the earth orbits the sun
Reinhold Messner:
First person to climb all mountains above 8000 m without supplemental oxygen
These are not elementary triples
FirstPersonToClimbAllMountainsAbove8000Without(Supplemental oxygen, Reinhold Messner)
1. What are these? Events? Sets of triples? Queries?
2. Where can we get the interestingness score from? Entropy? PageRank? Text frequency?
3. Completeness depends on completeness of context!

Summary
1. Assessing completeness from inside the KB
a) Rule mining
b) Classification
2. Assessing completeness using text
c) Cardinalities
d) Recall-aware information extraction
3. Presenting the completeness of KBs
4. The meaning of it all
e) When is an entity complete?
f) When is an entity more complete than another?
g) Are interesting facts complete?
…meet me in room 416
