Using Knowledge Graphs in Data Science - From Symbolic to Latent Representations (and a Few Steps Back)

Heiko Paulheim
University of Mannheim
8/11/21

Brief Introduction
• Timeline: 2006, 2008, 2011, 2013, 2014, 2017
• Career stages: Pre-PhD years, PhD years, PostDoc years, Assistant Prof., Full Prof.
• Projects and tools: SDType, rdf2vec, ReNewRS, Kare§KoKI, MELT

Knowledge Graphs: At a Glance
• Graph shaped knowledge representation
– nodes: entities
– edges: relations
[Figure: example knowledge graph with the nodes Heiko Paulheim, DWS Group, Mannheim, Baden-Württemberg, and Germany, and the edges employer, affiliation, residence, state, and part of]
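A minimal sketch of the representation described above, storing the graph as (subject, relation, object) triples in Python. The node and relation labels come from the figure, but which relation connects which pair of nodes is an assumption made for illustration, as are the helper names `build_index` and `neighbors`.

```python
from collections import defaultdict

# Assumed wiring: the labels appear in the figure, the exact edges are a guess.
TRIPLES = [
    ("Heiko_Paulheim", "affiliation", "DWS_Group"),
    ("Heiko_Paulheim", "residence", "Mannheim"),
    ("Mannheim", "state", "Baden-Württemberg"),
    ("Baden-Württemberg", "part of", "Germany"),
]


def build_index(triples):
    """Index outgoing edges per node: node -> list of (relation, target)."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return index


def neighbors(index, node, relation=None):
    """Return the targets reachable from node, optionally for one relation only."""
    return [target for rel, target in index.get(node, [])
            if relation is None or rel == relation]


INDEX = build_index(TRIPLES)
print(neighbors(INDEX, "Mannheim", "state"))
```

Entities are nodes and relations are labeled, directed edges, so a list of triples is enough to hold the whole graph.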

Knowledge Graphs in Organizations
• Knowledge Graphs are used…
• …in companies and organizations
– collect, organize, and integrate knowledge
– link isolated information sources
– make information searchable and findable
Masuch et al., 2016

Public Knowledge Graphs
• Knowledge Graphs are used…
• …as (free), public resources
– collect common knowledge
– general purpose, not task specific
– make it easy to build knowledge-intensive applications

Usage of Public Knowledge Graphs
• Question: “OK, Google, when will the final season of Money Heist be on Netflix?”
• Answer: “The fifth season of Money Heist will be released on September 3rd.”
[Figure: knowledge graph excerpt around Money Heist with the edges has part, release date (2021-09-03 and 2020-04-03), creator, and cast]
• Follow-up question: “Are there any other series by the same creator?”
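The two questions above can be read as short graph traversals. The following sketch assumes a tiny triple list; the season, creator, and second-series identifiers are hypothetical placeholders, and only the relation names and the two release dates are taken from the figure. Interpreting "final season" as the part with the latest release date is also an assumption.

```python
from datetime import date

# Hypothetical identifiers; only relation names and dates come from the slide.
TRIPLES = [
    ("Money_Heist", "has part", "Money_Heist_Season_4"),
    ("Money_Heist", "has part", "Money_Heist_Season_5"),
    ("Money_Heist_Season_4", "release date", date(2020, 4, 3)),
    ("Money_Heist_Season_5", "release date", date(2021, 9, 3)),
    ("Money_Heist", "creator", "Creator_X"),
    ("Other_Series", "creator", "Creator_X"),
]


def objects(subject, relation):
    """All objects o with a triple (subject, relation, o)."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """All subjects s with a triple (s, relation, obj)."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


def final_season_release(series):
    """Latest release date among the parts of a series."""
    dates = [d for part in objects(series, "has part")
             for d in objects(part, "release date")]
    return max(dates)


def other_series_by_same_creator(series):
    """Follow the creator edge forward, then backward to other series."""
    found = []
    for creator in objects(series, "creator"):
        for candidate in subjects("creator", creator):
            if candidate != series and candidate not in found:
                found.append(candidate)
    return found


print(final_season_release("Money_Heist"))
print(other_series_by_same_creator("Money_Heist"))
```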

History: Cyc
• The beginning
– Encyclopedic collection of knowledge
– Started by Douglas Lenat in 1984
– Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present (as of June 2017)
– ~1,000 person years, $120M total development cost
– 21M axioms and rules
– Declared “ready to use” in 2017

History: Freebase
• The 2000s
– Freebase: collaborative editing
– Schema not fixed
• Present
– Acquired by Google in 2010
– Powered the first version of Google’s Knowledge Graph
– Shut down in 2016
– Partly lives on in Wikidata (see in a minute)

History: Wikidata
• The 2010s
– Wikidata: launched 2012
– Goal: centralize data from Wikipedia languages
– Collaborative
– Imports other datasets
• Present
– One of the largest public knowledge graphs
– Includes rich provenance

History: DBpedia & co.
• The 2000s
– DBpedia: launched 2007
– YAGO: launched 2008
– Extraction from Wikipedia using mappings & heuristics
• Present
– Two of the most used knowledge graphs
– ...with Wikidata catching up

History: NELL
• The 2010s
– NELL: Never-Ending Language Learner
– Input: ontology, seed examples, text corpus
– Output: facts, text patterns
– Large degree of automation, occasional human feedback
• Until 2018
– Continuously ran for ~8 years
– New release every few days
http://rtw.ml.cmu.edu/rtw/overview

Knowledge Graph Creation
• Sources for generating knowledge graphs:
– Manual (also: crowd sourcing) curation
• Cyc, Freebase, Wikidata, ...
– (Semi-)structured knowledge (Wikis, databases, …)
• DBpedia, YAGO, BabelNet, ...
– Unstructured text or web page collections
• NELL, DeepDive, ReVerb, …

Knowledge Graph Creation – Ongoing Projects
• WebIsA & WebIsALOD
– 400M hypernyms extracted from a Web Crawl
Seitner et al. (2016): A Large DataBase of Hypernymy Relations Extracted from the Web

• DBkWik
– Harvesting data from 400k Wikis
Paulheim & Hertling (2018): DBkWik: A consolidated knowledge graph from thousands of Wikis

• CaLiGraph
– Learning analogies, e.g., from lists
Heist (2018): Towards Knowledge Graph Construction from Entity Co-occurrence

Use Cases for Knowledge Graphs
• Background Knowledge
– e.g., company data (address, CEO, branch, …)
→ SAP CRM (BSc thesis 2019)
– e.g., geographic regions (demographics)
→ for example, sales data prediction
– data interpretation (e.g., Excel tables, business models)
→ PhD thesis under supervision
• Data Integration
– unified view of different data sources
– relating business entities in different systems
– cross-source data visualization and analytics

Knowledge Graphs in Data Science
• Typical cases:
– predictive modeling, information retrieval, recommendation, …
• For all of those, there are sophisticated implementations
– but...

Wanted: A Bridge between Both Worlds

• Data Science tools for prediction etc.
– Python, Weka, R, RapidMiner, …
– Algorithms that work on vectors, not graphs
• Bridges built over the past years:
– FeGeLOD (Weka, 2012), RapidMiner LOD Extension (2015), Python KG Extension (2021)

• Transformation strategies (aka propositionalization)
– e.g., types: type_horror_movie=true
– e.g., data values: year=2011
– e.g., aggregates: nominations=7
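The three transformation strategies can be sketched as a small function that turns the triples of one entity into a flat feature dictionary. The movie identifier, the award identifiers, and the function name `propositionalize` are hypothetical; the feature names mirror the examples in the list above.

```python
from collections import Counter

# Hypothetical description of one movie as triples.
TRIPLES = [
    ("Movie_A", "type", "horror_movie"),
    ("Movie_A", "year", 2011),
] + [("Movie_A", "nomination", f"Award_{i}") for i in range(1, 8)]


def propositionalize(entity, triples, aggregate_relations=("nomination",)):
    """Turn the outgoing triples of one entity into a flat feature dict."""
    features = {}
    counts = Counter()
    for subject, relation, obj in triples:
        if subject != entity:
            continue
        if relation == "type":
            # Types become binary features.
            features[f"type_{obj}"] = True
        elif relation in aggregate_relations:
            # Selected relations are aggregated by counting.
            counts[relation] += 1
        elif isinstance(obj, (int, float)):
            # Numeric data values are copied as they are.
            features[relation] = obj
    for relation, count in counts.items():
        features[f"{relation}s"] = count
    return features


print(propositionalize("Movie_A", TRIPLES))
```

The resulting dictionary is one row of a feature table, which is the input format that vector-based tools expect.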

• Observations with simple propositionalization strategies
– Even simple features (e.g., add all numbers and types)
can help on many problems
– More sophisticated features often bring additional improvements
• Combinations of relations and individuals
– e.g., movies directed by Steven Spielberg
• Combinations of relations and types
– e.g., movies directed by Oscar-winning directors
• …
– But
• The search space is enormous!
• Generate first, filter later does not scale well

• Excursion: word embeddings
– word2vec proposed by Mikolov et al. (2013)
– predict a word from its context or vice versa
– usually trained on large text corpora
• Idea: similar words appear in similar contexts, like
– Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976
– Google was officially founded as a company in September 1998
• projection layer: embedding vectors
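To make "similar words appear in similar contexts" concrete, the sketch below builds the (center, context) pairs that a skip-gram model is trained on and compares the contexts of two company names. The shortened sentences, the window size of 3, and the function names are assumptions for illustration; no actual model is trained here.

```python
def skipgram_pairs(tokens, window=3):
    """Collect (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def contexts_of(word, tokens, window=3):
    """Set of context words observed around one word."""
    return {ctx for center, ctx in skipgram_pairs(tokens, window) if center == word}


# Shortened versions of the two example sentences, punctuation removed.
sentence_a = "Jobs Wozniak and Wayne founded Apple".split()
sentence_b = "Google was officially founded as a company".split()

shared = contexts_of("Apple", sentence_a) & contexts_of("Google", sentence_b)
print(shared)
```

Both names share the context word "founded", which is the kind of signal that pulls their vectors together during training.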

From Word Embeddings to Graph Embeddings
• Basic idea:
– extract random walks from an RDF graph, e.g.:
Mulholland Dr. → director → David Lynch → nationality → US
– feed walks into word2vec algorithm
• Order of magnitude (e.g., DBpedia)
– ~6M entities (“words”)
– start up to 500 random walks per entity, length up to 8
→ corpus of >20B tokens
• Result:
– node embeddings
– most often outperform other propositionalization techniques
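The walk extraction step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the RDF2vec reference implementation: the adjacency-dictionary format and the function names are made up, and the toy graph contains only the two edges from the example walk.

```python
import random

# Toy graph: node -> list of (predicate, target), taken from the example walk.
GRAPH = {
    "Mulholland_Dr.": [("director", "David_Lynch")],
    "David_Lynch": [("nationality", "US")],
    "US": [],
}


def random_walk(graph, start, max_hops, rng):
    """Follow randomly chosen outgoing edges, recording predicates and nodes."""
    walk = [start]
    node = start
    for _ in range(max_hops):
        edges = graph.get(node, [])
        if not edges:
            break  # dead end: stop the walk early
        predicate, node = rng.choice(edges)
        walk.extend([predicate, node])
    return walk


def extract_walks(graph, walks_per_entity, max_hops, seed=0):
    """Start a fixed number of walks from every entity in the graph."""
    rng = random.Random(seed)
    return [random_walk(graph, entity, max_hops, rng)
            for entity in graph
            for _ in range(walks_per_entity)]


walks = extract_walks(GRAPH, walks_per_entity=3, max_hops=4)
print(walks[0])
```

Each walk is a list of tokens, so the full list of walks has the same shape as a tokenized text corpus and can be passed to a word2vec implementation as sentences.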

A First Glance at RDF2vec Embeddings
• Observation: close projection of similar entities

Random vs. non-random
• Maybe random walks are not such a good idea
– They may give too much weight to less-known entities and facts
• Strategies:
– Prefer edges with more frequent predicates
– Prefer nodes with higher indegree
– Prefer nodes with higher PageRank
– …
– They may cover less-known entities and facts too little
• Strategies:
– The opposite of all of the above strategies
• External signals (e.g., human notions of importance)
– generally work better than graph-internal signals
Cochez et al. (2017): Biased Graph Walks for RDF Graph Embeddings
Al Taweel and Paulheim (2020): Towards Exploiting Implicit Human Feedback for Improving RDF2vec Embeddings
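One of the strategies above, preferring edges with more frequent predicates, and its opposite can be sketched as a weighted choice of the next edge. The toy graph, the function names, and the use of raw counts (or their inverse) as weights are assumptions; the cited papers may weight edges differently.

```python
import random
from collections import Counter

# Toy graph: node -> list of (predicate, target). "location" occurs three
# times in the graph, "mayor" only once.
GRAPH = {
    "A": [("location", "B"), ("mayor", "C")],
    "D": [("location", "B")],
    "E": [("location", "B")],
}


def predicate_frequencies(graph):
    """Count how often each predicate occurs in the whole graph."""
    return Counter(p for edges in graph.values() for p, _ in edges)


def edge_weights(graph, node, freq, prefer_frequent=True):
    """Weight each outgoing edge by predicate frequency or its inverse."""
    return [freq[p] if prefer_frequent else 1.0 / freq[p]
            for p, _ in graph.get(node, [])]


def biased_step(graph, node, freq, rng, prefer_frequent=True):
    """Pick the next edge with probability proportional to its weight."""
    edges = graph.get(node, [])
    if not edges:
        return None
    weights = edge_weights(graph, node, freq, prefer_frequent)
    return rng.choices(edges, weights=weights, k=1)[0]


FREQ = predicate_frequencies(GRAPH)
print(edge_weights(GRAPH, "A", FREQ))
print(biased_step(GRAPH, "A", FREQ, random.Random(0)))
```

Setting `prefer_frequent=False` inverts the weights, which corresponds to the opposite strategies that give more coverage to less-known facts.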

Local Embeddings
• Recap: order of magnitude (e.g., DBpedia)
– ~6M entities (“words”)
– start up to 500 random walks per entity, length up to 8
→ corpus of >20B tokens
– “Train once, reuse often”
• In some cases, only a small subset (of 6M) is of interest
– RDF2vec light: “train when needed”
– Runtime: minutes instead of days
Portisch et al. (2020): RDF2Vec Light – A Lightweight Approach for Knowledge Graph Embeddings

RDF2vec: Example Applications
• Data Model Matching with WebIsA and RDF2vec
Portisch et al. (2019): Evaluating ontology matchers on real-world financial services data models.

• Entity disambiguation: linking texts to a knowledge graph
Türker et al. (2019): Knowledge-Based Short Text Categorization Using Entity and Category Embedding

• Finding related research papers on COVID-19
Steenwinckel et al. (2020): Facilitating COVID-19 Meta-analysis Through a Literature Knowledge Graph

• Table search by keyword
Zhang and Balog (2018): Ad Hoc Table Retrieval using Semantic Similarity.

• Predicting biological interactions
Sousa et al. (2021): Supervised Semantic Similarity.

• Zero-Shot Image Classification
Hascoet et al. (2017): Semantic Web and Zero-Shot Learning of Large Scale Visual Classes.

Embeddings for Link Prediction
• RDF2vec example
– similar instances form clusters, direction of relation is ~stable
– link prediction by analogy reasoning (Japan – Tokyo ≈ China – Beijing)
Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
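Link prediction by analogy can be sketched with plain vector arithmetic: if the country-to-capital direction is roughly stable, then v(Tokyo) − v(Japan) + v(China) should land near v(Beijing). The 2-D vectors below are hand-made for illustration and are not taken from any trained RDF2vec model; the function name is an assumption.

```python
import math

# Hand-made vectors in which the country -> capital offset is roughly stable.
VECTORS = {
    "Japan": (1.0, 1.0), "Tokyo": (1.0, 3.0),
    "China": (4.0, 1.5), "Beijing": (4.1, 3.4),
    "Germany": (7.0, 0.5), "Berlin": (6.9, 2.6),
}


def predict_by_analogy(a, b, c, vectors):
    """Return the entity closest to v(b) - v(a) + v(c), excluding a, b, and c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [e for e in vectors if e not in (a, b, c)]
    return min(candidates, key=lambda e: math.dist(vectors[e], target))


print(predict_by_analogy("Japan", "Tokyo", "China", VECTORS))
```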

Embeddings for Link Prediction
• In RDF2vec, relation preservation is a by-product
• TransE (and its descendants): direct modeling
– Formulates RDF embedding as an optimization problem
– Find mapping of entities and relations to $\mathbb{R}^n$ so that
• across all triples <s,p,o>, $\sum \lVert s + p - o \rVert$ is minimized
• try to obtain a smaller error for existing triples than for non-existing ones
Bordes et al: Translating Embeddings for Modeling Multi-relational Data. NIPS 2013.
Fan et al.: Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories. WI 2016
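The optimization target above can be sketched as a scoring function plus a margin-based comparison between an existing and a non-existing triple. The toy vectors, the entity names, and the margin value of 1.0 are assumptions; a real training loop would adjust the vectors by gradient descent, which is omitted here.

```python
import math


def transe_score(s, p, o):
    """L2 norm of s + p - o; lower means the triple is more plausible."""
    return math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))


def margin_loss(positive, negative, margin=1.0):
    """Zero once the existing triple's error is smaller by at least the margin."""
    return max(0.0, margin + transe_score(*positive) - transe_score(*negative))


# Toy 2-D embeddings (hypothetical values).
germany = (0.0, 0.0)
capital = (1.0, 1.0)   # relation vector
berlin = (1.0, 1.0)
tokyo = (5.0, 0.0)

existing = (germany, capital, berlin)      # <Germany, capital, Berlin>
non_existing = (germany, capital, tokyo)   # corrupted object

print(transe_score(*existing), transe_score(*non_existing))
print(margin_loss(existing, non_existing))
```

With these vectors the existing triple has zero error, so the loss is already zero; swapping the two triples yields a positive loss that training would try to reduce.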

Link Prediction vs. Node Embedding
• Hypothesis:
– Embeddings for link prediction also cluster similar entities
– Node embeddings can also be used for link prediction
Portisch et al. (under review): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?

Similarity vs. Relatedness
• Closest 10 entities to Angela Merkel in different vector spaces
Portisch et al. (under review): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?

• (s-)RDF2vec allows an explicit trade-off with different walk strategies
[Figure: knowledge graph excerpt with the nodes Adler Mannheim, SAP Arena, Reiss-Engelhorn-Museum, Mannheim, Baden-Württemberg, and Germany, and the edges city, stadium, location, federal state, and country]
Walk Generation
“Classic” RDF2vec walks:
Adler_Mannheim → city → Mannheim → country → Germany
Adler_Mannheim → stadium → SAP_Arena → location → Mannheim
SAP_Arena → location → Mannheim → country → Germany
...
s-RDF2vec walks:
city → Mannheim → country
stadium → SAP_Arena → location
location → Mannheim → country
...
[Figure: embedding pipeline with the elements RDF2vec “union walks”, RDF2vec “classic” + RDF2vec “edge”, concatenated vector, global PCA, concatenated vector (task-specific subset), weights w1 and w2, (weighted) local PCA, and test cases]
Portisch et al. (under review): s-RDF2vec: Injecting Knowledge Graph Structure Into RDF2vec Entity Embeddings.
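In the three walk examples shown, each s-RDF2vec walk equals the corresponding classic walk with its first and last entity removed. The sketch below reproduces only that observed pattern and adds a simple weighted concatenation of two vectors; both function names are hypothetical, the way the weights are applied is an assumption, and neither is claimed to be the published s-RDF2vec procedure. The PCA step is left out.

```python
CLASSIC_WALKS = [
    ["Adler_Mannheim", "city", "Mannheim", "country", "Germany"],
    ["Adler_Mannheim", "stadium", "SAP_Arena", "location", "Mannheim"],
    ["SAP_Arena", "location", "Mannheim", "country", "Germany"],
]


def strip_endpoints(walk):
    """Drop the first and last entity, as seen in the slide's examples."""
    return walk[1:-1]


def weighted_concat(vec_a, vec_b, w_a=1.0, w_b=1.0):
    """Scale two vectors by their weights and join them into one vector."""
    return [w_a * x for x in vec_a] + [w_b * x for x in vec_b]


structure_walks = [strip_endpoints(w) for w in CLASSIC_WALKS]
print(structure_walks)
print(weighted_concat([1.0, 2.0], [3.0], w_a=0.5, w_b=2.0))
```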

• s-RDF2vec
– using different walk strategies
– combining different vector spaces (weighted combinations are possible)
• 10 closest neighbors to Mannheim:
Portisch et al. (under review): s-RDF2vec: Injecting Knowledge Graph Structure Into RDF2vec Entity Embeddings.

• Recap word embeddings:
– Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976
– Google was officially founded as a company in September 1998
• Graph walks:
– Hamburg → country → Germany → leader → Angela_Merkel
– Germany → leader → Angela_Merkel → birthPlace → Hamburg
– Hamburg → leader → Peter_Tschentscher → residence → Hamburg
[Figure: graph with the nodes Germany, Angela_Merkel, Hamburg, and Peter_Tschentscher, and the edges country, leader, birthPlace, and residence]

• Surrounding entities indicate relatedness
– Hamburg → country → Germany → leader → Angela_Merkel
• Same entities in similar positions indicate similarity
– Hamburg → leader → Peter_Tschentscher → residence → Hamburg
• Someone is a leader vs. something has a leader
• Solution approach: use an embedding approach that respects positions
– CWINDOW / Structured Skip-ngram
Portisch and Paulheim (2021): Putting RDF2vec in Order.
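The difference between "someone is a leader" and "something has a leader" becomes visible once context tokens keep their relative position. The sketch below contrasts an unordered context set with a position-tagged one for two of the example walks. It illustrates only the context representation, not the training of CWINDOW or structured skip-ngram models, and the function names and window size are assumptions.

```python
def bag_contexts(walk, index, window=1):
    """Classic view: context tokens around a position, without order."""
    lo, hi = max(0, index - window), min(len(walk), index + window + 1)
    return {walk[j] for j in range(lo, hi) if j != index}


def positional_contexts(walk, index, window=1):
    """Order-aware view: each context token is tagged with its offset."""
    lo, hi = max(0, index - window), min(len(walk), index + window + 1)
    return {(j - index, walk[j]) for j in range(lo, hi) if j != index}


walk_a = ["Hamburg", "leader", "Peter_Tschentscher", "residence", "Hamburg"]
walk_b = ["Hamburg", "country", "Germany", "leader", "Angela_Merkel"]

# Peter_Tschentscher is a leader (predicate before the entity),
# Germany has a leader (predicate after the entity).
bag_shared = bag_contexts(walk_a, 2) & bag_contexts(walk_b, 2)
pos_shared = positional_contexts(walk_a, 2) & positional_contexts(walk_b, 2)
print(bag_shared, pos_shared)
```

Without positions, both entities share the context "leader"; with positions, the overlap disappears because the predicate sits on opposite sides of the two entities.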

• Why bother?
– Use case: table interpretation (a special case of entity disambiguation)
[Figure: table interpretation example contrasting related and similar entities]

Back to Interpretability
• Hot topic: Explainable AI
– Knowledge Graphs are a favorable ingredient
– Human/machine interpretable knowledge → explainable systems
• However:
– Embeddings replace interpretable axioms with numeric vectors over non-interpretable dimensions
– Where did the semantics go?
Paulheim (2018): Make Embeddings Semantic Again!

Towards Semantic Vector Space Embeddings
[Figure: vector space plot with the labels cartoon and superhero]
Paulheim (2018): Make Embeddings Semantic Again!

• Approach 1: learn interpretation function
• Each dimension of the embedding model is a target for a separate learning problem
• Learn a function to explain the dimension
• E.g.: $y \approx -\lvert \exists\, character.Superhero \rvert$
• Just an approximation used for explanations and justifications

• Approach 2: learn inherently interpretable embeddings
• Step 1: learn typical patterns that exist in a knowledge graph
– e.g., graph pattern learning
– e.g., Horn clauses
• Step 2a: use those patterns as embedding dimensions
– probably not low dimensional
• Step 2b: compact the space
– e.g., use dimensions for mutually exclusive patterns

• Different angle: learn interpretation for similarity function
[Figure: similarity explained by the components ~similar type, ~same country, and ~connected to same entity]

Summary
• Knowledge Graphs are a versatile ingredient for AI
– Integrated view on data
– Large-scale free source of background knowledge
• Knowledge Graph Embeddings
– Effective processing of large-scale knowledge sources
– Encoding of similarity and/or relatedness
• RDF2vec: explicit trade-off is possible!
– Additional insights that are not explicit in the graph
• aka latent semantics

More on RDF2vec
• Collection of
– Implementations
– Pre-trained models
– >40 use cases in various domains

Thank you!
http://www.heikopaulheim.com
@heikopaulheim
