This document summarizes a presentation by Alexander Panchenko on semantic similarity measures and hybrid measures for semantic relation extraction. The presentation covers a pattern-based measure, a comparison of different measures, hybrid measures that combine multiple single measures, and two applications of semantic similarity measures: a lexico-semantic search engine and a filename categorization system. In the evaluation, supervised hybrid measures such as Logit outperform single measures in terms of precision and recall.
Similarity Measures for Semantic Relation Extraction
1. The Problem Pattern-Based Measure Comparison Hybrid Measures Applications
Montclair State University, Brown Bag Seminar (USA)
Alexander Panchenko
Université catholique de Louvain &
Digital Society Laboratory LLC
alexander.panchenko@uclouvain.be
May 2, 2014
Alexander Panchenko 1/52
Plan
1 The Context and the Problem
2 Pattern-Based Semantic Similarity Measure
3 Comparison of Similarity Measures
4 Hybrid Semantic Similarity Measures
5 Applications of Semantic Similarity Measures
Lexico-Semantic Search Engine “Serelex”
Filename Categorization System “iCOP”
Computational Lexical Semantics
* Picture is adapted from Computational Linguistics LINGI2263 course
http://www.uclouvain.be/en-cours-2013-LINGI2263.html
Introduction
Motivation
1 Synonyms, hypernyms and co-hyponyms are useful for:
text similarity (Šarić et al., 2012);
query expansion (Hsu et al., 2006);
question answering (Sun et al., 2005);
2 Manual resource construction is prohibitively expensive.
3 Automatic extractors do not match the quality of handcrafted resources.
Focus
Similarity-based semantic relation extraction.
Research Question
How to improve precision and coverage of such measures?
Semantic Resources
Definition
A semantic resource is an undirected graph (C, R):
nodes C represent terms;
edges R represent untyped semantic relations.
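The definition above can be sketched as a small data structure. This is only an illustration: the class name `SemanticResource` and its methods are hypothetical and do not come from the presentation.

```python
class SemanticResource:
    """Undirected graph (C, R): terms as nodes, untyped relations as edges."""

    def __init__(self):
        # term -> set of related terms (adjacency sets); hypothetical layout
        self._adj = {}

    def add_relation(self, ci, cj):
        # An undirected edge is stored in both directions.
        self._adj.setdefault(ci, set()).add(cj)
        self._adj.setdefault(cj, set()).add(ci)

    def has_relation(self, ci, cj):
        return cj in self._adj.get(ci, set())

    def related(self, term):
        return sorted(self._adj.get(term, set()))

    def terms(self):
        return sorted(self._adj)


resource = SemanticResource()
resource.add_relation("aficionado", "fan")
resource.add_relation("aficionado", "enthusiast")
print(resource.related("aficionado"))
```

Because the relations are untyped, an edge carries no label: a synonym and a co-hyponym are stored the same way.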
Semantic Relation Extractors
We study extractors based on two components:
1 semantic similarity measures;
2 nearest neighbors procedures.
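Figure: Semantic relation extractor. Text-based data → feature extractor (features F) → similarity measure (S) → normalizer (S) → kNN procedure → semantic relations R between the terms C. The feature extractor, the similarity measure and the normalizer together form the semantic similarity measure.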
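The second component, the nearest neighbors procedure, can be sketched as follows. This is a minimal illustration, assuming the similarity scores are already available as a nested dictionary; the function name `knn_relations` and the toy scores are hypothetical.

```python
def knn_relations(sim, k):
    """Return the untyped relations linking each term to its k most similar terms.

    sim: dict term -> dict term -> similarity score (assumed precomputed).
    """
    relations = set()
    for ci, row in sim.items():
        # Rank the candidates by decreasing similarity, ignoring zero scores.
        candidates = sorted(
            ((score, cj) for cj, score in row.items() if cj != ci and score > 0),
            reverse=True,
        )
        for _, cj in candidates[:k]:
            # frozenset makes the relation undirected: {ci, cj} == {cj, ci}
            relations.add(frozenset((ci, cj)))
    return relations


# Toy similarity scores, invented for illustration only.
sim = {
    "doctor": {"physician": 0.9, "nurse": 0.6, "statute": 0.0},
    "physician": {"doctor": 0.9, "nurse": 0.5, "statute": 0.0},
    "nurse": {"doctor": 0.6, "physician": 0.5, "statute": 0.1},
    "statute": {"doctor": 0.0, "physician": 0.0, "nurse": 0.1},
}
relations = knn_relations(sim, k=1)
```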
Semantic Similarity Measures
Definition
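A semantic similarity measure quantifies the semantic relatedness of input
terms $c_i, c_j$ with the similarity score $s_{ij} = sim(c_i, c_j)$:
\[
s_{ij} =
\begin{cases}
\text{high} & \text{if } \langle c_i, c_j \rangle \text{ is a pair of synonyms, hypernyms or co-hyponyms} \\
0 & \text{otherwise}
\end{cases}
\]
Properties
Nonnegativity: $0 \leq s_{ij} \leq 1$;
Reflexivity: $s_{ij} = 1 \Leftrightarrow c_i = c_j$;
Symmetry: $s_{ij} = s_{ji}$;
Triangle inequality: $s_{ij} \leq s_{ik} + s_{kj}$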
Semantic Similarity Measures
Many dissimilar pairs, few similar pairs: $s_{ij} \sim \exp(\lambda)$:
Similarity distribution of the term “doctor”:
Evaluation of Semantic Similarity Measures
1 Correlations with human judgments:
Criterion: Pearson correlation (ρ) and Spearman correlation (r).
Datasets: MC, RG, WordSim.
2 Semantic relation ranking:
Criterion: Precision, Recall, F-measure.
Dataset: BLESS, SN.
3 Semantic relation extraction:
Criterion: Precision@k.
Data: annotation and/or dictionaries.
4 Application-based evaluation:
short text classification system (iCOP);
lexico-semantic search engine (Serelex).
Panchenko A. Similarity Measures for Semantic Relation
Extraction. PhD thesis. Université catholique de Louvain.
197 pages, 2013 (Chapter 1).
Correlations with human judgments
Semantic Relation Ranking
Precision P(k = 50%) = 6/7 ≈ 0.86 (6 of the 7 top-ranked pairs are correct relations)
word c_i      word c_j       relation type   s_ij
aficionado    enthusiast     syn             0.07197
aficionado    fan            syn             0.05195
aficionado    admirer        syn             0.01964
aficionado    addict         syn             0.01326
aficionado    devotee        syn             0.01163
aficionado    foundling      random          0.00777
aficionado    fanatic        syn             0.00414
aficionado    adherent       syn             0.00353
aficionado    capital        random          0.00232
aficionado    statute        random          0.00029
aficionado    blot           random          0.00025
aficionado    meddler        random          0.00005
aficionado    enlargement    random          0.00003
aficionado    bawdyhouse     random          0.00000
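The precision of this ranking can be recomputed from the table. The sketch below assumes that k is a fraction of the ranked list (here the top 50%, that is 7 of the 14 pairs); the function name `precision_at` is hypothetical.

```python
# (related word, relation type, similarity score) for the target "aficionado",
# copied from the table above.
ranked = [
    ("enthusiast", "syn", 0.07197), ("fan", "syn", 0.05195),
    ("admirer", "syn", 0.01964), ("addict", "syn", 0.01326),
    ("devotee", "syn", 0.01163), ("foundling", "random", 0.00777),
    ("fanatic", "syn", 0.00414), ("adherent", "syn", 0.00353),
    ("capital", "random", 0.00232), ("statute", "random", 0.00029),
    ("blot", "random", 0.00025), ("meddler", "random", 0.00005),
    ("enlargement", "random", 0.00003), ("bawdyhouse", "random", 0.00000),
]


def precision_at(ranked, fraction):
    """Share of non-random relations among the top `fraction` of the ranking."""
    by_score = sorted(ranked, key=lambda row: row[2], reverse=True)
    k = int(len(by_score) * fraction)
    top = by_score[:k]
    correct = sum(1 for _, relation, _ in top if relation != "random")
    return correct / k


print(round(precision_at(ranked, 0.5), 2))  # 0.86
```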
Related publications
This work stems from Hearst, M. A. Automatic acquisition of
hyponyms from large text corpora. In ACL, pages 539–545,
1992.
Selected publications:
Panchenko A., Morozova O., Naets H. A Semantic
Similarity Measure Based on Lexico-Syntactic Patterns.
In Proceedings of KONVENS 2012, pp.174–178, Vienna
(Austria), 2012
Panchenko A., Romanov P., Morozova O., Naets H.,
Philippovich A., Fairon C. Serelex: Search and
Visualization of Semantically Related Words. In
Proceedings of the 35th European Conference on Information
Retrieval (ECIR 2013), Moscow (Russia), 2013.
A live demo
http://serelex.cental.be/
Lexico-syntactic patterns
18 patterns that extract hypernyms, co-hyponyms and
synonyms
Patterns are encoded as FSTs
Finite State Transducers (FSTs)
Open source corpus processing tool Unitex:
http://igm.univ-mlv.fr/~unitex/
A pattern encoded as an FST
Take into account linguistic variation
Unlike string-based patterns (Bollegala et al., 2007)
Patterns extract concordances
such diverse {[occupations]} as {[doctors]},
{[engineers]} and {[scientists]}[PATTERN=1]
such {non-alcoholic [sodas]} as {[root beer]} and
{[cream soda]}[PATTERN=1]
{traditional[food]}, such as
{[sandwich]},{[burger]}, and {[fry]}[PATTERN=2]
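The concordance markup above can be read back with a couple of regular expressions. This is a sketch under stated assumptions: it treats the text in square brackets inside each `{...}` group as the extracted term, and it counts every pair of terms in one concordance as one co-occurrence. The names `parse_concordance` and `term_pairs` are hypothetical and are not part of Unitex or PatternSim.

```python
import re
from itertools import combinations

# A noun phrase in braces with its head in square brackets, e.g. {non-alcoholic [sodas]}
NOUN_PHRASE = re.compile(r"\{[^{}\[\]]*\[([^\[\]]+)\][^{}]*\}")
# The trailing pattern identifier, e.g. [PATTERN=1]
PATTERN_ID = re.compile(r"\[PATTERN=(\d+)\]")


def parse_concordance(line):
    """Return (list of extracted terms, pattern number or None)."""
    terms = NOUN_PHRASE.findall(line)
    match = PATTERN_ID.search(line)
    pattern = int(match.group(1)) if match else None
    return terms, pattern


def term_pairs(line):
    """All unordered pairs of terms that co-occur in one concordance."""
    terms, _ = parse_concordance(line)
    return list(combinations(terms, 2))


line = "such {non-alcoholic [sodas]} as {[root beer]} and {[cream soda]}[PATTERN=1]"
print(parse_concordance(line))
print(term_pairs(line))
```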
Corpus
Corpus Wikipedia+ukWaC: $2.9 \cdot 10^{12}$ tokens
Extracted concordances
Wikipedia: 1,196,468
ukWaC: 2,227,025
WaCypedia+ukWaC: 3,423,493
Reranking formula Efreq-Rnum-Cfreq-Pnum
\[
s_{ij} = \sqrt{p_{ij}} \cdot \frac{2 \cdot \mu_b}{b_{i*} + b_{*j}} \cdot \frac{P(c_i, c_j)}{P(c_i) P(c_j)}.
\]
$P(c_i, c_j) = \frac{e_{ij}}{\sum_{ij} e_{ij}}$ is the extraction probability of the pair $\langle c_i, c_j \rangle$; $e_{ij}$ is the frequency of co-occurrence of $c_i$ and $c_j$ in the concordances $K$
$P(c_i) = \frac{f_i}{\sum_i f_i}$ is the probability of the term $c_i$; $f_i$ is the frequency of $c_i$
$b_{i*} = \sum_{j : e_{ij} \geq \beta} 1$ is the number of extractions for the term $c_i$ with frequency $\geq \beta$; $\mu_b = \frac{1}{|C|} \sum_{i=1}^{|C|} b_{i*}$ is the average number of extractions per term
$p_{ij} \in [1; 18]$ is the number of distinct patterns which extracted the relation $\langle c_i, c_j \rangle$
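The reranking formula can be traced on a toy example. The sketch below is an assumption-laden illustration rather than the PatternSim implementation: pairs are stored once as unordered sets, the sum over $e_{ij}$ runs over those stored pairs, and all counts are invented.

```python
from math import sqrt


def rerank(e, f, p, beta=1):
    """Score each extracted pair with the Efreq-Rnum-Cfreq-Pnum formula.

    e: dict frozenset({ci, cj}) -> co-occurrence frequency e_ij
    f: dict term -> term frequency f_i
    p: dict frozenset({ci, cj}) -> number of distinct patterns p_ij
    """
    total_e = sum(e.values())
    total_f = sum(f.values())
    # b_i*: number of extractions of the term with frequency >= beta
    b = {c: sum(1 for pair, n in e.items() if c in pair and n >= beta) for c in f}
    mu_b = sum(b.values()) / len(f)  # average number of extractions per term
    scores = {}
    for pair, n in e.items():
        if n < beta:
            continue  # pairs below the threshold are not scored in this sketch
        ci, cj = sorted(pair)
        p_pair = n / total_e
        p_i, p_j = f[ci] / total_f, f[cj] / total_f
        scores[pair] = sqrt(p[pair]) * (2 * mu_b / (b[ci] + b[cj])) * p_pair / (p_i * p_j)
    return scores


# Invented toy counts for three terms.
e = {frozenset(("a", "b")): 4, frozenset(("a", "c")): 1}
f = {"a": 10, "b": 5, "c": 5}
p = {frozenset(("a", "b")): 2, frozenset(("a", "c")): 1}
scores = rerank(e, f, p)
```

The middle factor penalizes terms with many extractions, so a pair is not ranked highly only because one of its terms is extracted very often.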
Semantic Relation Ranking
Precision is comparable or better w.r.t. the baselines;
Recall is lower w.r.t. the baselines.
Figure : Precision-Recall graphs (the BLESS dataset).
Semantic Relation Extraction
Precision@1 ≈ 0.80;
“Good” coverage:
Related publications
Panchenko A. A Study of Heterogeneous Similarity
Measures for Semantic Relation Extraction. In
JEP-TALN-RECITAL 2012, Grenoble (France), 2012.
Panchenko A. Similarity Measures for Semantic Relation
Extraction. PhD thesis. Université catholique de Louvain.
197 pages, 2013 (Chapters 2.1, 3.1).
Compared Semantic Similarity Measures
37 distinct measures;
Q1: Are the measures complementary?
Q2: If yes, in which respects?
The Best Single Measures (MC, RG, WordSim, BLESS, SN)
Each one extracts many co-hyponyms, e.g.:
⟨Canon, Nikon⟩,
⟨Lamborghini, Ferrari⟩,
⟨Obama, Romney⟩.
Further Results
Most dissimilar measures
Figure: 21 measures grouped according to their relation distributions.
Measures are complementary w.r.t.:
lexical coverage;
performance;
types of semantic relations they extract.
Implementation of the baseline measures
Semantic Vectors:
https://code.google.com/p/semanticvectors/
S-Space Package:
https://code.google.com/p/airhead-research/
WordNet::Similarity:
http://wn-similarity.sourceforge.net
NLTK: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
WikiRelate!
PatternSim / Serelex: http://serelex.cental.be
Web-based metrics:
http://cwl-projects.cogsci.rpi.edu/msr
LSA: http://lsa.colorado.edu
Related publications
Panchenko A., Morozova O. A Study of Hybrid Similarity
Measures for Semantic Relation Extraction. Innovative
Hybrid Approaches to the Processing of Textual Data
Workshop, EACL 2012, Avignon (France), 2012, pp. 10–18.
Panchenko A. Similarity Measures for Semantic Relation
Extraction. PhD thesis. Université catholique de Louvain.
197 pages, 2013 (Chapter 4).
Panchenko A. A Study of Heterogeneous Similarity
Measures for Semantic Relation Extraction. In
JEP-TALN-RECITAL 2012, Grenoble (France), 2012, pp. 29–42.
Hybrid vs Single Measures
Figure: Semantic relation extractor based on:
(a) a single similarity measure: terms C and features → sim_i → S_i → norm → kNN → relations R;
(b) a hybrid similarity measure: terms C and features → sim_1, ..., sim_N → S_1, ..., S_N → norm → combination method → S_cmb → kNN → relations R.
16 Features = 16 Single Similarity Measures
5 network-based measures:
1 WuPalmer;
2 Leacock and Chodorow;
3 Resnik;
4 Jiang and Conrath;
5 Lin.
3 web-based measures (NGD-Yahoo/Bing/Google);
5 corpus-based measures:
2 distributional (BDA, SDA)
1 lexico-syntactic patterns (PatternSim)
2 other co-occurrence based (LSA, NGD-Factiva)
3 definition-based measures
1 ExtendedLesk;
2 GlossVectors;
3 DefVectors-WktWiki.
Unsupervised Combination Methods
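1 Mean: $s^{cmb}_{ij} = \frac{1}{K} \sum_{k=1}^{K} s^k_{ij}$;
2 Mean-Nnz: $s^{cmb}_{ij} = \frac{1}{|\{k : s^k_{ij} > 0,\ k = 1, \dots, K\}|} \sum_{k=1}^{K} s^k_{ij}$;
3 Mean-Zscore: $S^{cmb} = \frac{1}{K} \sum_{k=1}^{K} \frac{S_k - \mu_k}{\sigma_k}$;
4 Median: $s^{cmb}_{ij} = \mathrm{median}(s^1_{ij}, \dots, s^K_{ij})$;
5 Max: $s^{cmb}_{ij} = \max(s^1_{ij}, \dots, s^K_{ij})$;
6 RankFusion: $s^{cmb}_{ij} = \frac{1}{K} \sum_{k=1}^{K} r^k_{ij}$;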
7 RelationFusion (Panchenko and Morozova, 2012).
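Four of the unsupervised methods above (Mean, Mean-Nnz, Median, Max) can be sketched for a single pair of terms. This is a minimal illustration, assuming the K scores of the pair are already normalized and given as a list; the function name `combine` and the sample scores are hypothetical.

```python
from statistics import mean, median


def combine(scores, method="mean"):
    """Combine the K single-measure scores of one term pair into one score."""
    if method == "mean":
        return mean(scores)
    if method == "mean-nnz":
        # Average over the measures that returned a non-zero score only.
        nonzero = [s for s in scores if s > 0]
        return sum(scores) / len(nonzero) if nonzero else 0.0
    if method == "median":
        return median(scores)
    if method == "max":
        return max(scores)
    raise ValueError(f"unknown combination method: {method}")


# Three invented scores for one pair; the first measure does not cover the pair.
pair_scores = [0.0, 0.4, 0.8]
for method in ("mean", "mean-nnz", "median", "max"):
    print(method, combine(pair_scores, method))
```

Mean-Nnz differs from Mean when a measure has no coverage for the pair: the zero score is then left out of the denominator instead of pulling the average down.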
Supervised Combination Methods
8 Logit, Logit-L1, Logit-L2.
A binary logistic regression;
Positive examples – synonyms, hyponyms, co-hyponyms from
BLESS/SN;
Negative examples – random relations from BLESS/SN;
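A relation $\langle c_i, t, c_j \rangle \in R$ is represented with a vector of
pairwise similarities: $x = (s^1_{ij}, \dots, s^N_{ij})$, $N = 2, \dots, 16$;
Category $y_{ij}$:
\[
y_{ij} =
\begin{cases}
0 & \text{if } \langle c_i, t, c_j \rangle \text{ is a random relation} \\
1 & \text{otherwise}
\end{cases}
\]
Using the model $(w_1, \dots, w_K)$ for combination:
\[
s^{cmb}_{ij} = \frac{1}{1 + e^{-z}}, \quad z = \sum_{k=1}^{K} w_k s^k_{ij} + w_0.
\]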
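The Logit combination can be sketched in plain Python. This is an illustration only: it fits an unregularized logistic regression with batch gradient descent (the L1 and L2 variants are not shown), and the training vectors are invented rather than taken from BLESS or SN. The names `fit_logit` and `logit_score` are hypothetical.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def logit_score(x, w, w0):
    """Combined similarity: sigmoid of the weighted sum of single scores."""
    return sigmoid(w0 + sum(wk * sk for wk, sk in zip(w, x)))


def fit_logit(X, y, learning_rate=0.5, epochs=2000):
    """Fit (w, w0) by batch gradient descent on the logistic loss."""
    n = len(X[0])
    w, w0 = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_w0 = [0.0] * n, 0.0
        for x, target in zip(X, y):
            error = logit_score(x, w, w0) - target
            grad_w0 += error
            for k in range(n):
                grad_w[k] += error * x[k]
        w0 -= learning_rate * grad_w0 / len(X)
        w = [wk - learning_rate * g / len(X) for wk, g in zip(w, grad_w)]
    return w, w0


# Invented vectors of two single-measure scores; 1 = semantic relation, 0 = random pair.
X = [[0.9, 0.8], [0.7, 0.6], [0.1, 0.2], [0.0, 0.1]]
y = [1, 1, 0, 0]
w, w0 = fit_logit(X, y)
print(logit_score([0.8, 0.7], w, w0))
```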
9 SVM.
The weights $w$ and the support vectors $SV$:
\[
w = \sum_{x_i \in SV} \alpha_i y_i x_i.
\]
Using the model:
\[
s^{cmb}_{ij} = w^T x + b = \sum_{k=1}^{K} w_k s^k_{ij} + b.
\]
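For a linear kernel, the two formulas above amount to a weighted sum of the single scores. The sketch below assumes that the multipliers, labels and bias come from an already trained SVM; the values used here are invented, and the names `svm_weights` and `svm_score` are hypothetical.

```python
def svm_weights(support_vectors):
    """w = sum of alpha_i * y_i * x_i over the support vectors.

    support_vectors: list of (alpha_i, y_i, x_i) with y_i in {-1, +1}.
    """
    n = len(support_vectors[0][2])
    w = [0.0] * n
    for alpha, label, x in support_vectors:
        for k in range(n):
            w[k] += alpha * label * x[k]
    return w


def svm_score(x, w, b):
    """Combined similarity: signed distance-like score w^T x + b."""
    return sum(wk * sk for wk, sk in zip(w, x)) + b


# Invented support vectors and bias, for illustration only.
support_vectors = [(1.0, +1, [0.8, 0.6]), (1.0, -1, [0.2, 0.2])]
w = svm_weights(support_vectors)
b = -0.5
print(svm_score([0.9, 0.9], w, b), svm_score([0.1, 0.1], w, b))
```

Unlike the Logit score, this value is not bounded to [0, 1], so it would need normalization before being compared with other measures.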
Hybrid Similarity Measures
Precision-Recall graphs calculated on the BLESS dataset:
(a) 16 single measures and the best hybrid measure Logit-E15;
(b) 8 hybrid measures.
Hybrid Similarity Measure Logit-E15
Figure : Similarity scores between 74 words related to the word “acacia”.
Supervised Hybrid Similarity Measures
Supervised Hybrid Similarity Measures (cont.)
Figure : Meta-parameter optimization with the grid search of the
C-SVM-radial-E15 measure.
Lexico-Semantic Search Engine “Serelex”
Related publications
Panchenko A., Romanov P., Morozova O., Naets H.,
Philippovich A., Fairon C. Serelex: Search and
Visualization of Semantically Related Words. In
Proceedings of the 35th European Conference on Information
Retrieval (ECIR 2013), Moscow (Russia), 2013.
Panchenko A., Naets H., Brouwers L., Romanov P., Fairon C.
Recherche et visualisation de mots sémantiquement liés.
Actes de la 20e conférence sur le Traitement Automatique des
Langues Naturelles (TALN’2013). Les Sables d’Olonne,
France. pp. 747–754, 2013.
Search for Related Words: the List and the Graph
http://serelex.cental.be/
Search for Related Words: the Images
Evaluation of the Serelex
Figure : Users’ satisfaction with the top 20 results.
Filename Categorization System “iCOP”
Related publications
Panchenko A., Naets H., Beaufort R., Fairon C. Towards
Detection of Child Sexual Abuse Media: Classification of
the Associated Filenames. In Proceedings of the 35th
European Conference on Information Retrieval (ECIR 2013).
LNCS 7814, pp. 776-779. Springer-Verlag Berlin Heidelberg
2013.
Panchenko A., Beaufort R., Fairon C. Detection of Child
Sexual Abuse Media on P2P Networks: Normalization
and Classification of Associated Filenames. In
Proceedings of Workshop on Language Resources for Public
Security Applications of the 8th International Conference on
Language Resources and Evaluation (LREC), 2012
Short text classification with Vocabulary Projection
Evaluation of the Vocabulary Projection
Training Dataset             Test Dataset                 Accuracy   Accuracy (voc. projection)
Gallery (train)              Gallery                      96.41      96.83 (+0.42)
PirateBay Title+Desc+Tags    PirateBay Title+Desc+Tags    98.92      98.86 (-0.06)
PirateBay Title+Tags         PirateBay Title+Tags         97.73      97.63 (-0.10)
Gallery                      PirateBay Title+Desc+Tags    90.57      91.48 (+0.91)
Gallery                      PirateBay Title+Tags         84.23      88.89 (+4.66)
PirateBay Title+Desc+Tags    Gallery                      88.83      89.04 (+0.21)
PirateBay Title+Tags         Gallery                      91.16      91.30 (+0.14)
Table: Performance of a C-SVM linear classifier (10-fold cross
validation).
Thank you! Questions?