In the early 1990s, the term 'semantic' appeared in the context of text retrieval tools. However, the idea of semantics was there from the very beginning of Information Retrieval as a research field (i.e. the computer-assisted identification of relevant documents), as the articles of Vannevar Bush (As We May Think) and Luhn (The Automatic Creation of Literature Abstracts) from the 1940s and '50s show.
So where are we now in terms of semantics? The 'latent semantic indexing' of the 1990s faded away, and the first decade of the millennium enthusiastically studied semantic web technologies. Now, in the second decade, 'deep learning' is the new star. In this talk I will give a high-level overview of what has been done already, particularly in the context of the patent domain, what the main techniques are, and in which directions the scientific community is looking today. Ultimately, there will be no single answer to the question 'What is semantic search?'. Instead, my aim is to empower the audience to ask the right questions the next time somebody mentions the term.
II-SDV 2017: Semantic Search Jargon - A short Guide
1. Semantic Search Jargon – a short guide
Mihai Lupu
TU Wien / RSA Data Science
mihai.lupu@researchstudio.at
2. “Semantic”
▪ adjective
– dictionary.com: of, relating to, or arising from the different meanings of
words or other symbols
– Merriam-Webster: of or relating to the meanings of words and phrases
– Cambridge: connected with the meanings of words
– Oxford: connected with the meaning of words and sentences
7. The geometric metaphor of meaning
“Meanings are locations in a semantic
space, and semantic similarity is proximity
between the locations”
(Sahlgren, 2006)
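One common way to make this metaphor concrete is to represent each word as a vector and to measure proximity with cosine similarity. The following is a minimal sketch under that assumption; the three-dimensional vectors and the helper names (`cosine`, `nearest`, `space`) are invented for illustration and do not come from the talk or from Sahlgren's work.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "semantic space": each word is a location (vector).
space = {
    "crystal": [0.9, 0.1, 0.0],
    "crystals": [0.8, 0.2, 0.1],
    "market": [0.0, 0.2, 0.9],
}


def nearest(word, space):
    """Other words in the space, ranked from most to least similar to `word`."""
    target = space[word]
    scored = [(cosine(target, vec), other)
              for other, vec in space.items() if other != word]
    return sorted(scored, reverse=True)


ranking = nearest("crystal", space)
```

In this toy space, "crystals" ranks above "market" as a neighbour of "crystal" because its vector points in nearly the same direction.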
9. and others
[Figure: timeline by decade, “from counting to predicting”. Labels on the timeline:]
▪ pure counting, term frequency, position in sentence, SMART, IDF, cosine similarity, and many more
▪ Latent Semantic Analysis
▪ Random Indexing
▪ WWW appears
▪ Semantic Web appears
▪ Deep Learning: Speech, Vision, NLP, IR
▪ The Golden Age of Artificial Intelligence
▪ Expert Systems, Knowledge bases (e.g. Cyc)
▪ Inference on billions of tuples, on trillions
▪ Probabilistic models for IR
▪ Language Models
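Two of the counting-era ideas named on this slide, term frequency and IDF, can be sketched in a few lines. This is one common TF-IDF variant (raw count times log(N / document frequency)), not the specific weighting scheme of the SMART system; the function name `tf_idf` and the three toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = raw term frequency * log(N / document frequency),
    where N is the number of documents.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection.
docs = [
    "a wear plate in a recess",
    "a wedge in a pocket",
    "a spring biasing the wedge",
]
weights = tf_idf(docs)
```

A term that occurs in every document (here "a") gets weight zero, while a term confined to one document (here "plate") gets the highest IDF.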
10. where are we now?
▪ Inference directly from text
▪ [Bowman et al. 2016]
Example sentence pairs:
– A man rides a bike on a snow covered road / A man is outside
– 2 female babies eating chips / Two female babies are enjoying chips
– A man in an apron shopping at a market / A man in an apron is preparing dinner
Model                                                        % Accuracy
Feature-based classifier                                     78.2
Previous SOTA sentence encoder [Mou et al. 2016]             82.1
LSTM RNN sequence model                                      80.6
Tree LSTM                                                    80.9
SPINN                                                        83.2
SOTA (sentence pair alignment model) [Parikh et al. 2016]    86.8
11. where are we now?
Particular success cases:
Negation:
- The rhythmic gymnast completes her floor exercise at the competition
- The gymnast cannot finish her exercise
Long examples (>20 words):
- A man wearing glasses and a ragged costume is playing a Jaguar electric
guitar and singing with the accompaniment of a drummer
- A man with glasses and a disheveled outfit is playing a guitar and singing
along with a drummer.
12. Where are we for patents?
▪ Latent Semantic Indexing
– Some commercial systems claim to use it
▪ “Latent semantic analysis uses sophisticated statistical analysis of language to search on concepts, not just words, to help you find those documents - even if they don't contain any of the words you used in your search”
– Minimal improvements found in experiments
▪ [Moldovan:2005]
13. Random Indexing
▪ Initial experiments using the Semantic Vectors package
– Unsatisfactory results for document similarity
– Noticeably good results for term similarity
Term vectors
Document vectors
[Lupu et al.:2013]
14. Random Indexing
Nearest neighbours of “coatings” (similarity:term):
1.0:coatings
0.9999339:rubs
0.9999338:coating
0.9999328:acrylics
0.9999271:vinyls
0.9999268:cratering
0.9999251:distinctness
0.9999246:blistering
0.9999235:pompano
0.9999234:cyanamid
Nearest neighbours of “crystal” (similarity:term):
1.0:crystal
0.9999378:cyrstal
0.9999305:crytal
0.9999022:nicol // a type of prism
0.9999014:jjap
0.9999006:nicols
0.9998996:nematic // a type of liquid crystal
0.9998943:uniaxial // minerals that form crystals used in optics
0.9998894:cb15 // a particular liquid crystal
0.9998887:anisotropy
Nearest neighbours of “crystals” (similarity:term):
1.0:crystals
0.9998632:supersaturation
0.9998519:crystallizing
0.9998281:supersaturated
0.9998213:crys
0.9998193:purer
0.9998166:soda
0.9998120:crystallize
0.9998105:crystallizers
0.9998081:tals
[Lupu et al.:2013]
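The general idea behind Random Indexing can be sketched as follows: every context (here, a document) is assigned a sparse random index vector, and a term's vector is the sum of the index vectors of the contexts it occurs in. This is a simplified illustration under those assumptions, not the Semantic Vectors package's API or the setup used in the experiments; the function names, the parameters `dim` and `nonzero`, and the toy documents are all hypothetical.

```python
import math
import random
from collections import defaultdict


def index_vector(dim, nonzero, rng):
    """Sparse ternary vector: `nonzero` random positions set to +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def train_term_vectors(docs, dim=200, nonzero=10, seed=0):
    """Term vector = sum of the index vectors of the documents the term occurs in."""
    rng = random.Random(seed)
    term_vectors = defaultdict(lambda: [0] * dim)
    for doc in docs:
        doc_index = index_vector(dim, nonzero, rng)
        for term in doc.lower().split():
            vec = term_vectors[term]
            for i, x in enumerate(doc_index):
                vec[i] += x
    return dict(term_vectors)


def cosine(u, v):
    """Cosine similarity, 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbours(term, term_vectors, top=3):
    """The `top` terms most similar to `term` (the term itself included)."""
    target = term_vectors[term]
    scored = [(cosine(target, vec), other) for other, vec in term_vectors.items()]
    return sorted(scored, reverse=True)[:top]


# Hypothetical toy corpus: one short "document" per line.
docs = [
    "crystal nematic anisotropy",
    "crystal nematic uniaxial",
    "coating acrylics blistering",
    "coating vinyls blistering",
]
tv = train_term_vectors(docs)
top = neighbours("crystal", tv)
```

Terms that share all their contexts (here "crystal" and "nematic") end up with identical vectors, while terms from unrelated documents stay close to orthogonal because the random index vectors rarely overlap.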
19. documents are too large
Particular success cases:
Negation:
- The rhythmic gymnast completes her floor exercise at the competition
- The gymnast cannot finish her exercise
Long examples (>20 words):
- A man wearing glasses and a ragged costume is playing a Jaguar electric
guitar and singing with the accompaniment of a drummer
- A man with glasses and a disheveled outfit is playing a guitar and singing
along with a drummer.
20. words are too simple
“In a railroad car truck, a windowed side frame, a bolster extending
through the window, a wedge pocket in said bolster having an
upwardly and outwardly inclined floor in opposition to a vertical
wear surface on the side frame, a stabilizing wedge in the pocket
having a vertical friction surface in contact with the wear surface on
the side frame and an inclined wedging surface in opposition to the
floor of the pocket, a removable wear plate inset in a recess In said
inclined floor, said recess having a horizontal lower edge, said wear
plate having an inclined lower edge formed and adapted to engage
and be supported on said horizontal lower edge of said recess, said
wear plate being held in said recess by a weldment located
between the upper edge of said recess and the lower edge of said
wear plate, and, a spring biasing the wedge upwardly against the
removable wear plate to cam the wedge laterally against the wear
surface on the side frame.”
How much of the patent corpus is covered by the CELEX lexical database?
[Verberne et al., 2010]
          Patent data    COBUILD corpus
Tokens    96%            92%
Types     55%            (?)
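The difference between token coverage and type coverage can be illustrated with a small sketch. It assumes whitespace tokenisation and a lexicon given as a plain set of word forms; the function name `lexicon_coverage`, the sample text and the tiny lexicon are invented and unrelated to CELEX or to the data in [Verberne et al., 2010].

```python
from collections import Counter


def lexicon_coverage(tokens, lexicon):
    """Return (token_coverage, type_coverage) as fractions in [0, 1].

    Token coverage counts every occurrence of a word;
    type coverage counts each distinct word once.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0, 0.0
    covered_types = [t for t in counts if t in lexicon]
    token_cov = sum(counts[t] for t in covered_types) / sum(counts.values())
    type_cov = len(covered_types) / len(counts)
    return token_cov, type_cov


# Hypothetical sample text and lexicon.
tokens = "the wedge in the pocket and the bolster in the sideframe".split()
lexicon = {"the", "in", "and", "wedge", "pocket"}
token_cov, type_cov = lexicon_coverage(tokens, lexicon)
```

Because frequent function words are almost always in the lexicon, token coverage tends to be higher than type coverage; the rare domain-specific words are the ones that fall outside.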
22. words are too simple
Query Generation [Andersson:2016]
– Baseline, NLP (words, phrases) and statistical (unigrams, bigrams)
– Claims section or entire document
– Termhood
▪ Experiment to learn termhood, with two sample sets:
– 637 with C-value and 4,400 without C-value
▪ Upper bound (manual list) versus machine learning
▪ Skip-gram versus exact phrase
▪ Technical versus non-technical terms
24. Artificial Intelligence - Will it ever come?
a machine will pass the Turing test by 2029
(Kurzweil 1999, pp. 189-235.)
* The Turing Test does not
specify the use of patents
in the conversation
26. Glossary
▪ CBOW: Continuous Bag-of-Words
▪ DBpedia: automatically extracted knowledge resource from Wikipedia
▪ dimensionality reduction: any procedure that takes as input a vector of size N and outputs a vector of size M<N
▪ feed-forward: a particular type of neural network, which does not contain cycles between its neurons
▪ hypernym: a term denoting a broader category than another
▪ hyponym: a term denoting a narrower category than another
▪ LOD: Linked Open Data
▪ LSA: Latent Semantic Analysis
▪ LSI: Latent Semantic Indexing
▪ LSTM: Long Short-Term Memory
▪ matrix decomposition: a mathematical procedure to represent a matrix as the product of two or more matrices
▪ matrix factorization: matrix decomposition
▪ neural networks: an algorithmic model (loosely) simulating brain structures
▪ ontology: (here) a knowledge representation resource
▪ OWL: Web Ontology Language
▪ PCA: Principal Component Analysis
▪ PMI: Pointwise Mutual Information
▪ RDF: Resource Description Framework
▪ recurrent nn: a particular type of neural network, which contains cycles between its neurons
▪ RI: Random Indexing
▪ skip-grams: method to predict a context from a word
▪ SVD: Singular Value Decomposition
▪ WordNet: a large lexical database of English
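As an illustration of one glossary entry, Pointwise Mutual Information compares how often two words co-occur with how often they would co-occur if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The sketch below estimates the probabilities as fractions of documents, which is only one of several possible choices; the function name `pmi_table` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def pmi_table(docs):
    """PMI (base 2) for every pair of words that co-occur in at least one document.

    p(x) and p(x, y) are estimated as the fraction of documents
    containing the word or the pair. Keys are alphabetically ordered pairs.
    """
    n = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        for i, x in enumerate(words):
            for y in words[i + 1:]:
                pair_df[(x, y)] += 1
    return {
        (x, y): math.log2((c / n) / ((word_df[x] / n) * (word_df[y] / n)))
        for (x, y), c in pair_df.items()
    }


# Hypothetical toy collection.
docs = [
    "liquid crystal display",
    "liquid crystal",
    "liquid coating",
    "wear plate",
]
pmi = pmi_table(docs)
```

In this toy collection the pair ("plate", "wear") scores higher than ("crystal", "liquid"): both words of the first pair occur only together, whereas "liquid" also appears without "crystal".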