About me
Assistant Professor at University of Amsterdam (2021-)
• Information retrieval, neural networks, natural language processing,
(consumer) medical domain, personal knowledge graphs, …
Previously
• Senior Researcher (2018-2021) at Max Planck Institute for Informatics;
Postdoc at MPII before that (2016-2018)
• PhD in Computer Science from Georgetown University, Washington, DC
• BSc in Computer Science from Illinois Institute of Technology, Chicago
Book on Neural IR
(link on final slide)
Podcast on Neural IR
Available: Spotify, iTunes, …
Students and collaborators: Siddhant Arora, Arman Cohan, Jeffrey Dalton, Doug Downey, Sergey Feldman,
Nazli Goharian, Kevin Martin Jose, Jimmy Lin, Sean MacAvaney, Thong Nguyen, Wei Yang, Xinyu Zhang, …
Capreolus: A Toolkit for End-to-End
Neural Ad Hoc Retrieval
Neural? IR?
Focus on ad hoc retrieval, better known as “googling something”
➔ We all do this a lot, and it works incredibly well
➔ Given an arbitrary query, return ranked list of relevant “documents”
[Figure: the ad hoc retrieval setup. A query ("black bear attacks") is issued against a document collection; a ranking method returns an ordered list of documents (1., 2., 3., ...), which is scored with an evaluation metric (e.g., 0.66)]
Retrieval 101
• Crawl & index documents (text)
• Query time: look for exact matches in documents,
optionally combining with other signals (e.g., spam score)
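A minimal sketch of this index-and-lookup idea (toy documents, whitespace tokenization, and Python are my own illustrative choices; real systems also store positions, handle stemming, spam scores, etc.):

from collections import defaultdict

# Toy document collection
docs = {
    0: "black bear attacks reported in the park",
    1: "the stock market rallied today",
    2: "bear sightings are common in spring",
}

# Index time: build an inverted index from term -> set of doc ids
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query time: return docs containing any query term (exact matches only)
query = "bear attacks"
matches = set().union(*(index.get(t, set()) for t in query.split()))
print(matches)   # {0, 2}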
Retrieval 101
• Consider query terms that appear in document
• Weight them by discriminativeness (inverse document frequency)
and importance (term frequency)
[Formula annotations: an IDF component multiplied by a TF component with term saturation and length normalization, considering only terms that appear in both query & document]
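These annotations describe a BM25-style scoring function. The formula image itself did not survive extraction, so the following is a standard reconstruction rather than a copy of the slide (k_1 and b are BM25's usual free parameters, |D| the document length, avgdl the average document length in the collection):

score(Q, D) = \sum_{q_i \in Q \cap D} \underbrace{\log \frac{N - n_i + 0.5}{n_i + 0.5}}_{\text{IDF component}} \cdot \underbrace{\frac{TF(q_i, D)\,(k_1 + 1)}{TF(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}}_{\text{TF component: saturation and length normalization}}

The k_1 term saturates the contribution of repeated terms, and the b term normalizes for document length relative to the collection average.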
Retrieval 101
• Consider query terms that appear in document
• Weight them by discriminativeness (inverse document frequency)
and importance (term frequency)
General formula: sum, over exact matches of Q and D terms, of the Inverse Document Frequency of the query term (a collection stat) times the Term Frequency of q in document D (a document stat)

Improvement #1: soft matching of terms
Non-neural approaches: translation models, topic models like pLSI, ...
Improvement #2: less handcrafting (learn how to score)
Inverse Document Frequency of query term, where N = # docs and n_i = # docs with term q_i. IDF defined as:
• log(N / n_i)
• log(1 + N / n_i)
• log((N - n_i) / n_i)
• …
Term Frequency of q in document D. TF defined as:
• TF(q_i, D)
• 1 + log TF(q_i, D)
• 0.5 + 0.5 [TF(q_i, D) / max TF(D)]
• …
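To make the general formula concrete, here is a minimal sketch that scores a document with one IDF variant and one TF variant from the lists above (the toy collection and whitespace tokenization are illustrative assumptions, not part of the slides):

import math
from collections import Counter

def idf(term, docs):
    # One IDF variant from the list above: log(1 + N / n_i)
    n_i = sum(1 for d in docs if term in d)
    return math.log(1 + len(docs) / n_i) if n_i else 0.0

def tf(term, doc):
    # One TF variant from the list above: 1 + log TF(q_i, D)
    count = Counter(doc)[term]
    return 1 + math.log(count) if count else 0.0

def score(query, doc, docs):
    # General formula: sum of IDF * TF over terms in both query and document
    return sum(idf(q, docs) * tf(q, doc) for q in query if q in doc)

docs = [d.split() for d in [
    "black bear attacks reported in the park",
    "the stock market rallied today",
    "bear sightings and bear attacks are rare",
]]
print(score("black bear attacks".split(), docs[0], docs))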
Soft matching
Key point: neural methods replace exact matching with semantic matching
With traditional methods, soft matching is possible (but it’s less effective)
● Stemming handles matches like singular vs. plural and different tenses
swam, swimming, swim, swims replaced with swim
● Query expansion approaches handle context by adding related terms to query
Query contains bank, referring to turning a plane
Add context terms to query: flight, airplane, ailerons, spoilers, ...
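For example, a minimal stemming sketch using NLTK's Porter stemmer (assuming nltk is installed; the query and document are my own illustrations) shows how stemmed tokens match where exact tokens do not:

# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Reduce each token to its stem, e.g. swimming/swims -> swim
    return {stemmer.stem(tok) for tok in text.lower().split()}

query = "swimming lessons"
doc = "she swims every morning"

# Exact matching finds no overlapping terms, but stemmed matching does
print(stem_tokens(query) & stem_tokens(doc))   # {'swim'}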
Soft matching via embeddings
Neural approaches represent words as learned vectors
(Source: https://medium.com/@hari4om/word-embedding-d816f643140)

Soft matching via embeddings
Vector "embedding" allows comparisons of different terms
(Source: https://medium.com/@hari4om/word-embedding-d816f643140)

Soft matching via embeddings
Embeddings capture a general notion of semantic similarity
(Source: https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12)
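A minimal sketch of embedding-based soft matching (the 4-dimensional vectors below are made up for illustration; real embeddings have hundreds of learned dimensions):

import numpy as np

# Toy embeddings; in practice these are learned from data
emb = {
    "bank":     np.array([0.8, 0.1, 0.6, 0.0]),
    "loan":     np.array([0.7, 0.2, 0.5, 0.1]),
    "airplane": np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["bank"], emb["loan"]))      # high: related terms can soft-match
print(cosine(emb["bank"], emb["airplane"]))  # low: unrelated terms do not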
Embeddings can be learned automatically ("BERT")
[Figure: masked language modeling. The input "The bank makes loans [MASK] clients ." is tokenized ([CLS] and [SEP] markers added); each input token is represented as the sum of its Token, Segment, and Position Embeddings. A randomly-initialized large language model (>100 million weights) encodes the sequence, and a softmax over the vocabulary (a D x V projection) predicts the randomly masked token. Loss = -log (P("to" | masked input))]
Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
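Written as a formula, this is the standard masked language modeling loss (the single-mask case is from the slide; the sum over multiple masked positions is the usual generalization):

\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(t_i \mid \tilde{t}\right)

where \mathcal{M} is the set of randomly masked positions, t_i is the original token at position i, and \tilde{t} is the masked input. For the slide's example, \mathcal{M} contains a single position and the loss is -log P("to" | "The bank makes loans [MASK] clients .").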
Embeddings can be learned automatically
[Figure: the same masked language modeling setup as the previous slide, with a randomly-initialized large language model (>100 million weights)]
Advantages
● Automatically learn semantic representations of words!
● These embeddings capture general similarity
Disadvantages
● These embeddings capture general similarity
Neural IR: the good, the bad, and the ugly
• The good
• Big improvements in search quality
• Key ingredient: putting words in context
• Nice property: zero-shot ranking works
• The bad
• The ugly
Search quality improvements: industry

Google Search (source): "We're making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search."

MS Bing (source): "Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year."
Search quality improvements: academia
[Figure: Robust04 Dataset (news articles). Neural rankers (Yilmaz et al., 2019; MacAvaney et al., 2019; Li et al., 2020; Nogueira et al., 2020) improve over traditional baselines by roughly 8 points. Source: Yang et al., SIGIR 2019]

Search quality improvements: academia
[Figure: MS MARCO Passage Dataset (Web pages). 19-point improvement: BM25 vs. best neural; 8-point improvement: neural pre-LLM vs. LLM. Source: MS MARCO Leaderboard]
Neural IR: the good, the bad, and the ugly
• The good
• Big improvements in search quality
• Key ingredient: putting words in context
• Nice property: zero-shot ranking works
• The bad
• The ugly
Key ingredient: contextualization
Modern approaches use a LLM (BERT) to create contextualized embeddings
[Figure: contextualized embeddings compared for a relevant document vs. a nonrelevant document]
MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
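A minimal sketch of what "contextualized" means, using the HuggingFace transformers API (this is not the CEDR pipeline; the model name and sentences are my own illustrations): the same surface word gets a different vector depending on its sentence.

# Requires: pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual embedding of `word`'s first occurrence in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

money_bank = embedding_of("the bank approved the loan", "bank")
river_bank = embedding_of("we walked along the river bank", "bank")
loan_vec   = embedding_of("the bank approved the loan", "loan")

cos = torch.nn.functional.cosine_similarity
# The financial "bank" should sit closer to "loan" than to the river "bank"
print(cos(money_bank, loan_vec, dim=0), cos(money_bank, river_bank, dim=0))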
Key ingredient: contextualization
Score can be produced by placing document terms in similarity buckets,
then computing score based on the size of each bucket

Similarity buckets (document terms per bucket, for each query term):
Query term   sim = 1.0   0.6 < sim < 1.0   sim <= 0.6
Tesla        4 terms     3 terms           20 terms
net          1 term      8 terms           40 terms
worth        2 terms     2 terms           15 terms

Score(Tesla) = 4*a + 3*b + 20*c
Total score = Score(Tesla) + Score(net) + Score(worth)

MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
Guo, Fan, Ai, Croft. A Deep Relevance Matching Model for Ad-hoc Retrieval. CIKM 2016.
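A minimal sketch of this bucket-based scoring (the bucket boundaries mirror the slide; the weights a, b, c stand in for parameters that are learned in DRMM/CEDR-style models, and the similarity values below are made up):

import numpy as np

# Stand-in bucket weights; in practice these are learned, not hand-set
a, b, c = 2.0, 1.0, 0.1
buckets = [
    (lambda s: s >= 1.0, a),          # exact/identical match
    (lambda s: 0.6 < s < 1.0, b),     # soft match
    (lambda s: s <= 0.6, c),          # weak or no match
]

def query_term_score(similarities):
    # similarities: cosine similarities between one query term and all doc terms
    return sum(weight * sum(1 for s in similarities if in_bucket(s))
               for in_bucket, weight in buckets)

def total_score(sim_matrix):
    # One row of document-term similarities per query term
    return sum(query_term_score(row) for row in sim_matrix)

# Toy similarity matrix for the query "Tesla net worth" against a short document
sim_matrix = np.array([
    [1.0, 0.8, 0.2, 0.1],   # Tesla
    [0.7, 0.3, 0.2, 0.1],   # net
    [1.0, 0.9, 0.5, 0.3],   # worth
])
print(total_score(sim_matrix))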
Neural IR: the good, the bad, and the ugly
• The good
• Big improvements in search quality
• Key ingredient: putting words in context
• Nice property: zero-shot ranking works
• The bad
• The ugly
Zero-shot ranking
Zero-shot: “this is my first attempt at a task”
➔ We typically think of search as zero-shot
If Apple releases iDevice tomorrow, can immediately search for it
➔ In experiments, operationalized as “this is my first attempt at this dataset”
Zero-shot ranking
[Figure: a zero-shot neural method compared against traditional methods; training with up to 50% of in-domain data gives no improvement]
A Little Bit Is Worse Than None: Ranking with Limited Training Data. Zhang, Yates, Lin. SustaiNLP 2020.
Neural IR: the good, the bad, and the ugly
• The good
• The bad
• Which approach is best? Depends on what we measure
• Zero-shot effectiveness varies a lot
• The ugly
What should we measure?
Many categories of queries are 'solved' by current systems
• 121171: define etruscans (basic definition)
• 1113256: what is reba mcentire's net worth (simple factoid QA)
• 701453: what is a statutory deed (easy description)
How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
What should we measure?
But others are more difficult
• 1037798: who is robert gray (ambiguous)
• 10486: cost of interior concrete flooring (ambiguous and localized)
• 144862: did prohibition increased crime (complex reasoning)
• 507445: symptoms of different types of brain bleeds (concerns multiple entities)
How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.

Hard + Interesting
• 144862: did prohibition increased crime (complex reasoning)
• 507445: symptoms of different types of brain bleeds (concerns multiple entities)
How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
What makes a query difficult & interesting?
Criteria
• Non-Factoid: Not answerable by a single short answer
• Beyond single passage: More than a simple definition
• Answerable: Answerable solely from the provided context
• Text-focused: Non-text answers or calculations should be excluded
• Mostly well-formed: Does not contain clear spelling or language errors
• Possibly Complex: Multiple entities, comparison, complex reasoning, or multiple answers
[Figure: annotation pipeline. Queries and runs from the TREC Deep Learning Collection are annotated with entity linking, query annotations, and web search engine annotations]
The query type matters

Query type     TREC Deep Learning Collection   DL-HARD Collection
Factoid        41%                             8%
List           10%                             31%
Long Answer    18%                             27%
Web Search     24%                             35%

How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
The query type matters
[Same query-type distribution as the previous slide]
How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
Swaps in system rankings
● rank 1 → rank 4
● rank 13 → rank 5
● bottom → bottom + 1
● etc.
Neural IR: the good, the bad, and the ugly
• The good
• The bad
• Which approach is best? Depends on what we measure
• Zero-shot effectiveness varies a lot
• The ugly
Zero-shot: reranking > dense retrieval
Instead of comparing terms, the entire query/document is represented by a single vector
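A minimal sketch of dense retrieval scoring (random vectors stand in for the output of a trained encoder such as BERT): one dot product per document, so everything the model knows about a document must fit into a single vector.

import numpy as np

# Stand-in vectors; in practice an encoder produces them
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 128))    # pre-computed document index
query_vector = rng.normal(size=128)           # encoded at query time

# Scoring is a single dot product per document: fast, but coarse
scores = doc_vectors @ query_vector
top10 = np.argsort(-scores)[:10]
print(top10)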
Neural IR: the good, the bad, and the ugly
• The good
• The bad
• The ugly
• General similarity: prone to capturing bias and prone to
using shortcuts (which do not generalize)
• Little control over what knowledge is stored inside the neural model
What's inside a dense representation?
[Figure: the query "Tesla's net worth" is encoded into a dense query vector (0.0, 0.2, 0.3, 2.5, 1.7, 0.8, …); a converter maps the vector into term weights: tesla 32, worth 23, net 20, price 18, car 11, …]

Query: what is the gender of that doctor?
Term:    gender  doctor  who  women  dr  medical  her  pregnancy  …  birth  baby  number
Weight:  112     96      71   67     60  55       52   43         …  28     24    7

Ongoing work with Thong Nguyen, IRLab, UvA.
Neural IR: the good, the bad, and the ugly
• The good
• The bad
• The ugly
• General similarity: prone to capturing bias and prone to
using shortcuts (which do not generalize)
• Little control over what knowledge is stored inside the neural model
What's inside a large language model?
Language Models as Knowledge Bases? Petroni, Rocktaeschel, Lewis, Bakhtin, Wu, Miller, Riedel. EMNLP 2019.
Language Models As or For Knowledge Bases. Razniewski, Yates, Kassner, Weikum. DL4KG 2021.
Probe a LLM’s word predictions to see how it behaves
➔ Alan Turing was born in [MASK]
Advanced LLMs are often correct, but fundamental issues remain
• Epistemic: Probabilities ≠ knowledge
  • Birthplace(Biden) – Scranton (83%), New Castle (0.2%)
  • Birth country (Bengio) – France (3.5%), Canada (3.2%)
• Falls back to correlations
  • Are all Greeks born in Athens?
  • Are all Panagiotis' Greek?
• Unaware of limits
  • Microsoft was acquired by … Novell (84%)
  • Sun Microsystems was acquired by … Oracle (84%)
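A minimal sketch of this kind of probing with the HuggingFace fill-mask pipeline (the model choice is my own illustration, and its predictions and probabilities will differ from the numbers on the slide):

# Requires: pip install transformers torch
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Probe the model's factual "knowledge" through its word predictions
for prompt in [
    "Alan Turing was born in [MASK].",
    "Microsoft was acquired by [MASK].",   # no correct completion exists
]:
    print(prompt)
    for pred in fill(prompt, top_k=3):
        print(f"  {pred['token_str']}: {pred['score']:.2%}")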
Conclusion
• The good: neural methods have significantly improved ranking quality
• The bad: so far, no fast & effective "off the shelf" approach for general use
• The ugly: is the bad caused by fundamental limitations of this paradigm?
Podcast on Neural IR
Available: Spotify, iTunes, …
https://arxiv.org/abs/2010.06467
https://bit.ly/tr4tr-book
Thanks!