AFIRM: ACM SIGIR/SIGKDD Africa Summer School on Machine Learning for Data Mining and Search
Learning to Rank with Neural Networks
Instructors
Bhaskar Mitra, Microsoft & University College London, Canada
Nick Craswell, Microsoft, USA
Emine Yilmaz, University College London
Daniel Campos, Microsoft, USA
January 2020
The Instructors
BHASKAR MITRA
Microsoft & UCL, Canada
bmitra@microsoft.com
@underdoggeek
NICK CRASWELL
Microsoft, USA
nickcr@microsoft.com
@nick_craswell
EMINE YILMAZ
UCL & Microsoft, Canada
emine.yilmaz@ucl.ac.uk
@xxEmineYilmazxx
DANIEL CAMPOS
Microsoft, USA
dacamp@microsoft.com
@spacemanidol
RESOURCES
Download the slides:
http://bit.ly/ltr-nn-afirm2020
Download the free book:
http://bit.ly/neuralir-intro
Download the lab exercises:
https://github.com/spacemanidol/AFIRMDeepLearning2020
Download TREC Deep Learning Track data:
https://microsoft.github.io/TREC-2019-Deep-Learning/
AGENDA
NEURAL NETWORKS
45 MINS
VECTORS, MATRICES,
AND TENSORS
Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
matrix transpose matrix addition
dot product matrix multiplication
SUPERVISED LEARNING
Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
NEURAL NETWORKS
A simple neural network transforms an input feature vector to
produce an output vector by applying a sequence of parameterized
linear transforms (e.g., matrix multiply with weights, add bias
vector) and element-wise non-linear transforms (e.g., tanh, relu)
The parameters are trained using gradient descent to minimize
some loss function specified over predicted and expected outputs
Many choices of architecture and hyper-parameters
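[Figure: forward pass from Input through Linear transform, Non-linearity, Linear transform, Non-linearity to Predicted output; the loss compares Predicted output with Expected output and drives the backward pass. Example non-linearities: Tanh, ReLU]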
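A minimal sketch of the forward pass described on this slide, in plain Python. The layer sizes and weight values are made up for illustration and are not taken from the lecture.

```python
import math

def linear(x, weights, bias):
    """Linear transform: multiply by a weight matrix and add a bias vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def tanh(v):
    """Element-wise tanh non-linearity."""
    return [math.tanh(z) for z in v]

def relu(v):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, z) for z in v]

def forward(x, layers):
    """Apply each (weights, bias, activation) triple in sequence."""
    for weights, bias, activation in layers:
        x = activation(linear(x, weights, bias))
    return x

# Hypothetical network: 2 inputs -> 3 hidden units (ReLU) -> 1 output (tanh).
layers = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1], relu),
    ([[1.0, -1.0, 0.5]], [0.2], tanh),
]
output = forward([1.0, 2.0], layers)
print(output)
```

Training would then adjust the weights and biases by gradient descent on a loss computed from `output` and the expected output.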
FUNDAMENTAL
MACHINE
LEARNING TASKS
SQUARED LOSS
The squared loss is a popular loss function for regression tasks
THE SOFTMAX FUNCTION
In neural classification models, the softmax function is popularly used to normalize
the neural network output scores across all the classes
CROSS ENTROPY
The cross entropy between two probability
distributions $p$ and $q$ over a discrete set of
events is given by,
$CE(p, q) = -\sum_i p_i \log(q_i)$
If $p_{correct} = 1$ and $p_i = 0$ for all
other values of $i$ then,
$CE(p, q) = -\log(q_{correct})$
CROSS ENTROPY WITH
SOFTMAX LOSS
Cross entropy with softmax is a popular loss
function for classification
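The softmax and cross-entropy definitions can be sketched in a few lines of plain Python. The function names and the example scores are illustrative choices; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution over classes."""
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Cross entropy between target distribution p and predicted distribution q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def softmax_cross_entropy(scores, correct):
    """Loss when all target mass is on the class with index `correct`."""
    return -math.log(softmax(scores)[correct])

scores = [2.0, 1.0, 0.1]   # hypothetical network outputs for 3 classes
probs = softmax(scores)
loss_full = cross_entropy([1, 0, 0], probs)
loss_short = softmax_cross_entropy(scores, 0)
print(probs, loss_full, loss_short)
```

With a one-hot target, the general cross entropy and the shortcut `-log(q_correct)` agree, which is why classification code usually computes only the latter.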
STOCHASTIC GRADIENT DESCENT (SGD)
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Forward pass: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, $l = (y - y_2)^2$
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$
\begin{aligned}
\frac{\partial l}{\partial w_1}
&= \frac{\partial l}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1} \\
&= \frac{\partial (y - y_2)^2}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1} \\
&= -2 \times (y - y_2) \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1} \\
&= -2 \times (y - y_2) \times \frac{\partial \tanh(w_2 \cdot y_1 + b_2)}{\partial y_1} \times \frac{\partial y_1}{\partial w_1} \\
&= -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial y_1}{\partial w_1} \\
&= -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial \tanh(w_1 \cdot x + b_1)}{\partial w_1} \\
&= -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \left(1 - \tanh^2(w_1 \cdot x + b_1)\right) \times x
\end{aligned}
$$
Update the parameter value based on the gradient with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
…and repeat
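As a sanity check on the chain-rule derivation for $w_1$, the sketch below evaluates the closed-form gradient for the 1-feature, 1-hidden-node network, compares it with a finite-difference estimate, and applies one SGD update. The training pair, initial parameters, and learning rate are arbitrary illustrative values.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Forward pass of the tiny network: returns (y1, y2)."""
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    return y1, y2

def loss(x, y, w1, b1, w2, b2):
    """Squared loss between the target y and the prediction y2."""
    _, y2 = forward(x, w1, b1, w2, b2)
    return (y - y2) ** 2

def grad_w1(x, y, w1, b1, w2, b2):
    """Closed-form dl/dw1; note tanh^2(w2*y1+b2) = y2^2 and tanh^2(w1*x+b1) = y1^2."""
    y1, y2 = forward(x, w1, b1, w2, b2)
    return -2 * (y - y2) * (1 - y2 ** 2) * w2 * (1 - y1 ** 2) * x

# Hypothetical training pair and initial parameters.
x, y = 0.5, 0.8
w1, b1, w2, b2 = 0.3, 0.1, -0.4, 0.2
eta = 0.1  # learning rate

analytic = grad_w1(x, y, w1, b1, w2, b2)
eps = 1e-6
numeric = (loss(x, y, w1 + eps, b1, w2, b2) - loss(x, y, w1 - eps, b1, w2, b2)) / (2 * eps)

old_loss = loss(x, y, w1, b1, w2, b2)
w1_new = w1 - eta * analytic          # one SGD update for w1 only
new_loss = loss(x, y, w1_new, b1, w2, b2)
print(analytic, numeric, old_loss, new_loss)
```

In practice the same update is applied to every learnable parameter ($b_1$, $w_2$, $b_2$ as well), one training sample or mini-batch at a time.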
COMPUTATION
NETWORKS
The “Lego” approach to specifying neural architectures
Library of neural layers, each layer defines logic for:
1. Forward pass: compute layer output given layer input
2. Backward pass:
a) compute gradient of layer output w.r.t. layer inputs
b) compute gradient of layer output w.r.t. layer parameters (if any)
Chain nodes to create bigger and more complex networks
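A minimal sketch of the "Lego" idea using scalar layers in plain Python. The class names and the `run` helper are hypothetical and not taken from any toolkit. Each `backward` here multiplies the layer's local gradient by the gradient arriving from the layer above, which is how the chain rule is usually applied in practice.

```python
import math

class Linear:
    """Scalar linear layer: out = w * x + b."""
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, x):
        self.x = x                      # cache input for the backward pass
        return self.w * x + self.b
    def backward(self, grad_out):
        self.grad_w = grad_out * self.x  # gradient w.r.t. parameter w
        self.grad_b = grad_out           # gradient w.r.t. parameter b
        return grad_out * self.w         # gradient w.r.t. layer input

class Tanh:
    """Parameter-free non-linearity."""
    def forward(self, x):
        self.out = math.tanh(x)
        return self.out
    def backward(self, grad_out):
        return grad_out * (1 - self.out ** 2)

class SquaredLoss:
    def forward(self, pred, target):
        self.diff = pred - target
        return self.diff ** 2
    def backward(self):
        return 2 * self.diff

def run(layers, loss_fn, x, y):
    """Forward through the chain, then backward in reverse order."""
    out = x
    for layer in layers:
        out = layer.forward(out)
    value = loss_fn.forward(out, y)
    grad = loss_fn.backward()
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return value

net = [Linear(0.3, 0.1), Tanh(), Linear(-0.4, 0.2), Tanh()]
loss_value = run(net, SquaredLoss(), x=0.5, y=0.8)
print(loss_value, net[0].grad_w)
```

Because every layer only knows its own forward and backward logic, bigger networks are built by appending more layers to the list.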
TOOLKITS
A diverse set of options
to choose from!
Figure from https://towardsdatascience.com/battle-of-the-deep-learning-frameworks-part-i-cff0e3841750
TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH
First, we define the model
architecture
Next, we specify loss function and
optimization algorithm
Finally, loop over training data to
optimize model parameters
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
REALLY DEEP
NEURAL NETWORKS
(Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
VISUAL
MOTIVATION FOR
HIDDEN LAYERS
Consider the following “toy” challenge for
classifying tech queries:
Vocab: {surface, kerberos, book, library}
Labels:
“surface book”, “kerberos library” ✓
“kerberos surface”, “library book” ✗
Input features                        Label
surface  kerberos  book  library
1        0         1     0            ✓
1        1         0     0            ✗
0        1         0     1            ✓
0        0         1     1            ✗
can’t separate using a linear model!
But let’s consider a tiny neural
network with one hidden layer…
[Figure: network with input nodes surface, kerberos, book, library and hidden nodes H1, H2; edges are labeled with weights +0.5, -1, and +1]
Or more succinctly…
Input features                        Hidden layer  Label
surface  kerberos  book  library      H1  H2
1        0         1     0            1   0         ✓
1        1         0     0            0   0         ✗
0        1         0     1            0   1         ✓
0        0         1     1            0   0         ✗
can separate using a linear model!
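The toy example can be checked with a short script. The mapping of weights to edges is an assumed reading of the figure (each hidden node receives +0.5 from its two matching terms and -1 from the other two, and each hidden node feeds the output with +1). The ReLU activation and the 0.5 decision threshold are also assumptions.

```python
def relu(z):
    return max(0.0, z)

# Assumed weights: H1 responds to "surface book", H2 to "kerberos library".
H1_WEIGHTS = {"surface": 0.5, "book": 0.5, "kerberos": -1.0, "library": -1.0}
H2_WEIGHTS = {"kerberos": 0.5, "library": 0.5, "surface": -1.0, "book": -1.0}

def hidden(query):
    """Hidden-layer activations (H1, H2) for a whitespace-tokenized query."""
    terms = query.split()
    h1 = relu(sum(H1_WEIGHTS[t] for t in terms))
    h2 = relu(sum(H2_WEIGHTS[t] for t in terms))
    return h1, h2

def classify(query):
    """Linear model on top of the hidden layer: output = 1*H1 + 1*H2."""
    h1, h2 = hidden(query)
    return (1.0 * h1 + 1.0 * h2) > 0.5

for q in ["surface book", "kerberos surface", "kerberos library", "library book"]:
    print(q, hidden(q), classify(q))
```

In the raw input space no linear model works: the two positive feature vectors and the two negative feature vectors both sum to (1, 1, 1, 1), so any linear score assigns the same total to each pair and cannot place both positives above both negatives.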
WHY ADDING DEPTH HELPS
http://playground.tensorflow.org
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
THE LOTTERY
TICKET HYPOTHESIS
BIAS-VARIANCE
TRADE-OFF
https://medium.com/@akgone38/what-the-heck-bias-variance-tradeoff-is-fe4681c0e71b
x: 20 samples in range [0, 10)
array([0.90958323, 1.92243063, 2.08584585, 2.20797776, 2.67748774, 2.74804427, 3.30582528, 4.61371217,
4.82180332, 5.05056425, 5.17943809, 5.24673789, 5.25498203, 6.51998081, 6.69507593, 7.41185813,
8.30728588, 8.49480071, 8.51663415, 9.65215509])
y = x**2 + noise
array([-6.98410155, -2.72058483, 1.83675969, 7.01024352, 8.57596003, 4.33476856, 18.5227248 , 23.40644589,
24.19983813, 20.17728703, 21.67505609, 28.50143303, 39.2529178 , 38.61850753, 50.06987467, 54.84458601,
75.04156323, 64.05990929, 69.81111628, 93.59637334])
BIAS-VARIANCE TRADE-OFF IN THE
DEEP LEARNING ERA
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
MOST IR SYSTEMS PRESENT
RANKED LISTS OF RETRIEVED
INFORMATION ARTIFACTS
THE UNREASONABLE EFFECTIVENESS
OF SIMPLE LTR BASED APPROACHES
LEARNING TO
RANK (LTR)
“... the task to automatically construct a ranking
model using training data, such that the model
can sort new objects according to their degrees
of relevance, preference, or importance.”
- Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
Phase 1 Phase 2
LEARNING TO
RANK (LTR)
L2R models represent a rankable item—e.g.,
a document—given some context—e.g., a
user-issued query—as a numerical vector
$\vec{x} \in \mathbb{R}^n$
The ranking model 𝑓: 𝑥 → ℝ is trained to
map the vector to a real-valued score such
that relevant items are scored higher.
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
Phase 1 Phase 2
WHY IS RANKING CHALLENGING?
Ideally: Train a machine learning model to optimize for a rank-based metric
Challenge: Rank-based metrics, such as DCG or MRR, are non-smooth / non-differentiable
WHY IS RANKING CHALLENGING?
Examples of ranking metrics
Discounted Cumulative Gain (DCG)
$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$
Reciprocal Rank (RR)
$RR@k = \max_{1 \le i \le k} \frac{rel_i}{i}$
Rank based metrics, such as DCG and MRR, are non-smooth / non-differentiable
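A small sketch of the two metrics, assuming the standard forms $DCG@k = \sum_{i=1}^{k} (2^{rel_i} - 1) / \log_2(i + 1)$ and $RR@k = \max_{1 \le i \le k} rel_i / i$ with binary labels for RR. It also illustrates the non-smoothness: nudging a score without changing the order leaves the metric unchanged, so the metric gives no gradient signal. The labels and scores are made up.

```python
import math

def dcg_at_k(rels, k):
    """DCG over the top-k relevance labels, listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    """Reciprocal rank with binary labels: 1/rank of the first relevant item."""
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

def ranked_labels(scores, labels):
    """Sort labels by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

labels = [0, 1, 0, 1]
scores_a = [0.9, 0.80, 0.3, 0.1]
scores_b = [0.9, 0.85, 0.3, 0.1]   # slightly different scores, same order

dcg_a = dcg_at_k(ranked_labels(scores_a, labels), 4)
dcg_b = dcg_at_k(ranked_labels(scores_b, labels), 4)
rr_a = rr_at_k(ranked_labels(scores_a, labels), 4)
print(dcg_a, dcg_b, rr_a)
```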
FEATURES
Traditional L2R models employ
hand-crafted features that
encode IR insights
They can often be categorized as:
Query-independent or static features
e.g., incoming link count and document length
Query-dependent or dynamic features
e.g., BM25
Query-level features
e.g., query length
FEATURES
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
APPROACHES
Liu [2009] categorizes
different LTR approaches
based on training objectives:
Pointwise approach
Relevance label $y_{q,d}$ is a number, derived from binary or graded human
judgments or implicit user feedback (e.g., CTR). Typically, a regression or
classification model is trained to predict $y_{q,d}$ given $\vec{x}_{q,d}$.
Pairwise approach
Pairwise preference between documents for a query ($d_i \succ d_j$ w.r.t. $q$) as
label. Reduces to binary classification to predict the more relevant document.
Listwise approach
Directly optimize for a rank-based metric, such as NDCG. This is difficult because
these metrics are often not differentiable w.r.t. model parameters.
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
POINTWISE
OBJECTIVES
Regression loss
Given $\langle q, d \rangle$ predict the value of $y_{q,d}$
e.g., square loss for binary or categorical labels,
$\mathcal{L}_{squared} = \lVert y_{q,d} - f(\vec{x}_{q,d}) \rVert^2$
where $y_{q,d}$ is the one-hot representation [Fuhr, 1989] or the
actual value [Cossock and Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
[Figure: labels vs. prediction]
POINTWISE
OBJECTIVES
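Classification loss
Given $\langle q, d \rangle$ predict the class $y_{q,d}$
e.g., cross-entropy with softmax over
categorical labels $Y$ [Li et al., 2008],
$\mathcal{L}_{CE}(q, d, y_{q,d}) = -\log p(y_{q,d} \mid q, d) = -\log \frac{e^{\gamma \cdot s_{y_{q,d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_y}}$
where $s_{y_{q,d}}$ is the model’s score for label $y_{q,d}$
[Figure: labels vs. prediction, on a 0 to 1 axis]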
Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
PAIRWISE
OBJECTIVES
Pairwise loss minimizes the average number of
inversions in ranking, i.e., $d_i \succ d_j$ w.r.t. $q$ but $d_j$ is
ranked higher than $d_i$
Given $\langle q, d_i, d_j \rangle$, predict the more relevant document
For $\langle q, d_i \rangle$ and $\langle q, d_j \rangle$,
Feature vectors: $\vec{x}_i$ and $\vec{x}_j$
Model scores: $s_i = f(\vec{x}_i)$ and $s_j = f(\vec{x}_j)$
Pairwise loss generally has the following form [Chen et al., 2009],
$\mathcal{L}_{pairwise} = \phi(s_i - s_j)$
where $\phi$ can be,
• Hinge function $\phi(z) = \max(0, 1 - z)$ [Herbrich et al., 2000]
• Exponential function $\phi(z) = e^{-z}$ [Freund et al., 2003]
• Logistic function $\phi(z) = \log(1 + e^{-z})$ [Burges et al., 2005]
• Others…
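A sketch of the three $\phi$ choices, assuming the pairwise loss takes the common form $\phi(s_i - s_j)$ where $d_i$ is the document that should be ranked higher. The function names and scores are illustrative.

```python
import math

def hinge(z):
    """phi(z) = max(0, 1 - z)"""
    return max(0.0, 1.0 - z)

def exponential(z):
    """phi(z) = exp(-z)"""
    return math.exp(-z)

def logistic(z):
    """phi(z) = log(1 + exp(-z))"""
    return math.log(1.0 + math.exp(-z))

def pairwise_loss(s_i, s_j, phi):
    """Loss for a pair where d_i is preferred over d_j (assumed form phi(s_i - s_j))."""
    return phi(s_i - s_j)

# Correctly ordered pair (s_i > s_j) vs. an inversion (s_i < s_j).
for phi in (hinge, exponential, logistic):
    print(phi.__name__, pairwise_loss(2.0, 0.5, phi), pairwise_loss(0.5, 2.0, phi))
```

All three functions decrease as the score margin $s_i - s_j$ grows, so each penalizes inversions more than correctly ordered pairs. Only the hinge reaches exactly zero once the margin is at least 1.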
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
PAIRWISE
OBJECTIVES
RankNet loss
Pairwise loss function proposed by Burges et al. [2005]—an industry favourite
[Burges, 2015]
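Predicted probabilities: $p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma (s_i - s_j)}}$
Desired probabilities: $\bar{p}_{ij} = 1$ and $\bar{p}_{ji} = 0$
Computing cross-entropy between $\bar{p}$ and $p$
$\mathcal{L}_{RankNet} = -\bar{p}_{ij} \log(p_{ij}) - \bar{p}_{ji} \log(p_{ji}) = -\log(p_{ij}) = \log(1 + e^{-\gamma (s_i - s_j)})$
[Figure: pairwise preference vs. score, on a 0 to 1 axis]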
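A minimal sketch of the RankNet probability and loss for a single pair, assuming $d_i$ is the preferred document. The default `gamma=1.0` and the example scores are arbitrary.

```python
import math

def ranknet_prob(s_i, s_j, gamma=1.0):
    """Predicted probability that d_i should rank above d_j."""
    return 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))

def ranknet_loss(s_i, s_j, gamma=1.0):
    """Cross entropy against the desired probabilities (1 for i over j, 0 for j over i)."""
    return math.log(1.0 + math.exp(-gamma * (s_i - s_j)))

s_i, s_j = 1.2, 0.4   # hypothetical model scores
p_ij = ranknet_prob(s_i, s_j)
p_ji = ranknet_prob(s_j, s_i)
loss = ranknet_loss(s_i, s_j)
print(p_ij, p_ji, loss)
```

The two probabilities sum to one, and the loss equals $-\log(p_{ij})$, which is the logistic pairwise loss applied to the scaled score difference.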
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
A GENERALIZED CROSS-ENTROPY LOSS
An alternative loss function assumes a single relevant document 𝑑+ and compares it
against the full collection 𝐷
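Predicted probabilities: $p(d^+ \mid q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$
The cross-entropy loss is then given by,
$\mathcal{L}_{CE}(q, d^+, D) = -\log p(d^+ \mid q) = -\log \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$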
Computing the softmax over the full collection is prohibitively expensive—LTR models
typically consider few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
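A sketch of this loss computed over one positive document and a handful of sampled negatives instead of the full collection. The function name, `gamma` default, and scores are illustrative. The log-sum-exp shift is a standard numerical-stability device.

```python
import math

def sampled_cross_entropy(pos_score, neg_scores, gamma=1.0):
    """-log softmax probability of the positive among {positive} + sampled negatives."""
    logits = [gamma * pos_score] + [gamma * s for s in neg_scores]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]

loss_4neg = sampled_cross_entropy(2.0, [0.5, 0.1, -0.3, 1.0])
loss_1neg = sampled_cross_entropy(2.0, [0.5])
loss_0neg = sampled_cross_entropy(2.0, [])
print(loss_4neg, loss_1neg, loss_0neg)
```

With exactly one negative the expression reduces to $\log(1 + e^{-\gamma (s^+ - s^-)})$, the same logistic form as the RankNet loss.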
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LISTWISE
OBJECTIVES
[Figure from Burges, 2010: two rankings of the same documents. Blue: relevant, Gray: non-relevant]
NDCG and ERR are higher for the left ranking, but the right ranking has
fewer pairwise errors
Due to strong position-based discounting in
IR measures, errors at higher ranks are much
more problematic than at lower ranks
But listwise metrics are non-continuous and
non-differentiable
Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010.
LISTWISE
OBJECTIVES
Burges et al. [2006] make two observations:
1. To train a model we don’t need the costs
themselves, only the gradients (of the costs
w.r.t. model scores)
2. It is desired that the gradient be bigger for
pairs of documents that produce a bigger
impact in NDCG by swapping positions
LambdaRank loss
Multiply actual gradients with the change in
NDCG by swapping the rank positions of the
two documents
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
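A rough sketch of the LambdaRank idea, under these assumptions: the "actual gradient" is the RankNet gradient of $\log(1 + e^{-\gamma (s_i - s_j)})$ w.r.t. $s_i$, the lists are given in current rank order, and NDCG uses the $2^{rel} - 1$ gain. The helper names and data are hypothetical.

```python
import math

def dcg(rels):
    """DCG of relevance labels listed in ranked order."""
    return sum((2 ** r - 1) / math.log2(pos + 2) for pos, r in enumerate(rels))

def delta_ndcg(rels, i, j):
    """Absolute change in NDCG if the documents at rank positions i and j swap."""
    ideal = dcg(sorted(rels, reverse=True))
    swapped = list(rels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(rels)) / ideal

def lambda_ij(scores, rels, i, j, gamma=1.0):
    """RankNet gradient w.r.t. s_i (d_i preferred over d_j), scaled by |delta NDCG|."""
    ranknet_grad = -gamma / (1.0 + math.exp(gamma * (scores[i] - scores[j])))
    return ranknet_grad * delta_ndcg(rels, i, j)

# Current ranking (best score first) with binary labels.
scores = [5.0, 4.0, 3.0, 2.0, 1.0]
rels = [0, 1, 0, 0, 1]

top_pair = lambda_ij(scores, rels, i=1, j=0)      # relevant doc stuck at rank 2
bottom_pair = lambda_ij(scores, rels, i=4, j=3)   # relevant doc stuck at rank 5
print(top_pair, bottom_pair)
```

Both pairs have the same score difference, so their RankNet gradients are equal. The pair near the top receives the larger push because fixing it changes NDCG more.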
LISTWISE
OBJECTIVES
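According to the Plackett-Luce model [Luce,
2005], given four items $\{d_1, d_2, d_3, d_4\}$ the
probability of observing a particular rank-order,
say $[d_2, d_1, d_4, d_3]$, is given by:
$p(\pi \mid s) = \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_4)}{\phi(s_3) + \phi(s_4)}$
where $\pi$ is a particular permutation and $\phi$ is a
transformation (e.g., linear, exponential, or
sigmoid) over the score $s_i$ corresponding to item
$d_i$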
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
ListNet loss
Cao et al. [2007] propose to compute the
probability distribution over all possible
permutations based on model score and ground-
truth labels. The loss is then given by the K-L
divergence between these two distributions.
This is computationally very costly. Computing
permutations of only the top-K items makes it
slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the
probability of the ideal permutation based on the
ground truth. However, with categorical labels
more than one permutation is possible.
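A sketch of the Plackett-Luce permutation probability and a ListMLE-style loss built on it, assuming an exponential transformation $\phi$ and made-up scores. Items are picked one at a time, each with probability proportional to $\phi$ of its score among the items not yet placed.

```python
import itertools
import math

def plackett_luce_prob(scores, permutation, phi=math.exp):
    """Probability of observing `permutation` (item indices, best first)."""
    remaining = list(permutation)
    prob = 1.0
    while remaining:
        denom = sum(phi(scores[i]) for i in remaining)
        prob *= phi(scores[remaining[0]]) / denom
        remaining.pop(0)
    return prob

def listmle_loss(scores, ideal_permutation):
    """Negative log-likelihood of the ideal permutation under the model scores."""
    return -math.log(plackett_luce_prob(scores, ideal_permutation))

scores = [1.0, 2.0, 0.2, 0.5]          # hypothetical scores for d1..d4
order = [1, 0, 3, 2]                   # the rank-order d2, d1, d4, d3
p_order = plackett_luce_prob(scores, order)
total = sum(plackett_luce_prob(scores, perm)
            for perm in itertools.permutations(range(4)))
loss = listmle_loss(scores, order)
print(p_order, total, loss)
```

The probabilities over all 4! = 24 permutations sum to one. That factorial growth is the reason ListNet over full permutations becomes expensive for longer lists.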
LISTWISE
OBJECTIVES
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
Smooth DCG
Wu et al. [2009] compute a “smooth” rank of
documents as a function of their scores
This “smooth” rank can be plugged into a
ranking metric, such as MRR or DCG, to
produce a smooth ranking loss
DEEP LEARNING TO RANK
60 MINS
THE STATE OF NEURAL INFORMATION RETRIEVAL
GROWING PUBLICATION POPULARITY
AT TOP IR CONFERENCES
STRONG PERFORMANCE AGAINST
TRADITIONAL METHODS IN TREC 2019
LATENT REPRESENTATION LEARNING FOR TEXT
Inspecting non-query terms in the document may reveal important clues about whether the
document is relevant to the query
[Figure: query “albuquerque” with two passages, one about Albuquerque and one not about Albuquerque]
DEEP STRUCTURED
SEMANTIC MODEL
• Learn latent dense vector representation of
query and document text
• Relevance is estimated by cosine similarity
between query and document
embeddings
• Relevant document embeddings should
be more similar to query embeddings than
non-relevant document embeddings
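A minimal sketch of scoring by cosine similarity between embeddings. The three vectors are invented stand-ins for what the query and document encoders would produce. No actual DSSM weights are involved.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings produced by the query and document encoders.
query_emb = [0.9, 0.1, 0.3]
relevant_doc_emb = [0.8, 0.2, 0.4]
nonrelevant_doc_emb = [-0.1, 0.9, 0.2]

score_pos = cosine(query_emb, relevant_doc_emb)
score_neg = cosine(query_emb, nonrelevant_doc_emb)
print(score_pos, score_neg)
```

Training pushes the encoders toward embeddings for which `score_pos` exceeds `score_neg`.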
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
BUT HOW CAN WE INPUT TEXT INTO A
NEURAL MODEL?
DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
DEEP STRUCTURED
SEMANTIC MODEL
To train the model we can use any of the loss
functions we learned about in the last lecture
Cross-entropy loss against randomly sampled
negative documents is commonly used
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
SHIFT-INVARIANT
NEURAL OPERATIONS
Detecting a pattern in one part of the input space is similar to
detecting it in another
Leverage redundancy by moving a window over the whole
input space and then aggregate
On each instance of the window a kernel—also known as a
filter or a cell—is applied
Different aggregation strategies lead to different architectures
CONVOLUTION
Move the window over the input space each time applying the
same cell over the window
A typical cell operation can be,
$h = \sigma(WX + b)$
Full Input [words x in_channels]
Cell Input [window x in_channels]
Cell Output [1 x out_channels]
Full Output [1 + (words – window) / stride x out_channels]
POOLING
Move the window over the input space, each time applying an
aggregate function over each dimension within the window
$h_j = \max_{i \in win} X_{i,j}$ or $h_j = \mathrm{avg}_{i \in win} X_{i,j}$
Full Input [words x channels]
Cell Input [window x channels]
Cell Output [1 x channels]
Full Output [1 + (words – window) / stride x channels]
[Figure panels: max-pooling, average-pooling]
CONVOLUTION W/
GLOBAL POOLING
Stacking a global pooling layer on top of a convolutional layer
is a common strategy for generating a fixed length embedding
for a variable length text
Full Input [words x in_channels]
Full Output [1 x out_channels]
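A plain-Python sketch of a 1-D convolution over a word sequence followed by global max-pooling. It shows how texts of different lengths map to embeddings of the same size. The tanh non-linearity, the weight values, and the input vectors are arbitrary illustrative choices.

```python
import math

def conv1d(x, weight, bias, window, stride=1):
    """x: [words][in_channels]; weight: [out_channels][window * in_channels].

    Returns [1 + (words - window) // stride][out_channels].
    """
    outputs = []
    for start in range(0, len(x) - window + 1, stride):
        # Flatten the window into one vector and apply the same cell to it.
        flat = [v for row in x[start:start + window] for v in row]
        outputs.append([math.tanh(sum(w * v for w, v in zip(w_row, flat)) + b)
                        for w_row, b in zip(weight, bias)])
    return outputs

def global_max_pool(h):
    """Max over all window positions, separately for each channel."""
    return [max(col) for col in zip(*h)]

window, in_channels = 3, 2
weight = [[0.1] * (window * in_channels), [-0.2] * (window * in_channels)]  # 2 out_channels
bias = [0.0, 0.1]

short_text = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.2, 0.2], [0.4, 0.1]]   # 5 words
long_text = short_text + [[0.3, 0.3], [0.1, 0.0], [0.2, 0.6]]               # 8 words

conv_short = conv1d(short_text, weight, bias, window)
emb_short = global_max_pool(conv_short)
emb_long = global_max_pool(conv1d(long_text, weight, bias, window))
print(len(conv_short), emb_short, emb_long)
```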
RECURRENCE
Similar to a convolution layer but additional dependency on
previous hidden state
A simple cell operation is shown below, but others like LSTMs and
GRUs are more popular in practice,
$h_i = \sigma(W X_i + U h_{i-1} + b)$
Full Input [words x in_channels]
Cell Input [window x in_channels] + [1 x out_channels]
Cell Output [1 x out_channels]
Full Output [1 x out_channels]
CONVOLUTIONAL
DSSM (CDSSM)
Replace bag-of-words assumption by concatenating
term vectors in a sequence on the input
Convolution followed by global max-pooling
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
INTERACTION-BASED
NETWORKS
Typically a document is relevant if some part of the
document contains information relevant to the query
Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing
the ith window over query terms with the jth window over the
document terms—captures evidence of relevance from
different parts of the document
Additional neural network layers can inspect the interaction
matrix and aggregate the evidence to estimate overall
relevance
Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
KERNEL POOLING
Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017.
Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Mitra et al. [2017] argue that both lexical and
semantic matching are important for
document ranking
Duet model is a linear combination of two
DNNs—focusing on lexical and semantic
matching, respectively—jointly trained on
labelled data
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
LEXICAL AND SEMANTIC
MATCHING NETWORKS
Lexical sub-model operates over input matrix 𝑋
$x_{i,j} = \begin{cases} 1, & \text{if } t_{q,i} = t_{d,j} \\ 0, & \text{otherwise} \end{cases}$
In relevant documents,
1. Many matches, typically in clusters
2. Matches localized early in document
3. Matches for all query terms
4. In-order (phrasal) matches
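The exact-match input matrix is simple to build. The sketch below assumes whitespace-tokenized, lowercased text, with a made-up query and document.

```python
def exact_match_matrix(query_terms, doc_terms):
    """x[i][j] = 1 if the i-th query term equals the j-th document term, else 0."""
    return [[1 if q == d else 0 for d in doc_terms] for q in query_terms]

query = "surface book".split()
document = "the new surface book is a laptop by microsoft".split()

X = exact_match_matrix(query, document)
for term, row in zip(query, X):
    print(f"{term:>8}", row)
```

In this example the matches sit next to each other, early in the document, in query order, and cover all query terms. These are the patterns listed above as signs of a relevant document.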
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
Duet implementation on PyTorch
https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb
GET THE CODE
MANY OTHER NEURAL ARCHITECTURES
(Palangi et al., 2015)
(Kalchbrenner et al., 2014)
(Denil et al., 2014)
(Kim, 2014)
(Severyn and Moschitti, 2015)
(Zhao et al., 2015) (Hu et al., 2014)
(Tai et al., 2015)
(Guo et al., 2016)
(Hui et al., 2017)
(Pang et al., 2017)
(Jaech et al., 2017)
(Dehghani et al., 2017)
Impact across both academia and industry
BERT FOR RANKING
ATTENTION
Given a set of n items and an input context, produce a
probability distribution {a1, …, ai, …, an} of attending to each item
as a function of similarity between a learned representation (q)
of the context and learned representations (ki) of the items
$a_i = \frac{\varphi(q, k_i)}{\sum_{j}^{n} \varphi(q, k_j)}$
The aggregated output is given by $\sum_{i}^{n} a_i \cdot v_i$
Full Input [words x in_channels], [1 x ctx_channels]
Full Output [1 x out_channels]
* When attending over a sequence (and not a set), the key k and value
v are typically a function of the item and some encoding of the position
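A small sketch of the attention computation, assuming $\varphi(q, k) = e^{q \cdot k}$ (the exponentiated dot product, which makes the weights a softmax). The choice of $\varphi$ and all vectors are illustrative. In a real model q, k, and v come from learned projections.

```python
import math

def attention(q, keys, values):
    """Return (attention weights, aggregated output) for one context vector q."""
    sims = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
    total = sum(sims)
    weights = [s / total for s in sims]
    channels = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]
    return weights, output

q = [1.0, 0.0]                                  # representation of the context
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # one key per item
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

weights, output = attention(q, keys, values)
print(weights, output)
```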
SELF ATTENTION
Given a sequence (or set) of n items, treat each item as the
context at a time and attend over the whole sequence (or set),
and repeat for all n items
Full Input [words x in_channels]
Full Output [words x out_channels]
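A sketch of self-attention in which every item takes a turn as the context. For brevity the items themselves serve as queries, keys, and values, and similarity is the exponentiated dot product. Both are simplifying assumptions, since real models apply learned projections first.

```python
import math

def attend(q, keys, values):
    """Softmax-weighted average of `values`, weighted by similarity of q to each key."""
    sims = [math.exp(sum(a * b for a, b in zip(q, k))) for k in keys]
    total = sum(sims)
    return [sum((s / total) * v[c] for s, v in zip(sims, values))
            for c in range(len(values[0]))]

def self_attention(items):
    """Treat each item as the context in turn and attend over the whole sequence."""
    return [attend(x, items, items) for x in items]

words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.9]]   # [words x in_channels]
contextualized = self_attention(words)                      # [words x out_channels]
for row in contextualized:
    print(row)
```

The output has one vector per input item, which matches the [words x out_channels] shape stated above.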
TRANSFORMERS
A transformer layer consists of a combination of a
self-attention layer and multiple fully-connected or
convolutional layers, with residual connections
A transformer-based encoder can consist of multiple
transformer layers stacked in sequence
Full Input [words x in_channels]
Full Output [words x out_channels]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
LANGUAGE MODELING
A family of language modeling tasks has been
explored in the literature, including:
• Predict next word in a sequence
• Predict masked word in a sequence
• Predict next sentence
Fundamentally the same idea as word2vec and older
neural LMs—but with deeper models and considering
dependencies across longer distances between terms
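[Figure: the input sequence w1, w2, [MASK], w4 is fed to the model, which predicts the masked word; the loss compares the prediction against the true word w3]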
CONTEXTUALIZED
DEEP WORD
EMBEDDINGS
http://jalammar.github.io/illustrated-bert/
Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
BERT
Stacked transformer layers
Pretrained on two tasks:
• Masked language modeling
• Next sentence prediction
Input: WordPiece embedding +
position embedding + segment
embedding
Jacob Devlin, Ming-Wei Chang, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
BERT FOR RANKING
BERT (and other large-scale unsupervised language models) are
demonstrating dramatic performance improvements on many IR tasks
Rodrigo Nogueira, and Kyunghyun Cho. Passage Re-ranking with BERT. In arXiv, 2019.
MS MARCO
Query Passage Pair
Query Passage
score
DEEP LEARNING
@ TREC
If you are looking for interesting research
topics at the intersection of machine learning
and search, come participate in the track!
GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA
Past: Proprietary data
200K queries, human-labeled, proprietary
Mitra, Diaz and Craswell. Learning to match using local
and distributed representations of text for web search.
WWW 2017
Past: Weak supervision
1+M queries, weak supervision, open
Dehghani, Zamani, Severyn, Kamps and Croft.
Neural ranking models with weak supervision.
SIGIR 2017
Here: Two new datasets
300+K queries, human-labeled, open
TREC 2019 Deep Learning Track
[Figure axes: More data, Better search results]
GENERATING PUBLIC BENCHMARKS FOR NEURAL IR
RESEARCH
A public retrieval and ranking benchmark
with large scale training data (~400K
queries with manual relevance labels)
DERIVING OUR TREC 2019 DATASETS
MS MARCO QnA
Leaderboard
• 1M real queries
• 10 passages per Q
• Human annotation
says ~1 of 10
answers the query
MS MARCO Passage
Retrieval Leaderboard
• Corpus: Union of
10-passage sets
• Labels: From the
~1 positive passage
TREC 2019 Task:
Passage Retrieval
• Same corpus,
training Q+labels
• New reusable NIST
test set
TREC 2019 Task:
Document Retrieval
• Corpus:
Documents (crawl
passage urls)
• Labels: Transfer
from passage to
doc
• New reusable NIST
test set
http://msmarco.org
https://microsoft.github.io/TREC-2019-Deep-Learning/
SETUP OF THE 2019 DEEP LEARNING TRACK
• Key question: What works best in a large-data regime?
  • “nnlm”: Runs that use a BERT-style language model
  • “nn”: Runs that do representation learning
  • “trad”: Runs using only traditional IR features (such as BM25 and RM3)
• Subtasks:
  • “fullrank”: End-to-end retrieval
  • “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.
Task | Training data | Test data | Corpus
1) Document retrieval | 367K queries w/ doc labels | 43* queries w/ doc labels | 3.2M documents
2) Passage retrieval | 502K queries w/ pass labels | 43* queries w/ pass labels | 8.8M passages
* Mostly-overlapping query sets (41 shared)
DATASET AVAILABILITY
• Corpus + train + dev data for both tasks
available now from the DL Track site*
• NIST test sets available to participants now
• [Broader availability in Feb 2020]
* https://microsoft.github.io/TREC-2019-Deep-Learning/
SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
THANK YOU
NEXT: LEARNING TO RANK LAB SESSIONS @ 1PM

Contenu connexe

Tendances

Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Learning to Rank with Neural Networks

  • 1. AFIRM: ACM SIGIR/SIGKDD Africa Summer School on Machine Learning for Data Mining and Search Learning to Rank with Neural Networks Instructors Bhaskar Mitra, Microsoft & University College London, Canada Nick Craswell, Microsoft, USA Emine Yilmaz, University College London Daniel Campos, Microsoft, USA January 2020
  • 2. The Instructors BHASKAR MITRA NICK CRASWELL EMINE YILMAZ DANIEL CAMPOS Microsoft, USA nickcr@microsoft.com @nick_craswell Microsoft, USA dacamp@microsoft.com @spacemanidol Microsoft & UCL, Canada bmitra@microsoft.com @underdoggeek UCL & Microsoft, Canada emine.yilmaz@ucl.ac.uk @xxEmineYilmazxx
  • 3. Download the slides: http://bit.ly/ltr-nn-afirm2020 Download the free book: http://bit.ly/neuralir-intro Download the lab exercises: https://github.com/spacemanidol/AFIRMDeepLearning2020 Download TREC Deep Learning Track data: https://microsoft.github.io/TREC-2019-Deep-Learning/ RESOURCES
  • 6. VECTORS, MATRICES, AND TENSORS Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66 Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/ matrix transpose matrix addition dot product matrix multiplication
  • 7.
  • 8. SUPERVISED LEARNING Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
  • 9. NEURAL NETWORKS A simple neural network transforms an input feature vector to produce an output vector by applying a sequence of parameterized linear transforms (e.g., matrix multiply with weights, add bias vector) and element-wise non-linear transforms (e.g., tanh, ReLU). The parameters are trained using gradient descent to minimize some loss function specified over predicted and expected outputs. Many choices of architecture and hyper-parameters. (Diagram labels: Input, Linear transform, Non-linearity, Linear transform, Non-linearity, Predicted output, Expected output, loss, forward pass, backward pass. Activation plots: Tanh, ReLU.)
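As a rough illustration of the slide above, the following sketch runs one forward pass through a small network in plain Python. The layer sizes, the weight values, and the helper names (`linear`, `forward`) are assumptions chosen for the example; none of them come from the slides.

```python
import math

def linear(x, weights, bias):
    """Linear transform: multiply by a weight matrix, then add a bias vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def tanh(v):
    """Element-wise tanh non-linearity."""
    return [math.tanh(a) for a in v]

def relu(v):
    """Element-wise ReLU non-linearity."""
    return [max(0.0, a) for a in v]

# Hypothetical parameters: 3 input features -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.5]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]
b2 = [0.2]

def forward(x):
    """Forward pass: linear -> ReLU -> linear -> tanh."""
    hidden = relu(linear(x, W1, b1))
    return tanh(linear(hidden, W2, b2))

output = forward([1.0, 2.0, 3.0])
print(output)
```

Training would adjust `W1`, `b1`, `W2`, and `b2`; here they are fixed so that only the sequence of transforms is visible.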
  • 11. SQUARED LOSS The squared loss is a popular loss function for regression tasks
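The slide's formula did not survive extraction. One common way to write the squared loss for a single training example is sketched below, with $y$ as the expected output and $\hat{y}$ as the predicted output. The symbol $\hat{y}$ is an assumption made here; the later SGD slides write the prediction as $y_2$ and use the same form, $(y - y_2)^2$.

```latex
l_{\text{squared}} = (y - \hat{y})^2
```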
  • 12. THE SOFTMAX FUNCTION In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
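The slide's formula is missing from the transcript. The standard definition is softmax(z)_i = exp(z_i) / sum_j exp(z_j), and the sketch below implements it in plain Python. The function name and the example scores are assumptions for illustration.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    # Subtracting the max score leaves the result unchanged
    # but avoids overflow inside exp().
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Higher scores receive higher probabilities, and the ordering of the classes is preserved.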
  • 13. CROSS ENTROPY The cross entropy between two probability distributions $p$ and $q$ over a discrete set of events is given by $CE(p, q) = -\sum_i p_i \log(q_i)$. If $p_{correct} = 1$ and $p_i = 0$ for all other values of $i$, then $CE(p, q) = -\log(q_{correct})$.
  • 14. CROSS ENTROPY WITH SOFTMAX LOSS Cross entropy with softmax is a popular loss function for classification
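A minimal sketch of this loss, assuming the expected distribution puts all its mass on one correct class, so the loss reduces to the negative log of the softmax probability of that class. The function names and the example scores are assumptions for illustration.

```python
import math

def log_softmax(scores):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

def cross_entropy_with_softmax(scores, correct_index):
    """Negative log-probability that softmax assigns to the correct class."""
    return -log_softmax(scores)[correct_index]

# The correct class is index 0 in both cases.
loss_good = cross_entropy_with_softmax([5.0, 0.0, 0.0], correct_index=0)
loss_bad = cross_entropy_with_softmax([0.0, 5.0, 0.0], correct_index=0)
print(loss_good, loss_bad)
```

The loss is small when the correct class has the highest score and grows as the model puts more weight on a wrong class.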
  • 15. STOCHASTIC GRADIENT DESCENT (SGD) Task: regression. Training data: $(x, y)$ pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node). Learnable parameters: $w_1, b_1, w_2, b_2$. Forward pass: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, $l = (y - y_2)^2$. Goal: iteratively update the learnable parameters such that the loss $l$ is minimized. Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$): $\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$. Update the parameter value based on the gradient, with $\eta$ as the learning rate: $w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$ …and repeat.
  • 16. STOCHASTIC GRADIENT DESCENT (SGD) Same setup and update rule as slide 15, with the loss term written out: $\frac{\partial l}{\partial w_1} = \frac{\partial (y - y_2)^2}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
  • 17. STOCHASTIC GRADIENT DESCENT (SGD) Differentiating the loss term: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
  • 18. STOCHASTIC GRADIENT DESCENT (SGD) Substituting $y_2 = \tanh(w_2 \cdot y_1 + b_2)$: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial \tanh(w_2 \cdot y_1 + b_2)}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
  • 19. STOCHASTIC GRADIENT DESCENT (SGD) Differentiating the outer tanh: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial y_1}{\partial w_1}$
  • 20. STOCHASTIC GRADIENT DESCENT (SGD) Substituting $y_1 = \tanh(w_1 \cdot x + b_1)$: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial \tanh(w_1 \cdot x + b_1)}{\partial w_1}$
  • 21. STOCHASTIC GRADIENT DESCENT (SGD) Differentiating the inner tanh gives the fully expanded gradient: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \left(1 - \tanh^2(w_1 \cdot x + b_1)\right) \times x$
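The derivation on these slides can be sketched in plain Python: the same one-feature, one-hidden-node tanh network, its chain-rule gradients, and the SGD update rule. The training data, initial parameter values, learning rate, and epoch count below are assumptions chosen for the example, not values from the slides.

```python
import math
import random

def forward(params, x):
    """Forward pass: y1 = tanh(w1*x + b1), y2 = tanh(w2*y1 + b2)."""
    w1, b1, w2, b2 = params
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    return y1, y2

def loss(params, x, y):
    """Squared loss l = (y - y2)^2."""
    _, y2 = forward(params, x)
    return (y - y2) ** 2

def gradients(params, x, y):
    """Chain-rule gradients of l w.r.t. w1, b1, w2, b2 (in that order)."""
    w1, b1, w2, b2 = params
    y1, y2 = forward(params, x)
    dl_dy2 = -2.0 * (y - y2)
    dl_dz2 = dl_dy2 * (1.0 - y2 ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dl_dy1 = dl_dz2 * w2
    dl_dz1 = dl_dy1 * (1.0 - y1 ** 2)
    return [dl_dz1 * x, dl_dz1, dl_dz2 * y1, dl_dz2]

def sgd_step(params, x, y, eta):
    """One update: parameter_new = parameter_old - eta * gradient."""
    grads = gradients(params, x, y)
    return [p - eta * g for p, g in zip(params, grads)]

# Hypothetical data: targets stay inside (-1, 1) so a tanh output can reach them.
data = [(i / 10.0, 0.8 * math.tanh(2.0 * i / 10.0)) for i in range(-10, 11)]

random.seed(0)
params = [0.5, 0.0, 0.5, 0.0]  # assumed initial values for w1, b1, w2, b2
initial = sum(loss(params, x, y) for x, y in data) / len(data)
for epoch in range(200):
    random.shuffle(data)            # "stochastic": visit examples in random order
    for x, y in data:
        params = sgd_step(params, x, y, eta=0.1)
final = sum(loss(params, x, y) for x, y in data) / len(data)
print(initial, final)
```

The first entry returned by `gradients` is the expression on slide 21; the other three follow from the same chain rule applied to $b_1$, $w_2$, and $b_2$.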
  • 22. COMPUTATION NETWORKS The “Lego” approach to specifying neural architectures Library of neural layers, each layer defines logic for: 1. Forward pass: compute layer output given layer input 2. Backward pass: a) compute gradient of layer output w.r.t. layer inputs b) compute gradient of layer output w.r.t. layer parameters (if any) Chain nodes to create bigger and more complex networks
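As a rough illustration of the "Lego" idea, the sketch below defines two scalar layers that each know their own forward and backward logic, then chains them. The class and function names are hypothetical, and each backward pass multiplies the local gradient by the gradient arriving from the layer above (the chain rule); this is a toy under those assumptions, not the design of any particular toolkit.

```python
import math

class Linear:
    """Scalar linear layer: out = w * x + b."""
    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return self.w * x + self.b

    def backward(self, grad_out):
        # Gradients w.r.t. the layer parameters
        self.grad_w = grad_out * self.x
        self.grad_b = grad_out
        # Gradient w.r.t. the layer input, passed to the previous layer
        return grad_out * self.w

class Tanh:
    """Element-wise non-linearity with no parameters."""
    def forward(self, x):
        self.out = math.tanh(x)
        return self.out

    def backward(self, grad_out):
        return grad_out * (1.0 - self.out ** 2)

def run_chain(layers, x):
    # Forward pass: feed each layer's output into the next layer
    for layer in layers:
        x = layer.forward(x)
    return x

def backprop(layers, grad_out):
    # Backward pass: propagate the gradient through the layers in reverse
    for layer in reversed(layers):
        grad_out = layer.backward(grad_out)
    return grad_out
```

Chaining `[Linear, Tanh, Linear, Tanh]` reproduces the small regression network from the SGD slides, with the parameter gradients left on each `Linear` instance after `backprop`.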
  • 23. TOOLKITS A diverse set of options to choose from! Figure from https://towardsdatascience.com/battle-of-the-deep-learning-frameworks-part-i-cff0e3841750
  • 24. TRAINING A SIMPLE IMAGE CLASSIFIER W/ PYTORCH First, we define the model architecture Next, we specify loss function and optimization algorithm Finally, loop over training data to optimize model parameters https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py
  • 25. REALLY DEEP NEURAL NETWORKS (Larsson et al., 2016) (He et al., 2015) (Szegedy et al., 2014)
  • 26. VISUAL MOTIVATION FOR HIDDEN LAYERS Consider the following “toy” challenge for classifying tech queries: Vocab: {surface, kerberos, book, library}. Labels: “surface book”, “kerberos library” ✓; “kerberos surface”, “library book” ✗.
      Input features                       Label
      surface  kerberos  book  library
      1        0         1     0           ✓
      1        1         0     0           ✗
      0        1         0     1           ✓
      0        0         1     1           ✗
    Can’t separate using a linear model! But let’s consider a tiny neural network with one hidden layer… [Figure: network with inputs surface, kerberos, book, library; hidden nodes H1 and H2; edge weights +0.5, +0.5, -1, -1, -1, -1, +1, +1, +0.5, +0.5]
  • 27. VISUAL MOTIVATION FOR HIDDEN LAYERS Consider the following “toy” challenge for classifying tech queries: Vocab: {surface, kerberos, book, library}. Labels: “surface book”, “kerberos library” ✓; “kerberos surface”, “library book” ✗. But let’s consider a tiny neural network with one hidden layer… Or more succinctly:
      Input features                       Hidden layer   Label
      surface  kerberos  book  library     H1   H2
      1        0         1     0           1    0         ✓
      1        1         0     0           0    0         ✗
      0        1         0     1           0    1         ✓
      0        0         1     1           0    0         ✗
    Can separate using a linear model! [Figure: network with inputs surface, kerberos, book, library; hidden nodes H1 and H2; edge weights +0.5, +0.5, -1, -1, -1, -1, +1, +1, +0.5, +0.5]
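The hidden-layer table can be reproduced with a few lines of Python. The transcript lists the edge weights but not which edge each belongs to, so the assignment below is an assumption chosen to match the H1/H2 columns; the ReLU activation is likewise an assumption.

```python
# Assumed assignment of the figure's weights to edges (hypothetical, chosen
# so that the hidden activations match the table on the slide).
W_HIDDEN = {
    "H1": {"surface": 0.5, "book": 0.5, "kerberos": -1.0, "library": -1.0},
    "H2": {"kerberos": 0.5, "library": 0.5, "surface": -1.0, "book": -1.0},
}
W_OUT = {"H1": 1.0, "H2": 1.0}

def relu(z):
    # Assumed activation; the slide does not name one
    return max(0.0, z)

def hidden(query):
    # Bag-of-words input: each query term contributes its edge weight
    terms = query.split()
    return {name: relu(sum(weights[t] for t in terms))
            for name, weights in W_HIDDEN.items()}

def score(query):
    # Linear model on top of the hidden layer
    h = hidden(query)
    return sum(W_OUT[name] * value for name, value in h.items())
```

Under these assumptions the two ✓ queries score 1 and the two ✗ queries score 0, so a simple threshold on the output separates the classes.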
  • 28. WHY ADDING DEPTH HELPS http://playground.tensorflow.org
  • 29. THE LOTTERY TICKET HYPOTHESIS Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? arXiv preprint, 2019.
  • 31. x = 20 samples in range [0, 10): array([0.90958323, 1.92243063, 2.08584585, 2.20797776, 2.67748774, 2.74804427, 3.30582528, 4.61371217, 4.82180332, 5.05056425, 5.17943809, 5.24673789, 5.25498203, 6.51998081, 6.69507593, 7.41185813, 8.30728588, 8.49480071, 8.51663415, 9.65215509]) y = x**2 + noise: array([-6.98410155, -2.72058483, 1.83675969, 7.01024352, 8.57596003, 4.33476856, 18.5227248, 23.40644589, 24.19983813, 20.17728703, 21.67505609, 28.50143303, 39.2529178, 38.61850753, 50.06987467, 54.84458601, 75.04156323, 64.05990929, 69.81111628, 93.59637334])
  • 34. BIAS-VARIANCE TRADE-OFF IN THE DEEP LEARNING ERA Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
  • 35. MOST IR SYSTEMS PRESENT RANKED LISTS OF RETRIEVED INFORMATION ARTIFACTS
  • 36. THE UNREASONABLE EFFECTIVENESS OF SIMPLE LTR BASED APPROACHES
  • 37. LEARNING TO RANK (LTR) “... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” (Liu [2009]) Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf [Figure: Phase 1, Phase 2]
  • 38. LEARNING TO RANK (LTR) LTR models represent a rankable item (e.g., a document) given some context (e.g., a user-issued query) as a numerical vector $x \in \mathbb{R}^n$. The ranking model $f: x \to \mathbb{R}$ is trained to map the vector to a real-valued score such that relevant items are scored higher. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf [Figure: Phase 1, Phase 2]
  • 39. WHY IS RANKING CHALLENGING? Ideally: Train a machine learning model to optimize for a rank-based metric. Challenge: Rank-based metrics, such as DCG or MRR, are non-smooth / non-differentiable.
  • 40. WHY IS RANKING CHALLENGING? Examples of ranking metrics. Discounted Cumulative Gain (DCG): $DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$. Reciprocal Rank (RR): $RR@k = \max_{1 \le i \le k} \frac{rel_i}{i}$. Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable.
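Both metrics are straightforward to compute once the documents are in ranked order. The sketch below assumes `rels` is the list of relevance labels in ranked order (position 1 first); the function names are illustrative.

```python
import math

def dcg_at_k(rels, k):
    # DCG@k = sum over ranks i = 1..k of (2^rel_i - 1) / log2(i + 1)
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    # RR@k = max over ranks i = 1..k of rel_i / i; with binary labels this is
    # 1 / (rank of the first relevant item), or 0 if none is relevant
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)),
               default=0.0)
```

Because both functions depend on the scores only through the sort order, a small change in a model score either leaves the metric unchanged or makes it jump, which is why they cannot be optimized directly by gradient descent.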
  • 41. FEATURES Traditional LTR models employ hand-crafted features that encode IR insights. They can often be categorized as: Query-independent or static features (e.g., incoming link count and document length); Query-dependent or dynamic features (e.g., BM25); Query-level features (e.g., query length).
  • 42. FEATURES Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
  • 43. APPROACHES Liu [2009] categorizes different LTR approaches based on training objectives. Pointwise approach: Relevance label $y_{q,d}$ is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict $y_{q,d}$ given $x_{q,d}$. Pairwise approach: Pairwise preference between documents for a query ($d_i \succ d_j$ w.r.t. $q$) as label. Reduces to binary classification to predict the more relevant document. Listwise approach: Directly optimize for a rank-based metric, such as NDCG. This is difficult because these metrics are often not differentiable w.r.t. model parameters. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
  • 44. POINTWISE OBJECTIVES Regression loss: Given $(q, d)$, predict the value of $y_{q,d}$, e.g., square loss for binary or categorical labels, $\mathcal{L}_{squared} = \lVert y_{q,d} - f(x_{q,d}) \rVert^2$, where $y_{q,d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. [Figure: labels vs. prediction] Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  • 45. POINTWISE OBJECTIVES Classification loss: Given $(q, d)$, predict the class $y_{q,d}$, e.g., cross-entropy with softmax over categorical labels $Y$ [Li et al., 2008], $\mathcal{L}_{CE} = -\log p(y_{q,d} \mid q, d) = -\log \frac{e^{\gamma \cdot s_{y_{q,d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_y}}$, where $s_{y_{q,d}}$ is the model’s score for label $y_{q,d}$. [Figure: labels vs. prediction, range 0 to 1] Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
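The two pointwise objectives can be sketched as follows. This is a minimal illustration assuming a scalar label for the regression case and one model score per relevance grade for the classification case; the scaling factor `gamma` and the function names are assumptions for the example.

```python
import math

def squared_loss(label, prediction):
    # Regression: penalize the squared difference between label and prediction
    return (label - prediction) ** 2

def softmax_cross_entropy(scores, label, gamma=1.0):
    # Classification: scores[c] is the model score for relevance grade c and
    # label is the index of the true grade.
    # loss = -log( exp(gamma * s_label) / sum_c exp(gamma * s_c) )
    logits = [gamma * s for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]
```

Note that both losses look at one query-document pair at a time and ignore how the other documents for the same query are scored.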
  • 46. PAIRWISE OBJECTIVES Given $(q, d_i, d_j)$, predict the more relevant document. For $(q, d_i)$ and $(q, d_j)$: Feature vectors: $x_i$ and $x_j$. Model scores: $s_i = f(x_i)$ and $s_j = f(x_j)$. Pairwise loss minimizes the average number of inversions in ranking, i.e., $d_i \succ d_j$ w.r.t. $q$ but $d_j$ is ranked higher than $d_i$. Pairwise loss generally has the following form [Chen et al., 2009]: $\mathcal{L}_{pairwise} = \phi(s_i - s_j)$, where $\phi$ can be: • Hinge function $\phi(z) = \max(0, 1 - z)$ [Herbrich et al., 2000] • Exponential function $\phi(z) = e^{-z}$ [Freund et al., 2003] • Logistic function $\phi(z) = \log(1 + e^{-z})$ [Burges et al., 2005] • Others… Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
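The three choices of $\phi$ listed on the slide can be written down directly. The sketch below assumes the pairwise loss is applied to the score difference $s_i - s_j$ for a pair in which $d_i$ is the preferred document; the function names are illustrative.

```python
import math

def hinge(z):
    # phi(z) = max(0, 1 - z)
    return max(0.0, 1.0 - z)

def exponential(z):
    # phi(z) = exp(-z)
    return math.exp(-z)

def logistic(z):
    # phi(z) = log(1 + exp(-z))
    return math.log1p(math.exp(-z))

def pairwise_loss(s_i, s_j, phi=logistic):
    # d_i is preferred over d_j, so the loss shrinks as s_i - s_j grows
    return phi(s_i - s_j)
```

All three functions decrease as the preferred document's score pulls ahead; they differ in how strongly they penalize a badly inverted pair and in whether the loss ever reaches exactly zero (only the hinge does).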
  • 47. PAIRWISE OBJECTIVES RankNet loss: Pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]. Predicted probabilities: $p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma \cdot (s_i - s_j)}}$. Desired probabilities: $\bar{p}_{ij} = 1$ and $\bar{p}_{ji} = 0$. Computing cross-entropy between $\bar{p}$ and $p$: $\mathcal{L}_{RankNet} = -\bar{p}_{ij} \cdot \log p_{ij} - \bar{p}_{ji} \cdot \log p_{ji} = -\log p_{ij} = \log\left(1 + e^{-\gamma \cdot (s_i - s_j)}\right)$. [Figure: pairwise preference score, range 0 to 1] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
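The identity between the softmax form, the sigmoid form, and the final loss can be checked numerically. This is a small sketch in which `gamma` defaults to 1 and the function names are illustrative.

```python
import math

def ranknet_prob(s_i, s_j, gamma=1.0):
    # p_ij = 1 / (1 + exp(-gamma * (s_i - s_j)))
    return 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))

def ranknet_prob_softmax(s_i, s_j, gamma=1.0):
    # Equivalent two-way softmax form: exp(g*s_i) / (exp(g*s_i) + exp(g*s_j))
    e_i, e_j = math.exp(gamma * s_i), math.exp(gamma * s_j)
    return e_i / (e_i + e_j)

def ranknet_loss(s_i, s_j, gamma=1.0):
    # Cross-entropy with desired p_ij = 1: -log p_ij = log(1 + exp(-g*(s_i - s_j)))
    return math.log1p(math.exp(-gamma * (s_i - s_j)))
```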
  • 48. A GENERALIZED CROSS-ENTROPY LOSS An alternative loss function assumes a single relevant document $d^+$ and compares it against the full collection $D$. Predicted probabilities: $p(d^+ \mid q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$. The cross-entropy loss is then given by $\mathcal{L}_{CE}(q, d^+, D) = -\log p(d^+ \mid q) = -\log \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$. Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider only a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
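A sketch of this loss with a handful of sampled negatives in place of the full collection is shown below. The uniform sampling helper and all names are assumptions for illustration; the cited papers each use their own negative-sampling strategy.

```python
import math
import random

def cross_entropy_loss(pos_score, neg_scores, gamma=1.0):
    # -log( exp(g*s+) / (exp(g*s+) + sum over sampled negatives exp(g*s-)) )
    logits = [gamma * pos_score] + [gamma * s for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

def sample_negatives(collection, positive, k, rng):
    # Hypothetical helper: draw k documents uniformly, excluding the positive
    candidates = [d for d in collection if d != positive]
    return rng.sample(candidates, k)
```

With one positive and `k` negatives the denominator has only `k + 1` terms, so the cost per training example no longer depends on the size of the collection.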
  • 49. LISTWISE OBJECTIVES [Figure: two rankings of the same documents; blue: relevant, gray: non-relevant] NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors. Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks. But listwise metrics are non-continuous and non-differentiable [Burges, 2010]. Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
  • 50. LISTWISE OBJECTIVES LambdaRank loss: Burges et al. [2006] make two observations: 1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t. model scores). 2. It is desired that the gradient be bigger for pairs of documents that produce a bigger impact on NDCG by swapping positions. LambdaRank therefore multiplies the actual (RankNet) gradients by the change in NDCG obtained by swapping the rank positions of the two documents. Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
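One way to picture the LambdaRank idea is to scale the RankNet gradient for a document pair by the absolute NDCG change caused by swapping the two documents. The sketch below assumes `scores` and `rels` are listed in the current ranked order and that document `i` is more relevant than document `j`; the names and the default `gamma` are illustrative.

```python
import math

def dcg(rels):
    return sum((2 ** r - 1) / math.log2(pos + 1)
               for pos, r in enumerate(rels, start=1))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def delta_ndcg(rels, i, j):
    # |change in NDCG| if the documents at positions i and j swap places
    swapped = list(rels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(rels))

def lambda_ij(scores, rels, i, j, gamma=1.0):
    # Gradient of the RankNet loss log(1 + exp(-gamma * (s_i - s_j)))
    # w.r.t. s_i, scaled by the NDCG change from swapping i and j
    ranknet_grad = -gamma / (1.0 + math.exp(gamma * (scores[i] - scores[j])))
    return ranknet_grad * delta_ndcg(rels, i, j)
```

With equal scores, a mis-ordered pair at ranks 1 and 2 receives a larger gradient magnitude than a mis-ordered pair at ranks 3 and 4, which reflects the position-based discounting of NDCG.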
  • 51. LISTWISE OBJECTIVES According to the Plackett-Luce model [Luce, 1959], given four items $\{d_1, d_2, d_3, d_4\}$ the probability of observing a particular rank-order, say $[d_2, d_1, d_4, d_3]$, is given by: $p(\pi \mid s) = \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_4)}{\phi(s_3) + \phi(s_4)}$, where $\pi$ is a particular permutation and $\phi$ is a transformation (e.g., linear, exponential, or sigmoid) over the score $s_i$ corresponding to item $d_i$. ListNet loss: Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss: Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible. R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
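The Plackett-Luce probability of a permutation, and a ListMLE-style loss built on it, can be sketched as follows. The example assumes `scores` maps each item to its model score and uses the exponential as the default transformation $\phi$; the function names are illustrative.

```python
import math

def plackett_luce_prob(scores, permutation, phi=math.exp):
    # At each rank, the chosen item's transformed score is divided by the
    # sum of transformed scores of all items not yet placed.
    transformed = [phi(scores[item]) for item in permutation]
    prob = 1.0
    for pos in range(len(transformed)):
        prob *= transformed[pos] / sum(transformed[pos:])
    return prob

def listmle_loss(scores, ideal_permutation):
    # Negative log-likelihood of the ideal (ground-truth) permutation
    return -math.log(plackett_luce_prob(scores, ideal_permutation))
```

For example, `plackett_luce_prob(scores, ["d2", "d1", "d4", "d3"])` evaluates the three-factor product for the rank-order used on the slide (the fourth factor is always 1).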
  • 52. LISTWISE OBJECTIVES Smooth DCG: Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores. This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss. Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
  • 53. DEEP LEARNING TO RANK 60 MINS
  • 54. THE STATE OF NEURAL INFORMATION RETRIEVAL GROWING PUBLICATION POPULARITY AT TOP IR CONFERENCES STRONG PERFORMANCE AGAINST TRADITIONAL METHODS IN TREC 2019
  • 55. LATENT REPRESENTATION LEARNING FOR TEXT Inspecting non-query terms in the document may reveal important clues about whether the document is relevant to the query albuquerque Passage about Albuquerque Passage not about Albuquerque
  • 56. DEEP STRUCTURED SEMANTIC MODEL • Learn latent dense vector representation of query and document text • Relevance is estimated by cosine similarity between query and document embeddings • Relevant document embeddings should be more similar to query embeddings than non-relevant document embeddings Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
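The scoring step of such a model reduces to a cosine similarity between two dense vectors. The sketch below assumes the query and document embeddings have already been produced by some encoder (not shown); the function names and toy vectors are illustrative.

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|), defined as 0 for a zero vector
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    # doc_vecs: mapping from document id to embedding; most similar first
    return sorted(doc_vecs,
                  key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                  reverse=True)
```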
  • 57. BUT HOW CAN WE INPUT TEXT INTO A NEURAL MODEL?
  • 58. DIFFERENT MODALITIES OF INPUT TEXT REPRESENTATION
  • 62. DEEP STRUCTURED SEMANTIC MODEL To train the model we can use any of the loss functions we learned about in the last lecture Cross-entropy loss against randomly sampled negative documents is commonly used Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
  • 63. SHIFT-INVARIANT NEURAL OPERATIONS Detecting a pattern in one part of the input space is similar to detecting it in another Leverage redundancy by moving a window over the whole input space and then aggregate On each instance of the window a kernel—also known as a filter or a cell—is applied Different aggregation strategies lead to different architectures
  • 64. CONVOLUTION Move the window over the input space, each time applying the same cell over the window. A typical cell operation can be $h = \sigma(WX + b)$. Full input: [words x in_channels]. Cell input: [window x in_channels]. Cell output: [1 x out_channels]. Full output: [(1 + (words – window) / stride) x out_channels].
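The shapes on this slide can be traced with a small one-dimensional convolution over a word sequence. This sketch assumes nested Python lists for the tensors, `tanh` as the non-linearity $\sigma$, and a flattened window as the cell input; all names are illustrative.

```python
import math

def conv1d(inputs, weights, bias, window, stride=1):
    """inputs: [words][in_channels]
    weights: [out_channels][window * in_channels]
    bias: [out_channels]
    returns: [1 + (words - window) // stride][out_channels]"""
    outputs = []
    for start in range(0, len(inputs) - window + 1, stride):
        # Cell input: flatten the [window x in_channels] slice into one vector
        flat = [value for row in inputs[start:start + window] for value in row]
        # Cell output: h = sigma(W X + b), with tanh assumed for sigma
        outputs.append([
            math.tanh(sum(w * v for w, v in zip(w_row, flat)) + b)
            for w_row, b in zip(weights, bias)
        ])
    return outputs
```

The same `weights` and `bias` are reused at every window position, which is what makes the operation shift-invariant.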
  • 65. POOLING Move the window over the input space, each time applying an aggregate function over each dimension within the window: $h_j = \max_{i \in win} X_{i,j}$ (max-pooling) or $h_j = \mathrm{avg}_{i \in win} X_{i,j}$ (average-pooling). Full input: [words x channels]. Cell input: [window x channels]. Cell output: [1 x channels]. Full output: [(1 + (words – window) / stride) x channels].
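Pooling follows the same sliding-window pattern but has no parameters. The sketch below assumes nested lists shaped [words][channels]; the helper names are illustrative, and global pooling is shown as the special case where the window covers the whole input.

```python
def average(values):
    values = list(values)
    return sum(values) / len(values)

def pool1d(inputs, window, stride=1, aggregate=max):
    """inputs: [words][channels] -> [1 + (words - window) // stride][channels]"""
    outputs = []
    for start in range(0, len(inputs) - window + 1, stride):
        rows = inputs[start:start + window]
        # Aggregate each channel (column) independently within the window
        outputs.append([aggregate(column) for column in zip(*rows)])
    return outputs

def global_pool(inputs, aggregate=max):
    # A single window over the whole input gives one fixed-length vector
    return pool1d(inputs, window=len(inputs), aggregate=aggregate)[0]
```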
  • 66. CONVOLUTION W/ GLOBAL POOLING Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed length embedding for a variable length text Full Input [words x in_channels] Full Output [1 x out_channels]
  • 67. RECURRENCE Similar to a convolution layer, but with an additional dependency on the previous hidden state. A simple cell operation is shown below, but others like LSTMs and GRUs are more popular in practice: $h_i = \sigma(W X_i + U h_{i-1} + b)$. Full input: [words x in_channels]. Cell input: [window x in_channels] + [1 x out_channels]. Cell output: [1 x out_channels]. Full output: [1 x out_channels].
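The simple recurrent cell can be unrolled over a sequence in a few lines. This sketch assumes a window of one word per step, `tanh` as $\sigma$, and an all-zero initial hidden state; none of these choices is stated on the slide, and the names are illustrative.

```python
import math

def rnn(inputs, W, U, b):
    """inputs: [words][in_channels]
    W: [out_channels][in_channels], U: [out_channels][out_channels]
    b: [out_channels]
    returns the final hidden state, shape [out_channels]"""
    h = [0.0] * len(b)  # assumed initial hidden state
    for x in inputs:
        # h_i = sigma(W x_i + U h_{i-1} + b)
        h = [
            math.tanh(sum(w * xi for w, xi in zip(W[k], x))
                      + sum(u * hj for u, hj in zip(U[k], h))
                      + b[k])
            for k in range(len(b))
        ]
    return h
```

Only the last hidden state is returned here, matching the [1 x out_channels] full output on the slide; keeping every intermediate state would instead give a [words x out_channels] output.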
  • 68. CONVOLUTIONAL DSSM (CDSSM) Replace bag-of-words assumption by concatenating term vectors in a sequence on the input Convolution followed by global max-pooling Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
  • 69. INTERACTION-BASED NETWORKS Typically a document is relevant if some part of the document contains information relevant to the query Interaction matrix 𝑋—where 𝑥𝑖𝑗 is obtained by comparing the ith window over query terms with the jth window over the document terms—captures evidence of relevance from different parts of the document Additional neural network layers can inspect the interaction matrix and aggregate the evidence to estimate overall relevance Zhengdong Lu and Hang Li. A deep architecture for matching short texts. In NIPS, 2013.
  • 70. KERNEL POOLING Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In SIGIR, 2017. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In WSDM, 2018.
  • 71. LEXICAL AND SEMANTIC MATCHING NETWORKS Mitra et al. [2017] argue that both lexical and semantic matching are important for document ranking. The Duet model is a linear combination of two DNNs, focusing on lexical and semantic matching respectively, that are jointly trained on labelled data. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 72. LEXICAL AND SEMANTIC MATCHING NETWORKS Lexical sub-model operates over input matrix $X$: $x_{i,j} = \begin{cases} 1, & \text{if } t_{q,i} = t_{d,j} \\ 0, & \text{otherwise} \end{cases}$ In relevant documents: 1. Many matches, typically in clusters. 2. Matches localized early in the document. 3. Matches for all query terms. 4. In-order (phrasal) matches. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
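The binary input matrix for the lexical sub-model is simple to build. The sketch below assumes whitespace-tokenized, already-normalized terms; the function name is illustrative and is not taken from the Duet code base.

```python
def match_matrix(query_terms, doc_terms):
    # x[i][j] = 1 if the i-th query term equals the j-th document term, else 0
    return [[1 if q == d else 0 for d in doc_terms] for q in query_terms]
```

For the query "surface book" and the document "the surface book is a laptop", the two rows have their single 1 in adjacent columns (positions 2 and 3 of the document), which is the kind of in-order, clustered pattern the slide describes.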
  • 73. Duet implementation on PyTorch https://github.com/bmitra-msft/NDRM/blob/master/notebooks/Duet.ipynb GET THE CODE
  • 74. MANY OTHER NEURAL ARCHITECTURES (Palangi et al., 2015) (Kalchbrenner et al., 2014) (Denil et al., 2014) (Kim, 2014) (Severyn and Moschitti, 2015) (Zhao et al., 2015) (Hu et al., 2014) (Tai et al., 2015) (Guo et al., 2016) (Hui et al., 2017) (Pang et al., 2017) (Jaech et al., 2017) (Dehghani et al., 2017)
  • 75. Impact across both academia and industry BERT FOR RANKING
  • 76. ATTENTION Given a set of n items and an input context, produce a probability distribution $\{a_1, \ldots, a_i, \ldots, a_n\}$ of attending to each item as a function of similarity between a learned representation ($q$) of the context and learned representations ($k_i$) of the items: $a_i = \frac{\varphi(q, k_i)}{\sum_{j}^{n} \varphi(q, k_j)}$. The aggregated output is given by $\sum_{i}^{n} a_i \cdot v_i$. Full input: [words x in_channels], [1 x ctx_channels]. Full output: [1 x out_channels]. * When attending over a sequence (and not a set), the key $k$ and value $v$ are typically a function of the item and some encoding of the position.
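The attention computation can be sketched with plain lists. The example assumes the query, key, and value vectors are already given (the learned projections that produce them are not shown) and takes $\varphi(q, k) = \exp(q \cdot k)$, which is one common choice rather than something the slide specifies.

```python
import math

def attend(query, keys, values):
    # Similarity phi(q, k_i) = exp(q . k_i)  (assumed form of phi)
    sims = [math.exp(sum(q * k for q, k in zip(query, key))) for key in keys]
    total = sum(sims)
    # Attention weights a_i form a probability distribution over the items
    weights = [s / total for s in sims]
    # Aggregated output: sum_i a_i * v_i, computed per output dimension
    dim = len(values[0])
    output = [sum(a * v[d] for a, v in zip(weights, values)) for d in range(dim)]
    return weights, output
```

Calling `attend` once per item, with that item's own representation as the query, gives the self-attention pattern in which every item attends over the whole sequence.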
  • 77. SELF ATTENTION Given a sequence (or set) of n items, treat each item as the context at a time and attend over the whole sequence (or set), and repeat for all n items Full Input [words x in_channels] Full Output [words x out_channels]
  • 80. TRANSFORMERS A transformer layer consists of a combination of a self-attention layer and multiple fully-connected or convolutional layers, with residual connections. A transformer-based encoder can consist of multiple transformer layers stacked in sequence. Full input: [words x in_channels]. Full output: [words x out_channels]. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • 81. LANGUAGE MODELING A family of language modeling tasks has been explored in the literature, including: • Predict next word in a sequence • Predict masked word in a sequence • Predict next sentence. Fundamentally the same idea as word2vec and older neural LMs, but with deeper models and considering dependencies across longer distances between terms. [Figure: the input sequence w1, w2, [MASK], w4 is fed to the model, whose prediction for the masked position is compared against w3 to compute the loss]
  • 82. CONTEXTUALIZED DEEP WORD EMBEDDINGS http://jalammar.github.io/illustrated-bert/ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
  • 83. BERT Stacked transformer layers. Pretrained on two tasks: • Masked language modeling • Next sentence prediction. Input: WordPiece embedding + position embedding + segment embedding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
  • 84. BERT FOR RANKING BERT and other large-scale unsupervised language models are demonstrating dramatic performance improvements on many IR tasks. [Figure: an MS MARCO query-passage pair is fed to the model as query + passage, and the model outputs a score] Rodrigo Nogueira and Kyunghyun Cho. Passage Re-ranking with BERT. arXiv preprint, 2019.
  • 85. DEEP LEARNING @ TREC If you are looking for interesting research topics at the intersection of machine learning and search, come participate in the track!
  • 86. GOAL: LARGE, HUMAN-LABELED, OPEN IR DATA Past: Proprietary data (200K queries, human-labeled, proprietary). Mitra, Diaz and Craswell. Learning to match using local and distributed representations of text for web search. WWW 2017. Past: Weak supervision (1+M queries, weak supervision, open). Dehghani, Zamani, Severyn, Kamps and Croft. Neural ranking models with weak supervision. SIGIR 2017. Here: Two new datasets (300+K queries, human-labeled, open). TREC 2019 Deep Learning Track. [Figure axes: more data vs. better search results]
  • 87. GENERATING PUBLIC BENCHMARKS FOR NEURAL IR RESEARCH A public retrieval and ranking benchmark with large scale training data (~400K queries with manual relevance labels)
  • 88. DERIVING OUR TREC 2019 DATASETS MS MARCO QnA Leaderboard: • 1M real queries • 10 passages per Q • Human annotation says ~1 of 10 answers the query. MS MARCO Passage Retrieval Leaderboard: • Corpus: Union of 10-passage sets • Labels: From the ~1 positive passage. TREC 2019 Task, Passage Retrieval: • Same corpus, training Q+labels • New reusable NIST test set. TREC 2019 Task, Document Retrieval: • Corpus: Documents (crawl passage urls) • Labels: Transfer from passage to doc • New reusable NIST test set. http://msmarco.org https://microsoft.github.io/TREC-2019-Deep-Learning/
  • 89. SETUP OF THE 2019 DEEP LEARNING TRACK • Key question: What works best in a large-data regime? • “nnlm”: Runs that use a BERT-style language model • “nn”: Runs that do representation learning • “trad”: Runs using only traditional IR features (such as BM25 and RM3) • Subtasks: • “fullrank”: End-to-end retrieval • “rerank”: Top-k reranking. Doc: k=100 Indri QL. Pass: k=1000 BM25.
      Task                    Training data                 Test data                    Corpus
      1) Document retrieval   367K queries w/ doc labels    43* queries w/ doc labels    3.2M documents
      2) Passage retrieval    502K queries w/ pass labels   43* queries w/ pass labels   8.8M passages
    * Mostly-overlapping query sets (41 shared)
  • 90. DATASET AVAILABILITY • Corpus + train + dev data for both tasks available now from the DL Track site* • NIST test sets available to participants now • [Broader availability in Feb 2020] * https://microsoft.github.io/TREC-2019-Deep-Learning/
  • 91. SUMMARY OF TREC 2019 DEEP LEARNING TRACK RESULTS
  • 93. THANK YOU NEXT: LEARNING TO RANK LAB SESSIONS @ 1PM