Neural Learning to Rank
Bhaskar Mitra
Principal Applied Scientist, Microsoft
PhD candidate, University College London
@UnderdogGeek
Topics
A quick recap of neural networks
The fundamentals of learning to rank
Reading material
An Introduction to
Neural Information Retrieval
Foundations and Trends® in Information Retrieval
(December 2018)
Download PDF: http://bit.ly/fntir-neural
Most information retrieval
(IR) systems present a ranked
list of retrieved artifacts
Learning to Rank (LTR)
“... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.”
- Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
A quick recap of
neural networks
Vectors, matrices,
and tensors
Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
[Figure panels: matrix transpose, matrix addition, dot product, matrix multiplication]
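The four operations named on this slide can be sketched in plain Python, assuming a matrix is stored as a list of rows. The function names are illustrative and not taken from the slides; a real model would normally use a tensor library instead.

```python
# Plain-Python versions of the four operations; a matrix is a list of rows.

def transpose(a):
    """Swap the rows and columns of matrix a."""
    return [list(col) for col in zip(*a)]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def matmul(a, b):
    """Matrix product: entry (i, j) is the dot product of row i of a and column j of b."""
    bt = transpose(b)
    return [[dot(row, col) for col in bt] for row in a]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(transpose(A))               # [[1, 3], [2, 4]]
print(add(A, B))                  # [[6, 8], [10, 12]]
print(dot([1, 2, 3], [4, 5, 6]))  # 32
print(matmul(A, B))               # [[19, 22], [43, 50]]
```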
Supervised learning
Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
Neural networks
Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ)
Popular choices for σ: tanh and ReLU
Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
[Figure: input → linear transform → non-linearity → linear transform → non-linearity → predicted output (forward pass); the loss between the predicted and expected output is propagated back (backward pass)]
Basic machine
learning tasks
Squared loss
The squared loss is a popular loss function for regression tasks
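As a minimal sketch, the squared loss for one prediction is the squared difference between expected and predicted output. Averaging over a dataset, as in the second function, is a common convention and an assumption here; the function names are illustrative.

```python
def squared_loss(y_true, y_pred):
    """Squared difference between the expected and the predicted output."""
    return (y_true - y_pred) ** 2

def mean_squared_loss(targets, predictions):
    """Average of the per-sample squared losses (assumed aggregation)."""
    losses = [squared_loss(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

print(squared_loss(3.0, 1.0))                      # 4.0
print(mean_squared_loss([1.0, 2.0], [1.0, 4.0]))   # 2.0
```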
The softmax function
In neural classification models, the softmax function is popularly used
to normalize the neural network output scores across all the classes
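A small sketch of the softmax normalization: each score is exponentiated and divided by the sum of all exponentiated scores, so the outputs are positive and sum to one. Subtracting the maximum score first is a standard numerical-stability step added here; it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)        # largest score receives the largest probability
print(sum(probs))   # approximately 1.0
```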
Cross entropy
The cross entropy between two probability distributions $p$ and $q$ over a discrete set of events is given by,
$CE(p, q) = -\sum_i p_i \log(q_i)$
If $p_{correct} = 1$ and $p_i = 0$ for all other values of $i$ then,
$CE(p, q) = -\log(q_{correct})$
Cross entropy with
softmax loss
Cross entropy with softmax is a popular loss
function for classification
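A hedged sketch of this loss for a single example: when the target distribution puts all its mass on the correct class, the cross entropy with the softmax output reduces to the negative log of the softmax probability of that class. The log-sum-exp form below is one stable way to compute it; the function name and the `correct` index argument are illustrative.

```python
import math

def softmax_cross_entropy(scores, correct):
    """-log(softmax(scores)[correct]), computed via a stable log-sum-exp."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[correct]

print(softmax_cross_entropy([0.0, 0.0], 0))        # log(2), about 0.693
print(softmax_cross_entropy([5.0, 0.0, 0.0], 0))   # small: correct class scores highest
print(softmax_cross_entropy([5.0, 0.0, 0.0], 1))   # large: correct class scores low
```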
Gradient Descent
We are given training data: $\langle x, y \rangle$ pairs, where $x$ is input and $y$ is expected output
Step 1: Define model and randomly initialize learnable model parameters
Step 2: Given $x$, compute model output
Step 3: Given model output and $y$, compute loss $l$
Step 4: Compute gradient $\frac{\partial l}{\partial w}$ of loss $l$ w.r.t. each parameter $w$
Step 5: Update parameter as $w_{new} = w_{old} - \eta \times \frac{\partial l}{\partial w}$, where $\eta$ is learning rate
Step 6: Go back to step 2 and repeat till convergence
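The six steps can be sketched on a deliberately tiny problem. The one-parameter model $y = w \times x$, the toy data, the learning rate and the fixed number of passes (standing in for a convergence check) are all assumptions chosen for illustration.

```python
import random

def train(data, lr=0.05, steps=200, seed=0):
    """Fit y = w * x by gradient descent on the squared loss."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)             # Step 1: random initialization
    for _ in range(steps):                 # Step 6: repeat (fixed number of passes)
        for x, y in data:
            y_hat = w * x                  # Step 2: model output
            loss = (y - y_hat) ** 2        # Step 3: loss (kept for monitoring)
            grad = -2.0 * (y - y_hat) * x  # Step 4: dl/dw
            w = w - lr * grad              # Step 5: parameter update
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
w = train(data)
print(w)  # close to 2.0
```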
Gradient Descent
Task: regression
Training data: $x, y$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Forward pass: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, $l = (y - y_2)^2$
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$), expanding the chain rule one factor at a time:
$\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
$= \frac{\partial (y - y_2)^2}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
$= -2 \times (y - y_2) \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
$= -2 \times (y - y_2) \times \frac{\partial \tanh(w_2 \cdot y_1 + b_2)}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
$= -2 \times (y - y_2) \times (1 - \tanh^2(w_2 \cdot y_1 + b_2)) \times w_2 \times \frac{\partial y_1}{\partial w_1}$
$= -2 \times (y - y_2) \times (1 - \tanh^2(w_2 \cdot y_1 + b_2)) \times w_2 \times \frac{\partial \tanh(w_1 \cdot x + b_1)}{\partial w_1}$
$= -2 \times (y - y_2) \times (1 - \tanh^2(w_2 \cdot y_1 + b_2)) \times w_2 \times (1 - \tanh^2(w_1 \cdot x + b_1)) \times x$
Update the parameter value based on the gradient with $\eta$ as the learning rate
$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$
…and repeat
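As a sanity check on the chain-rule derivation for this two-layer tanh network, the sketch below evaluates the closed-form gradient w.r.t. $w_1$ and compares it with a central finite-difference estimate. The input, target and parameter values are arbitrary and chosen only for illustration.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Forward pass: returns the hidden activation y1 and the output y2."""
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    return y1, y2

def loss(x, y, w1, b1, w2, b2):
    _, y2 = forward(x, w1, b1, w2, b2)
    return (y - y2) ** 2

def grad_w1(x, y, w1, b1, w2, b2):
    """Closed-form dl/dw1 obtained by multiplying the three chain-rule factors."""
    y1, y2 = forward(x, w1, b1, w2, b2)
    dl_dy2 = -2.0 * (y - y2)
    dy2_dy1 = (1.0 - y2 ** 2) * w2   # (1 - tanh^2(w2*y1 + b2)) * w2
    dy1_dw1 = (1.0 - y1 ** 2) * x    # (1 - tanh^2(w1*x + b1)) * x
    return dl_dy2 * dy2_dy1 * dy1_dw1

x, y = 0.5, 0.3                          # arbitrary training pair
w1, b1, w2, b2 = 0.8, -0.1, 1.2, 0.05    # arbitrary parameter values
eps = 1e-6
numeric = (loss(x, y, w1 + eps, b1, w2, b2)
           - loss(x, y, w1 - eps, b1, w2, b2)) / (2 * eps)
analytic = grad_w1(x, y, w1, b1, w2, b2)
print(analytic, numeric)  # the two values should agree closely
```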
Exercise
Simple Neural Network from Scratch
Implement a simple multi-layer neural network with single input feature, single output, and single neuron per layer using (i) PyTorch and (ii) from scratch, and demonstrate that both approaches produce identical outcomes.
https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
Computation
Networks
The “Lego” approach to specifying neural architectures
Library of neural layers, each layer defines logic for:
1. Forward pass: compute layer output given layer input
2. Backward pass:
a) compute gradient of layer output w.r.t. layer inputs
b) compute gradient of layer output w.r.t. layer parameters (if any)
Chain nodes to create bigger and more complex networks
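A minimal sketch of the "Lego" idea for scalar inputs. Each layer class below is hypothetical (it is not the API of any particular library) and defines a forward pass plus a backward pass. In this sketch the backward pass receives the gradient of the loss w.r.t. the layer output and multiplies it by the local gradients, which is how the chain rule links the layers together.

```python
import math

class Linear:
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, x):
        self.x = x                        # cache the input for the backward pass
        return self.w * x + self.b
    def backward(self, grad_out):
        self.grad_w = grad_out * self.x   # gradient w.r.t. parameter w
        self.grad_b = grad_out            # gradient w.r.t. parameter b
        return grad_out * self.w          # gradient w.r.t. the layer input

class Tanh:
    def forward(self, x):
        self.out = math.tanh(x)
        return self.out
    def backward(self, grad_out):
        return grad_out * (1.0 - self.out ** 2)   # no parameters

class SquaredLoss:
    def forward(self, pred, target):
        self.diff = target - pred
        return self.diff ** 2
    def backward(self):
        return -2.0 * self.diff

# Chain the layers into a bigger network: Linear -> Tanh -> Linear -> Tanh.
layers = [Linear(0.8, -0.1), Tanh(), Linear(1.2, 0.05), Tanh()]
loss_fn = SquaredLoss()

def run(x, y):
    out = x
    for layer in layers:                  # forward pass
        out = layer.forward(out)
    loss = loss_fn.forward(out, y)
    grad = loss_fn.backward()
    for layer in reversed(layers):        # backward pass
        grad = layer.backward(grad)
    return loss

loss_value = run(0.5, 0.3)
print(loss_value, layers[0].grad_w)
```

Adding depth then amounts to appending more layer objects to the list; the forward and backward loops stay unchanged.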
Why adding depth helps
http://playground.tensorflow.org
Bias-variance trade-off
https://medium.com/@akgone38/what-the-heck-bias-variance-tradeoff-is-fe4681c0e71b
Bias-variance trade-off in the deep
learning era
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
The lottery ticket
hypothesis
Questions?
The fundamentals of
learning to rank
Problem formulation
LTR models represent a rankable item (e.g., a document, a movie, or a song) given some context (e.g., a user-issued query or the user's historical interactions with other items) as a numerical vector $x \in \mathbb{R}^n$
The ranking model $f: x \to \mathbb{R}$ is trained to map the vector to a real-valued score such that relevant items are scored higher.
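To make the formulation concrete, here is a sketch in which $f$ is assumed to be a simple linear function of the feature vector. The weights, item names and feature values are made up for illustration; any trained model that maps a vector to a score could take the place of `score`.

```python
def score(weights, x):
    """A linear stand-in for the ranking model f: R^n -> R."""
    return sum(w * xi for w, xi in zip(weights, x))

def rank(weights, items):
    """Sort item names by descending model score."""
    return sorted(items, key=lambda name: score(weights, items[name]), reverse=True)

weights = [1.0, 0.5]                  # hypothetical learned parameters
items = {                             # hypothetical feature vectors per item
    "a": [0.2, 0.1],
    "b": [0.9, 0.3],
    "c": [0.5, 0.0],
}
print(rank(weights, items))  # ['b', 'c', 'a']
```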
Why is ranking challenging?
Examples of ranking
metrics
Discounted Cumulative Gain (DCG)
$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$
Reciprocal Rank (RR)
$RR@k = \max_{1 \le i \le k} \frac{rel_i}{i}$
Rank based metrics, such as DCG and MRR, are non-smooth / non-differentiable
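Both metrics can be sketched directly from their definitions. The input is assumed to be the list of relevance labels in ranked order (position 1 first); for RR the labels are assumed to be binary.

```python
import math

def dcg_at_k(rels, k):
    """DCG@k: gain (2^rel - 1) discounted by log2(position + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    """RR@k with binary labels: 1 / position of the first relevant item, 0 if none."""
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

print(dcg_at_k([3, 2, 0], 3))   # 7 + 3 / log2(3)
print(rr_at_k([0, 0, 1], 3))    # 1/3
```

Because both functions depend on scores only through the resulting order of items, a small change in a model score either leaves the metric unchanged or makes it jump, which is why they are hard to optimize with gradients.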
Features
They can often be categorized as:
Query-independent or static features
e.g., incoming link count and document length
Query-dependent or dynamic features
e.g., BM25
Query-level features
e.g., query length
Traditional L2R models employ
hand-crafted features that
encode IR insights
Features
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
Approaches
Liu [2009] categorizes different LTR approaches based on training objectives:
Pointwise approach
Relevance label $y_{q,d}$ is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict $y_{q,d}$ given $x_{q,d}$.
Pairwise approach
Pairwise preference between documents for a query ($d_i \succ d_j$ w.r.t. $q$) as label. Reduces to binary classification to predict the more relevant document.
Listwise approach
Directly optimize for a rank-based metric, such as NDCG. This is difficult because these metrics are often not differentiable w.r.t. model parameters.
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Pointwise objectives
Regression loss
Given $\langle q, d \rangle$ predict the value of $y_{q,d}$
e.g., square loss for binary or categorical labels,
$\mathcal{L}_{squared} = \| y_{q,d} - f(x_{q,d}) \|^2$
where, $y_{q,d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
[Figure: labels and prediction]
Pointwise objectives
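Classification loss
Given $\langle q, d \rangle$ predict the class $y_{q,d}$
e.g., cross-entropy with softmax over categorical labels $Y$ [Li et al., 2008],
$\mathcal{L}_{CE}(q, d, y_{q,d}) = -\log p(y_{q,d} | q, d) = -\log \frac{e^{\gamma \cdot s_{y_{q,d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_y}}$
where, $s_{y_{q,d}}$ is the model's score for label $y_{q,d}$
[Figure: labels and prediction]
Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.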
Pairwise objectives
Given $\langle q, d_i, d_j \rangle$, predict the more relevant document
For $\langle q, d_i \rangle$ and $\langle q, d_j \rangle$,
Feature vectors: $x_i$ and $x_j$
Model scores: $s_i = f(x_i)$ and $s_j = f(x_j)$
Pairwise loss generally has the following form [Chen et al., 2009],
$\mathcal{L}_{pairwise} = \phi(s_i - s_j)$
where, $\phi$ can be,
• Hinge function $\phi(z) = \max(0, 1 - z)$ [Herbrich et al., 2000]
• Exponential function $\phi(z) = e^{-z}$ [Freund et al., 2003]
• Logistic function $\phi(z) = \log(1 + e^{-z})$ [Burges et al., 2005]
• Others…
Pairwise loss minimizes the average number of inversions in ranking, i.e., $d_i \succ d_j$ w.r.t. $q$ but $d_j$ is ranked higher than $d_i$
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
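The three choices of $\phi$ listed on this slide can be sketched as plain functions applied to the score difference $s_i - s_j$, where $d_i$ is the document that should be ranked higher. The function names are illustrative.

```python
import math

def hinge(z):
    """Hinge: zero loss once the preferred document leads by a margin of 1."""
    return max(0.0, 1.0 - z)

def exponential(z):
    return math.exp(-z)

def logistic(z):
    return math.log1p(math.exp(-z))   # log(1 + e^-z)

def pairwise_loss(s_i, s_j, phi):
    """Loss for a pair where d_i is preferred over d_j."""
    return phi(s_i - s_j)

print(pairwise_loss(2.0, 0.5, hinge))   # 0.0: correct order with enough margin
print(pairwise_loss(0.5, 2.0, hinge))   # 2.5: inverted pair is penalized
print(pairwise_loss(1.0, 1.0, logistic))
```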
Pairwise objectives
RankNet loss
Pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]
Predicted probabilities: $p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma \cdot (s_i - s_j)}}$
Desired probabilities: $\bar{p}_{ij} = 1$ and $\bar{p}_{ji} = 0$
Computing cross-entropy between $\bar{p}$ and $p$,
$\mathcal{L}_{RankNet} = -\bar{p}_{ij} \cdot \log(p_{ij}) - \bar{p}_{ji} \cdot \log(p_{ji}) = -\log(p_{ij}) = \log(1 + e^{-\gamma \cdot (s_i - s_j)})$
[Figure: pairwise preference and score]
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
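A short sketch of the RankNet quantities for one document pair, assuming $d_i$ is the preferred document. The gradient w.r.t. $s_i$ follows from differentiating the loss and is included because later slides build on it; function names and the default $\gamma = 1$ are illustrative.

```python
import math

def ranknet_prob(s_i, s_j, gamma=1.0):
    """Predicted probability that d_i should be ranked above d_j."""
    return 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))

def ranknet_loss(s_i, s_j, gamma=1.0):
    """Cross entropy against the desired probability of 1: log(1 + e^(-gamma*(s_i - s_j)))."""
    return math.log1p(math.exp(-gamma * (s_i - s_j)))

def ranknet_grad_si(s_i, s_j, gamma=1.0):
    """Derivative of the loss w.r.t. s_i, equal to -gamma * (1 - p_ij)."""
    return -gamma * (1.0 - ranknet_prob(s_i, s_j, gamma))

print(ranknet_prob(1.0, 0.2), ranknet_loss(1.0, 0.2), ranknet_grad_si(1.0, 0.2))
```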
A generalized cross-entropy loss
An alternative loss function assumes a single relevant document $d^+$ and compares it against the full collection $D$
Predicted probabilities: $p(d^+ | q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$
The cross-entropy loss is then given by,
$\mathcal{L}_{CE}(q, d^+, D) = -\log p(d^+ | q) = -\log \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$
Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider only a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
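A sketch of the loss when the softmax is taken over the relevant document plus a handful of negative candidates rather than the full collection. How the negatives are chosen is outside the scope of this sketch; the function name is illustrative.

```python
import math

def sampled_ce_loss(pos_score, neg_scores, gamma=1.0):
    """-log softmax probability of the relevant document among the candidates."""
    scores = [gamma * pos_score] + [gamma * s for s in neg_scores]
    m = max(scores)                                    # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - gamma * pos_score

print(sampled_ce_loss(0.0, [0.0, 0.0, 0.0]))   # log(4): all candidates tie
print(sampled_ce_loss(3.0, [0.5, 0.1, -1.0]))  # smaller: the relevant document leads
```

With a single negative candidate this expression coincides with the RankNet loss $\log(1 + e^{-\gamma \cdot (s_i - s_j)})$.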
Listwise objectives
[Figure: two example rankings (left and right). Blue: relevant, Gray: non-relevant] [Burges, 2010]
NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors
Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks
But listwise metrics are non-continuous and non-differentiable
Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
Listwise objectives
LambdaRank loss
Burges et al. [2006] make two observations:
1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t. model scores)
2. It is desired that the gradient be bigger for pairs of documents that produce a bigger impact in NDCG by swapping positions
Multiply actual gradients with the change in NDCG by swapping the rank positions of the two documents
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
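A hedged sketch of the LambdaRank idea for one document pair: take the RankNet gradient w.r.t. $s_i$ and scale it by the absolute change in NDCG caused by swapping the two documents. It assumes `scores` and `rels` are listed in the current ranked order and that positions are 0-based; these conventions and the function names are illustrative.

```python
import math

def dcg(rels):
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels, start=1))

def delta_ndcg(rels, i, j):
    """|change in NDCG| from swapping the documents at 0-based positions i and j."""
    ideal = dcg(sorted(rels, reverse=True))
    if ideal == 0:
        return 0.0
    swapped = list(rels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(rels)) / ideal

def lambda_ij(scores, rels, i, j, gamma=1.0):
    """RankNet gradient w.r.t. s_i (d_i preferred over d_j), scaled by |delta NDCG|."""
    ranknet_grad = -gamma / (1.0 + math.exp(gamma * (scores[i] - scores[j])))
    return ranknet_grad * delta_ndcg(rels, i, j)

scores = [3.0, 2.0, 1.0, 0.5]   # current model scores, best first
rels = [0, 1, 0, 2]             # the most relevant document is ranked last
print(lambda_ij(scores, rels, 3, 0))  # large magnitude: swap with the top position
print(lambda_ij(scores, rels, 3, 2))  # smaller magnitude: swap with a nearby position
```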
Listwise objectives
According to the Luce model [Luce, 2005], given four items $\{d_1, d_2, d_3, d_4\}$ the probability of observing a particular rank-order, say $[d_2, d_1, d_4, d_3]$, is given by:
$p(\pi | s) = \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_4)}{\phi(s_3) + \phi(s_4)}$
where, $\pi$ is a particular permutation and $\phi$ is a transformation (e.g., linear, exponential, or sigmoid) over the score $s_i$ corresponding to item $d_i$
ListNet loss
Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model score and ground-truth labels. The loss is then given by the K-L divergence between these two distributions.
This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible.
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
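The Luce-model probability and the two losses built on it can be sketched as follows. Two simplifying assumptions are made: $\phi$ is the exponential function, and the ListNet variant compares only top-one probabilities (the K = 1 case) rather than full permutation distributions. Function names are illustrative.

```python
import math

def luce_permutation_prob(scores, order, phi=math.exp):
    """Probability of observing `order` (item indices, best first) under the Luce model."""
    prob = 1.0
    remaining = list(order)
    for idx in order:
        denom = sum(phi(scores[k]) for k in remaining)
        prob *= phi(scores[idx]) / denom
        remaining.remove(idx)             # the chosen item leaves the pool
    return prob

def listmle_loss(scores, ideal_order):
    """Negative log-probability of the ideal permutation."""
    return -math.log(luce_permutation_prob(scores, ideal_order))

def top_one_probs(values, phi=math.exp):
    total = sum(phi(v) for v in values)
    return [phi(v) / total for v in values]

def listnet_top1_kl(scores, labels):
    """K-L divergence between top-one distributions from labels and from scores."""
    p = top_one_probs(labels)
    q = top_one_probs(scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Order [d2, d1, d4, d3] expressed with 0-based indices; equal scores give 1/24.
print(luce_permutation_prob([0.0, 0.0, 0.0, 0.0], [1, 0, 3, 2]))
print(listmle_loss([2.0, 1.0, 0.0], [0, 1, 2]))
print(listnet_top1_kl([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```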
Listwise objectives
Smooth DCG
Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores
This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
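One way to illustrate the idea, which is not necessarily the exact formulation of Wu et al. [2009], is to approximate the rank of a document by one plus a sum of sigmoids over score differences. The `temperature` parameter and the function names are assumptions of this sketch.

```python
import math

def smooth_rank(scores, i, temperature=1.0):
    """Differentiable stand-in for the rank of item i: 1 + soft count of items scored higher."""
    return 1.0 + sum(
        1.0 / (1.0 + math.exp(-(s_j - scores[i]) / temperature))
        for j, s_j in enumerate(scores) if j != i
    )

def smooth_dcg(scores, rels, temperature=1.0):
    """DCG computed with smooth ranks in place of the hard rank positions."""
    return sum(
        (2 ** rel - 1) / math.log2(smooth_rank(scores, i, temperature) + 1.0)
        for i, rel in enumerate(rels)
    )

scores = [3.0, 1.0, 2.0]
print([smooth_rank(scores, i, temperature=0.01) for i in range(3)])  # near [1, 3, 2]
print(smooth_dcg(scores, [2, 0, 1], temperature=0.01))
```

With a small temperature the smooth rank approaches the true rank; larger values trade fidelity for smoother gradients.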
Questions?
@UnderdogGeek bmitra@microsoft.com

Contenu connexe

Tendances

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningDr. Radhey Shyam
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLHimadri Mishra
 
The Evolution of AutoML
The Evolution of AutoMLThe Evolution of AutoML
The Evolution of AutoMLNing Jiang
 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learningRidge-i, Inc.
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningﺁﺻﻒ ﻋﻠﯽ ﻣﯿﺮ
 
GANs Presentation.pptx
GANs Presentation.pptxGANs Presentation.pptx
GANs Presentation.pptxMAHMOUD729246
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingSebastian Ruder
 
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)Thomas da Silva Paula
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learningsafa cimenli
 
Meta learning with memory augmented neural network
Meta learning with memory augmented neural networkMeta learning with memory augmented neural network
Meta learning with memory augmented neural networkKaty Lee
 
05 Classification And Prediction
05   Classification And Prediction05   Classification And Prediction
05 Classification And PredictionAchmad Solichin
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Hayim Makabee
 
Machine Learning and Applications
Machine Learning and ApplicationsMachine Learning and Applications
Machine Learning and ApplicationsGeeta Arora
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patternsKrish_ver2
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
 
Clustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesClustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesKush Kulshrestha
 

Tendances (20)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
The Evolution of AutoML
The Evolution of AutoMLThe Evolution of AutoML
The Evolution of AutoML
 
CatBoost intro
CatBoost   introCatBoost   intro
CatBoost intro
 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learning
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learning
 
GANs Presentation.pptx
GANs Presentation.pptxGANs Presentation.pptx
GANs Presentation.pptx
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs) A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
A (Very) Gentle Introduction to Generative Adversarial Networks (a.k.a GANs)
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
Meta learning with memory augmented neural network
Meta learning with memory augmented neural networkMeta learning with memory augmented neural network
Meta learning with memory augmented neural network
 
05 Classification And Prediction
05   Classification And Prediction05   Classification And Prediction
05 Classification And Prediction
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
 
Machine Learning and Applications
Machine Learning and ApplicationsMachine Learning and Applications
Machine Learning and Applications
 
Birch
BirchBirch
Birch
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Clustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesClustering - Machine Learning Techniques
Clustering - Machine Learning Techniques
 

Similaire à Neural Learning to Rank

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politècnica de Catalunya
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptxPrabhuSelvaraj15
 
Lesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfLesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfssuser7f0b19
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptxEmanAl15
 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesArvind Rapaka
 
Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxPlacementsBCA
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdfMariaKhan905189
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonSimone Piunno
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习AdaboostShocky1
 
ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfTigabu Yaya
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 

Similaire à Neural Learning to Rank (20)

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptx
 
Lesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfLesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdf
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptx
 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial Usecases
 
Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptx
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
 
Xgboost
XgboostXgboost
Xgboost
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdf
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 

Plus de Bhaskar Mitra

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
 
What’s next for deep learning for Search?
What’s next for deep learning for Search?What’s next for deep learning for Search?
What’s next for deep learning for Search?Bhaskar Mitra
 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...Bhaskar Mitra
 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Bhaskar Mitra
 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressBhaskar Mitra
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcomeBhaskar Mitra
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Bhaskar Mitra
 

 


Neural Learning to Rank

  • 1. Neural Learning to Rank Bhaskar Mitra Principal Applied Scientist, Microsoft PhD candidate, University College London @UnderdogGeek
  • 2. Topics: a quick recap of neural networks; the fundamentals of learning to rank
  • 3. Reading material An Introduction to Neural Information Retrieval Foundations and Trends® in Information Retrieval (December 2018) Download PDF: http://bit.ly/fntir-neural
  • 4. Most information retrieval (IR) systems present a ranked list of retrieved artifacts
  • 5. Learning to Rank (LTR) “... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 7. A quick recap of neural networks
  • 8. Vectors, matrices, and tensors. Operations illustrated: matrix transpose, matrix addition, dot product, matrix multiplication. Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66 Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
  • 10. Supervised learning Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
  • 11. Neural networks. Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ). Popular choices for σ: Tanh, ReLU. Parameters trained using backpropagation. E2E training over millions of samples in batched mode. Many choices of architecture and hyper-parameters. Diagram: Input → Linear transform → Non-linearity → Linear transform → Non-linearity → Predicted output (forward pass); the loss compares the predicted output with the expected output and drives the backward pass.
  • 13. Squared loss The squared loss is a popular loss function for regression tasks
  • 14. The softmax function In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
  • 15. Cross entropy. The cross entropy between two probability distributions $p$ and $q$ over a discrete set of events is given by $CE(p, q) = -\sum_i p_i \log q_i$. If $p_{correct} = 1$ and $p_i = 0$ for all other values of $i$, then $CE(p, q) = -\log q_{correct}$.
  • 16. Cross entropy with softmax loss Cross entropy with softmax is a popular loss function for classification
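The three loss-related functions recapped in the slides above (squared loss, softmax, and cross entropy) can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk; the function names are assumptions chosen for readability, and subtracting the maximum score inside the softmax is a standard numerical-stability trick rather than something the slides state.

```python
import math

def squared_loss(y_true, y_pred):
    # Regression loss: squared difference between expected and predicted output.
    return (y_true - y_pred) ** 2

def softmax(scores):
    # Normalize raw scores into a probability distribution over classes.
    # Subtracting the max does not change the result but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    # CE(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing.
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def softmax_cross_entropy(scores, correct):
    # With a one-hot target, cross entropy reduces to -log q_correct.
    return -math.log(softmax(scores)[correct])

scores = [1.0, 2.0, 3.0]            # hypothetical model outputs for 3 classes
probs = softmax(scores)
loss_general = cross_entropy([0.0, 0.0, 1.0], probs)
loss_one_hot = softmax_cross_entropy(scores, correct=2)
```

Both `loss_general` and `loss_one_hot` compute the same quantity, which is the simplification the cross-entropy slide describes for a one-hot target.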
  • 17. Gradient Descent. We are given training data: $\langle x, y \rangle$ pairs, where $x$ is input and $y$ is expected output. Step 1: Define model and randomly initialize learnable model parameters. Step 2: Given $x$, compute model output. Step 3: Given model output and $y$, compute loss $l$. Step 4: Compute gradient $\frac{\partial l}{\partial w}$ of loss $l$ w.r.t. each parameter $w$. Step 5: Update parameter as $w_{new} = w_{old} - \eta \times \frac{\partial l}{\partial w}$, where $\eta$ is learning rate. Step 6: Go back to step 2 and repeat till convergence.
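The six steps above can be sketched for the simplest possible model, a single weight with prediction `w * x` and squared loss. The model, the toy data, the learning rate, and the fixed number of epochs (standing in for a convergence check) are all assumptions made for illustration.

```python
import random

def train(data, eta=0.05, epochs=200, seed=0):
    # Step 1: define the model (y_hat = w * x) and randomly initialize w.
    w = random.Random(seed).uniform(-1.0, 1.0)
    for _ in range(epochs):                 # Step 6: repeat (fixed budget here)
        for x, y in data:
            y_hat = w * x                   # Step 2: compute model output
            # Step 3 defines the loss l = (y - y_hat)^2.
            grad = -2.0 * (y - y_hat) * x   # Step 4: dl/dw by the chain rule
            w = w - eta * grad              # Step 5: gradient descent update
    return w

# Toy data generated by y = 2x, so the learned weight should approach 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
```

The update is applied after every sample here; accumulating gradients over a batch before updating is the batched variant mentioned earlier in the deck.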
  • 18. Gradient Descent. Goal: iteratively update the learnable parameters such that the loss $l$ is minimized. Task: regression. Training data: $\langle x, y \rangle$ pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node) with $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, and $l = (y - y_2)^2$. Learnable parameters: $w_1, b_1, w_2, b_2$. Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$): $\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$. Update the parameter value based on the gradient with $\eta$ as the learning rate: $w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$ …and repeat.
  • 19. Gradient Descent: $\frac{\partial l}{\partial w_1} = \frac{\partial (y - y_2)^2}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
  • 20. Gradient Descent: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
  • 21. Gradient Descent: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial \tanh(w_2 \cdot y_1 + b_2)}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$
  • 22. Gradient Descent: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial y_1}{\partial w_1}$
  • 23. Gradient Descent: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial \tanh(w_1 \cdot x + b_1)}{\partial w_1}$
  • 24. Gradient Descent: $\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \left(1 - \tanh^2(w_1 \cdot x + b_1)\right) \times x$
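The chain-rule derivation built up across the slides above can be checked numerically. The sketch below implements the same two-layer tanh model, computes the gradient w.r.t. w1 from the three chain-rule factors, and compares it with a central finite-difference estimate. The parameter values and the input are arbitrary assumptions; note that the outer tanh derivative is evaluated at `w2 * y1 + b2`, the argument of the second layer.

```python
import math

def forward(x, w1, b1, w2, b2):
    y1 = math.tanh(w1 * x + b1)      # hidden node
    y2 = math.tanh(w2 * y1 + b2)     # model output
    return y1, y2

def loss(x, y, w1, b1, w2, b2):
    _, y2 = forward(x, w1, b1, w2, b2)
    return (y - y2) ** 2

def grad_w1(x, y, w1, b1, w2, b2):
    y1, y2 = forward(x, w1, b1, w2, b2)
    dl_dy2 = -2.0 * (y - y2)            # d (y - y2)^2 / d y2
    dy2_dy1 = (1.0 - y2 ** 2) * w2      # (1 - tanh^2(w2*y1 + b2)) * w2
    dy1_dw1 = (1.0 - y1 ** 2) * x       # (1 - tanh^2(w1*x + b1)) * x
    return dl_dy2 * dy2_dy1 * dy1_dw1

# Hypothetical sample and parameter values, chosen only for the check.
x, y = 0.5, 0.3
params = dict(w1=0.8, b1=-0.1, w2=-0.6, b2=0.2)

analytic = grad_w1(x, y, **params)

# Central finite difference on w1 as an independent estimate of dl/dw1.
eps = 1e-6
plus = dict(params, w1=params["w1"] + eps)
minus = dict(params, w1=params["w1"] - eps)
numeric = (loss(x, y, **plus) - loss(x, y, **minus)) / (2 * eps)
```

If the derivation is right, `analytic` and `numeric` agree to several decimal places; the same pattern extends to b1, w2, and b2.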
  • 25. Exercise: Simple Neural Network from Scratch. Implement a simple multi-layer neural network with a single input feature, a single output, and a single neuron per layer using (i) PyTorch and (ii) from scratch, and demonstrate that both approaches produce identical outcomes. https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
  • 26. Computation Networks The “Lego” approach to specifying neural architectures Library of neural layers, each layer defines logic for: 1. Forward pass: compute layer output given layer input 2. Backward pass: a) compute gradient of layer output w.r.t. layer inputs b) compute gradient of layer output w.r.t. layer parameters (if any) Chain nodes to create bigger and more complex networks
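A minimal sketch of the "Lego" idea, assuming scalar inputs and two hypothetical layer classes (`Linear` and `Tanh`): each layer knows its own forward pass and how to turn the gradient arriving from the layer above into gradients for its input and its parameters. This is an illustration in plain Python, not the API of any particular library.

```python
import math

class Linear:
    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, x):
        self.x = x                        # cache input for the backward pass
        return self.w * x + self.b

    def backward(self, grad_out):
        self.grad_w = grad_out * self.x   # gradient w.r.t. layer parameters
        self.grad_b = grad_out
        return grad_out * self.w          # gradient w.r.t. layer input

class Tanh:
    def forward(self, x):
        self.out = math.tanh(x)
        return self.out

    def backward(self, grad_out):
        return grad_out * (1.0 - self.out ** 2)

# Chain layers to build the two-layer tanh network from the earlier slides.
layers = [Linear(0.8, -0.1), Tanh(), Linear(-0.6, 0.2), Tanh()]

def run_forward(x):
    for layer in layers:
        x = layer.forward(x)
    return x

def run_backward(grad):
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return grad

x, y = 0.5, 0.3                      # hypothetical training sample
y2 = run_forward(x)
run_backward(-2.0 * (y - y2))        # gradient of the squared loss w.r.t. y2
```

After the backward pass, `layers[0].grad_w` holds the gradient of the loss w.r.t. the first weight, obtained without writing out the full chain-rule expression by hand.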
  • 27. Why adding depth helps http://playground.tensorflow.org
  • 29. Bias-variance trade-off in the deep learning era Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
  • 30. The lottery ticket hypothesis. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
  • 33. Problem formulation. LTR models represent a rankable item (e.g., a document or a movie or a song) given some context (e.g., a user-issued query or user’s historical interactions with other items) as a numerical vector $x \in \mathbb{R}^n$. The ranking model $f: x \to \mathbb{R}$ is trained to map the vector to a real-valued score such that relevant items are scored higher.
  • 34. Why is ranking challenging? Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable. Examples of ranking metrics: Discounted Cumulative Gain (DCG): $DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$. Reciprocal Rank (RR): $RR@k = \max_{1 \le i \le k} \frac{rel_i}{i}$.
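The two metrics above are easy to compute once the relevance labels are listed in ranked order, and a small experiment shows why they are hard to optimize directly. In the sketch below, the helper names, scores, and labels are illustrative assumptions; RR is written for binary labels.

```python
import math

def dcg_at_k(rels, k):
    # rels: relevance labels in ranked order (position 1 first).
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    # With binary labels this is 1 / (rank of the first relevant item).
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)),
               default=0.0)

def ranked_labels(scores, labels):
    # Sort items by descending model score and return their labels.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

labels = [1, 0, 0]   # only the first item is relevant

# Moving the second score closer to the first leaves the metric unchanged...
a = dcg_at_k(ranked_labels([0.50, 0.490, 0.1], labels), 3)
b = dcg_at_k(ranked_labels([0.50, 0.499, 0.1], labels), 3)
# ...until the order flips, at which point the metric jumps.
c = dcg_at_k(ranked_labels([0.50, 0.510, 0.1], labels), 3)
```

Because the metric is piecewise constant in the scores, its gradient w.r.t. the model parameters is zero almost everywhere and undefined at the jumps.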
  • 35. Features. Traditional L2R models employ hand-crafted features that encode IR insights. They can often be categorized as: query-independent or static features (e.g., incoming link count and document length); query-dependent or dynamic features (e.g., BM25); and query-level features (e.g., query length).
  • 36. Features Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
  • 37. Approaches. Liu [2009] categorizes different LTR approaches based on training objectives. Pointwise approach: Relevance label $y_{q,d}$ is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict $y_{q,d}$ given $x_{q,d}$. Pairwise approach: Pairwise preference between documents for a query ($d_i \succ d_j$ w.r.t. $q$) as label. Reduces to binary classification to predict the more relevant document. Listwise approach: Directly optimize for a rank-based metric, such as NDCG; this is difficult because these metrics are often not differentiable w.r.t. model parameters. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
  • 38. Pointwise objectives: Regression loss. Given $\langle q, d \rangle$, predict the value of $y_{q,d}$, e.g., square loss for binary or categorical labels, $\mathcal{L}_{squared} = \lVert y_{q,d} - f(x_{q,d}) \rVert^2$, where $y_{q,d}$ is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  • 39. Pointwise objectives: Classification loss. Given $\langle q, d \rangle$, predict the class $y_{q,d}$, e.g., cross-entropy with softmax over categorical labels $Y$ [Li et al., 2008], $\mathcal{L}_{classification} = -\log p(y_{q,d} \mid q, d) = -\log \frac{e^{s_{y_{q,d}}}}{\sum_{y \in Y} e^{s_y}}$, where $s_{y_{q,d}}$ is the model’s score for label $y_{q,d}$. Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
  • 40. Pairwise objectives. Given $\langle q, d_i, d_j \rangle$, predict the more relevant document. For $\langle q, d_i \rangle$ and $\langle q, d_j \rangle$, feature vectors: $x_i$ and $x_j$; model scores: $s_i = f(x_i)$ and $s_j = f(x_j)$. Pairwise loss generally has the following form [Chen et al., 2009], $\mathcal{L}_{pairwise} = \phi(s_i - s_j)$, where $\phi$ can be, • Hinge function $\phi(z) = \max(0, 1 - z)$ [Herbrich et al., 2000] • Exponential function $\phi(z) = e^{-z}$ [Freund et al., 2003] • Logistic function $\phi(z) = \log(1 + e^{-z})$ [Burges et al., 2005] • Others… Pairwise loss minimizes the average number of inversions in ranking, i.e., $d_i \succ d_j$ w.r.t. $q$ but $d_j$ is ranked higher than $d_i$. Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
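The three choices of φ listed above can be compared side by side. In this sketch the loss is φ applied to the score difference between the preferred document and the other one; the function names and the example scores are assumptions.

```python
import math

def hinge(z):
    return max(0.0, 1.0 - z)

def exponential(z):
    return math.exp(-z)

def logistic(z):
    return math.log(1.0 + math.exp(-z))

def pairwise_loss(s_i, s_j, phi):
    # s_i: score of the document that should rank higher (d_i over d_j).
    return phi(s_i - s_j)

# Hypothetical scores: a correctly ordered pair and an inverted pair.
correct = {phi.__name__: pairwise_loss(2.0, 0.5, phi)
           for phi in (hinge, exponential, logistic)}
inverted = {phi.__name__: pairwise_loss(0.5, 2.0, phi)
            for phi in (hinge, exponential, logistic)}
```

All three functions are decreasing in the score difference, so an inverted pair is always penalized more than a correctly ordered one; they differ in how quickly the penalty vanishes once the pair is ordered with a margin.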
  • 41. Pairwise objectives: RankNet loss. Pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]. Predicted probabilities: $p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma \cdot (s_i - s_j)}}$. Desired probabilities: $\bar{p}_{ij} = 1$ and $\bar{p}_{ji} = 0$. Computing cross-entropy between $\bar{p}$ and $p$: $\mathcal{L}_{RankNet} = -\bar{p}_{ij} \cdot \log p_{ij} - \bar{p}_{ji} \cdot \log p_{ji} = -\log p_{ij} = \log\left(1 + e^{-\gamma \cdot (s_i - s_j)}\right)$. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
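The RankNet quantities on this slide can be written down directly. The sketch below computes the predicted probability, the loss, and the gradient of the loss w.r.t. the score of the preferred document; the gradient expression follows from differentiating the loss and is not shown on the slide. Function names and example scores are assumptions.

```python
import math

def ranknet_prob(s_i, s_j, gamma=1.0):
    # p_ij: predicted probability that d_i should rank above d_j.
    return 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))

def ranknet_loss(s_i, s_j, gamma=1.0):
    # Cross entropy against the desired probabilities (1 for ij, 0 for ji).
    return math.log(1.0 + math.exp(-gamma * (s_i - s_j)))

def ranknet_grad_si(s_i, s_j, gamma=1.0):
    # d loss / d s_i = -gamma * (1 - p_ij); the gradient w.r.t. s_j is its negation.
    return -gamma * (1.0 - ranknet_prob(s_i, s_j, gamma))

s_i, s_j = 1.2, 0.4          # hypothetical model scores
p_ij = ranknet_prob(s_i, s_j)
p_ji = ranknet_prob(s_j, s_i)
loss = ranknet_loss(s_i, s_j)
grad = ranknet_grad_si(s_i, s_j)
```

The gradient shrinks towards zero as the pair becomes confidently ordered, so training effort concentrates on pairs that are inverted or nearly tied.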
  • 42. A generalized cross-entropy loss. An alternative loss function assumes a single relevant document $d^+$ and compares it against the full collection $D$. Predicted probabilities: $p(d^+ \mid q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$. The cross-entropy loss is then given by, $\mathcal{L}_{CE}(q, d^+, D) = -\log p(d^+ \mid q) = -\log \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$. Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
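A minimal sketch of this loss with a handful of sampled negatives in place of the full collection, assuming the candidate set always contains the positive document. The function name, the scores, and the log-sum-exp formulation (used only for numerical stability) are assumptions for illustration.

```python
import math

def sampled_ce_loss(s_pos, s_negs, gamma=1.0):
    # -log softmax probability of the positive document among the candidates
    # (the positive plus the sampled negatives), computed via log-sum-exp.
    logits = [gamma * s_pos] + [gamma * s for s in s_negs]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[0]

# Hypothetical scores for one positive and a few sampled negatives.
loss_one_negative = sampled_ce_loss(1.2, [0.4])
loss_four_negatives = sampled_ce_loss(1.2, [0.4, 0.1, -0.3, 0.9])
```

With exactly one negative candidate the expression reduces to log(1 + exp(-γ(s⁺ − s⁻))), which is the RankNet loss, so this objective can be read as a multi-negative generalization of the pairwise case.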
  • 43. Listwise objectives. Figure legend: blue = relevant, gray = non-relevant. NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors. Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks. But listwise metrics are non-continuous and non-differentiable [Burges, 2010]. Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
  • 44. Listwise objectives: LambdaRank loss. Burges et al. [2006] make two observations: 1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t. model scores). 2. It is desired that the gradient be bigger for pairs of documents that produce a bigger impact in NDCG by swapping positions. LambdaRank therefore multiplies the actual gradients by the change in NDCG obtained by swapping the rank positions of the two documents. Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
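One way to sketch the LambdaRank idea: take the RankNet gradient for each preference pair and scale it by the absolute change in NDCG that swapping the two documents would cause. The helper names, the exponential gain and log discount (matching the DCG definition earlier in the deck), and the example list are assumptions; this is an illustrative reading of the two observations, not the authors' reference implementation.

```python
import math

def dcg(labels_in_rank_order):
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(labels_in_rank_order, start=1))

def lambdas(scores, labels, gamma=1.0):
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    rank = {doc: pos for pos, doc in enumerate(order, start=1)}
    ideal = dcg(sorted(labels, reverse=True))
    lam = [0.0] * n                      # gradient of the cost w.r.t. each score
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue                 # only pairs where d_i should outrank d_j
            # RankNet gradient w.r.t. s_i for this pair.
            pair_grad = -gamma / (1.0 + math.exp(gamma * (scores[i] - scores[j])))
            # |change in NDCG| if the two documents swapped rank positions.
            gain_diff = (2 ** labels[i] - 1) - (2 ** labels[j] - 1)
            disc_diff = 1.0 / math.log2(rank[i] + 1) - 1.0 / math.log2(rank[j] + 1)
            delta_ndcg = abs(gain_diff * disc_diff) / ideal
            lam[i] += pair_grad * delta_ndcg
            lam[j] -= pair_grad * delta_ndcg
    return lam

# Hypothetical list: the only relevant document is currently ranked last.
lam = lambdas(scores=[2.0, 1.0, 0.0], labels=[0, 0, 1])
```

A gradient descent step `s -= eta * lam` pushes the relevant document up, and the non-relevant document at rank 1 receives a larger push down than the one at rank 2, because swapping with the top position changes NDCG the most.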
  • 45. Listwise objectives. According to the Luce model [Luce, 1959], given four items $\{d_1, d_2, d_3, d_4\}$ the probability of observing a particular rank-order, say $[d_2, d_1, d_4, d_3]$, is given by: $p(\pi \mid s) = \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_1)}{\phi(s_1) + \phi(s_3) + \phi(s_4)} \times \frac{\phi(s_4)}{\phi(s_3) + \phi(s_4)}$, where $\pi$ is a particular permutation and $\phi$ is a transformation (e.g., linear, exponential, or sigmoid) over the score $s_i$ corresponding to item $d_i$. ListNet loss: Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model score and ground-truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss: Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible. R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
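The Luce model probability for a rank-order can be computed by walking down the ranking and, at each position, dividing the transformed score of the chosen item by the sum over the items not yet placed. The sketch below assumes the exponential transformation for φ and uses made-up scores; the ListMLE-style loss is then the negative log-probability of the ideal permutation.

```python
import math
from itertools import permutations

def luce_probability(scores_in_order, phi=math.exp):
    # scores_in_order: model scores listed in the rank-order being evaluated.
    prob = 1.0
    for i in range(len(scores_in_order)):
        remaining = scores_in_order[i:]          # items not yet placed
        prob *= phi(scores_in_order[i]) / sum(phi(s) for s in remaining)
    return prob

def listmle_loss(scores_in_ideal_order, phi=math.exp):
    # Negative log-likelihood of the ideal (ground-truth) permutation.
    return -math.log(luce_probability(scores_in_ideal_order, phi))

# Hypothetical scores for the four items in the slide's example.
scores = {"d1": 0.2, "d2": 1.5, "d3": -0.3, "d4": 0.7}
p = luce_probability([scores[d] for d in ("d2", "d1", "d4", "d3")])

# Summing over all 4! = 24 rank-orders shows this is a proper distribution.
total = sum(luce_probability([scores[d] for d in perm])
            for perm in permutations(scores))
```

Enumerating all permutations is only feasible for tiny lists (n! grows very quickly), which is the cost issue the ListNet description points to; ListMLE sidesteps it by evaluating a single permutation.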
  • 46. Listwise objectives: Smooth DCG. Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores. This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss. Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
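To make the idea of a "smooth" rank concrete, the sketch below uses a sigmoid-based approximation: each document's rank is one plus a soft count of the documents scored above it. This particular smoothing is an assumption chosen for simplicity and is not necessarily the formulation of Wu et al., whose paper title refers to smoothed hinge functions; the temperature parameter and function names are likewise illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smooth_ranks(scores, temperature=1.0):
    # rank_i ~= 1 + sum over j != i of sigmoid((s_j - s_i) / temperature).
    # As temperature -> 0 the sigmoid becomes a step and the true rank is recovered.
    return [1.0 + sum(sigmoid((s_j - s_i) / temperature)
                      for j, s_j in enumerate(scores) if j != i)
            for i, s_i in enumerate(scores)]

def smooth_dcg(scores, labels, temperature=1.0):
    # Plug the smooth rank into the DCG formula in place of the integer rank.
    ranks = smooth_ranks(scores, temperature)
    return sum((2 ** rel - 1) / math.log2(1.0 + r)
               for rel, r in zip(labels, ranks))

sharp = smooth_ranks([3.0, 1.0, 2.0], temperature=0.01)   # close to [1, 3, 2]
higher = smooth_dcg([0.1, 0.0], labels=[1, 0])
lower = smooth_dcg([0.0, 0.1], labels=[1, 0])
```

Unlike the exact metric, this quantity changes continuously as scores move, so its negative can be minimized with gradient descent.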