Improving search with neural ranking methods

  1. Improving search with neural ranking methods Andrew Yates Assistant Professor, UvA @andrewyates
  2. About me Assistant Professor at University of Amsterdam (2021-) • Information retrieval, neural networks, natural language processing, (consumer) medical domain, personal knowledge graphs, … Previously • Senior Researcher (2018-2021) at Max Planck Institute for Informatics; Postdoc at MPII before that (2016-2018) • PhD in Computer Science from Georgetown University, Washington, DC • BSc in Computer Science from Illinois Institute of Technology, Chicago 2
  3. 3 Book on Neural IR (link on final slide) Podcast on Neural IR Available: Spotify, iTunes, …
  4. 4 Students and collaborators: Siddhant Arora, Arman Cohan, Jeffrey Dalton, Doug Downey, Sergey Feldman, Nazli Goharian, Kevin Martin Jose, Jimmy Lin, Sean MacAvaney, Thong Nguyen, Wei Yang, Xinyu Zhang, … Capreolus: A Toolkit for End-to-End Neural Ad Hoc Retrieval
  5. Outline • About me • What is NIR? • The good, the bad, and the ugly 5
  6. Neural? IR? Focus on ad hoc retrieval, better known as “googling something” ➔ We all do this a lot, and it works incredibly well ➔ Given an arbitrary query, return ranked list of relevant “documents” 6 [Diagram: a query (“black bear attacks”) is run against a collection by a ranking method, producing a ranked list of documents that is scored by an evaluation metric (0.66)]
  7. Retrieval 101 • Crawl & index documents (text) • Query time: look for exact matches in documents, optionally combining with other signals (e.g., spam score) 7
  8. Retrieval 101 • Consider query terms that appear in document • Weight them by discriminativeness (inverse document frequency) and importance (term frequency) 8 [Annotated scoring formula: an IDF component times a TF component, with term saturation and length normalization; consider only terms in both query & document]
  9. Retrieval 101 9 • Consider query terms that appear in document • Weight them by discriminativeness (inverse document frequency) and importance (term frequency) • General formula: score(Q, D) = sum over query terms qi appearing in D of IDF(qi) × TF(qi, D)
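The annotated formula on the two slides above is not preserved in this transcript, but the labeled components (IDF, TF, term saturation, length normalization) suggest a BM25-style scorer. Below is a minimal sketch under that assumption; the function and variable names are illustrative and not taken from the slides.

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
        """Sum over terms in both query and document of IDF times a saturated, length-normalized TF."""
        tf = Counter(doc_terms)
        doc_len = len(doc_terms)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:  # consider only terms in both query & document
                continue
            n_q = doc_freqs.get(q, 0)  # number of documents containing q
            idf = math.log(1 + (num_docs - n_q + 0.5) / (n_q + 0.5))  # IDF component
            norm = k1 * (1 - b + b * doc_len / avg_doc_len)  # length normalization
            score += idf * (tf[q] * (k1 + 1)) / (tf[q] + norm)  # TF component with term saturation
        return score

    # Toy usage with a two-term query and made-up collection statistics.
    doc = "black bear attacks on humans are rare but black bears are common".split()
    print(bm25_score(["black", "bear"], doc, doc_freqs={"black": 30, "bear": 12},
                     num_docs=1000, avg_doc_len=12))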
  10. Neural retrieval? Neural refers to a type of machine learning method, which brings two key improvements 10
  11. Improvement #1: soft matching of terms Exact matching of Q and D terms Non-neural approaches: translation models, topic models like pLSI, ... 11
  12. Improvement #2: less handcrafting (learn how to score) [Formula components: Inverse Document Frequency of query term (collection stat); Term Frequency of q in document D (document stat)] 12
  13. Inverse Document Frequency of query term; Term Frequency of q in document D. With N = # docs and ni = # docs with term qi, IDF defined as: • log(N/ni) • log(1 + N/ni) • log((N - ni)/ni) • … 13
  14. Inverse Document Frequency of query term; Term Frequency of q in document D. TF defined as: • TF(qi, D) • 1 + log TF(qi, D) • 0.5 + 0.5 [TF(qi, D) / max TF(D)] • … 14
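A small sketch of the IDF and TF variants listed on the two slides above, written as plain Python functions. The function names are mine; the formulas are the ones just listed.

    import math

    def idf_variants(N, n_i):
        """IDF definitions from the slide: N = # docs, n_i = # docs containing the term."""
        return {
            "log(N/n_i)": math.log(N / n_i),
            "log(1 + N/n_i)": math.log(1 + N / n_i),
            "log((N - n_i)/n_i)": math.log((N - n_i) / n_i),
        }

    def tf_variants(tf_qi_D, max_tf_D):
        """TF definitions from the slide: raw, log-scaled, and max-normalized term frequency."""
        return {
            "TF(q_i, D)": tf_qi_D,
            "1 + log TF(q_i, D)": 1 + math.log(tf_qi_D) if tf_qi_D > 0 else 0,
            "0.5 + 0.5 [TF(q_i, D) / max TF(D)]": 0.5 + 0.5 * (tf_qi_D / max_tf_D),
        }

    print(idf_variants(N=1000, n_i=50))
    print(tf_variants(tf_qi_D=3, max_tf_D=7))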
  15. Soft matching Key point: neural methods replace exact matching with semantic matching With traditional methods, soft matching is possible (but it’s less effective) ● Stemming handles matches like singular vs. plural and different tenses swam, swimming, swim, swims replaced with swim ● Query expansion approaches handle context by adding related terms to query Query contains bank, referring to turning a plane Add context terms to query: flight, airplane, ailerons, spoilers, ... 15
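As a concrete illustration of the traditional soft-matching tools mentioned above, here is a tiny sketch using NLTK's Porter stemmer (my choice of tool; the slide does not name a specific stemmer). One caveat: rule-based stemmers conflate regular forms such as swims/swimming/swim, while an irregular form like swam typically needs lemmatization.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["swam", "swimming", "swim", "swims"]:
        # Porter maps swimming/swims/swim to "swim"; the irregular past tense
        # "swam" is left unchanged, which is where lemmatizers or embeddings help.
        print(word, "->", stemmer.stem(word))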
  16. Soft matching via embeddings 16 Source: https://medium.com/@hari4om/word-embedding-d816f643140 Neural approaches represent words as learned vectors
  17. Soft matching via embeddings 17 Source: https://medium.com/@hari4om/word-embedding-d816f643140 Vector “embedding” allows comparisons of different terms
  18. Soft matching via embeddings 18 Source: https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12 Embeddings capture a general notion of semantic similarity
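A minimal sketch of the comparison the embedding slides describe: represent words as vectors and compare them with cosine similarity. The tiny hand-made vectors below are purely illustrative; real systems would load pretrained embeddings such as word2vec or GloVe.

    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two word vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Illustrative 4-dimensional "embeddings" (made up for this example).
    vectors = {
        "king": np.array([0.9, 0.8, 0.1, 0.0]),
        "queen": np.array([0.9, 0.1, 0.8, 0.0]),
        "bank": np.array([0.1, 0.0, 0.1, 0.9]),
    }

    print(cosine(vectors["king"], vectors["queen"]))  # higher: related terms
    print(cosine(vectors["king"], vectors["bank"]))   # lower: unrelated terms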
  19. Embeddings can be learned automatically: “BERT”, a randomly-initialized Large Language Model (>100 million weights), is trained with random masking. [Diagram: the input “The bank makes loans [MASK] clients .” is represented as the sum of Token Embeddings, Segment Embeddings, and Position Embeddings; the model predicts the masked token over the vocabulary with a softmax; Loss = -log (P("to" | masked input))] 19 Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
  20. Embeddings can be learned automatically. [Same masked-language-modeling diagram as the previous slide: a randomly-initialized Large Language Model (>100 million weights) trained with random masking, Loss = -log (P("to" | masked input))] Advantages ● Automatically learn semantic representations of words! ● These embeddings capture general similarity Disadvantages ● These embeddings capture general similarity 20
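A short sketch of the masked-language-modeling objective shown on the two BERT slides above, using the Hugging Face transformers library (my choice of tooling, not something the slides specify): mask a token, read off the model's probability for the original token, and take the negative log as the loss.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The bank makes loans [MASK] clients.", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Position of the [MASK] token and the probability the model assigns to "to" there.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    p_to = probs[tokenizer.convert_tokens_to_ids("to")]
    print("loss = -log P('to' | masked input) =", -torch.log(p_to).item())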
  21. Neural IR: the good, the bad, and the ugly • The good • Big improvements in search quality • Key ingredient: putting words in context • Nice property: zero-shot ranking works • The bad • The ugly 21
  22. Search quality improvements: industry. Google Search: “We’re making a significant improvement to how we understand queries, representing the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search.” (source) MS Bing: “Starting from April of this year (2019), we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year.” (source) 22
  23. Search quality improvements: academia. [Chart: ranking effectiveness on the Robust04 dataset (news articles), improving across successive neural models. Source: Yang et al., SIGIR 2019; Yilmaz et al., 2019; MacAvaney et al., 2019; Li et al., 2020; Nogueira et al., 2020] 23
  24. Search quality improvements: academia. [Chart: MS MARCO Passage dataset (Web pages); Source: MS MARCO Leaderboard] 19 point improvement: BM25 vs. best neural. ~8 point improvement: neural pre-LLM vs. LLM. 24
  25. Neural IR: the good, the bad, and the ugly • The good • Big improvements in search quality • Key ingredient: putting words in context • Nice property: zero-shot ranking works • The bad • The ugly 25
  26. Key ingredient: contextualization. Modern approaches use an LLM (BERT) to create contextualized embeddings. [Figure: contextualized term embeddings from a relevant document vs. a nonrelevant document] 26 MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019.
  27. Key ingredient: contextualization Score can be produced by placing document terms in similarity buckets, then computing score based on the size of each bucket 27
    Query term | sim = 1.0 | 0.6 < sim < 1.0 | sim <= 0.6
    Tesla      | 4 terms   | 3 terms         | 20 terms
    net        | 1 term    | 8 terms         | 40 terms
    worth      | 2 terms   | 2 terms         | 15 terms
    Score(Tesla) = 4*a + 3*b + 20*c; Total score = Score(Tesla) + Score(net) + Score(worth)
    MacAvaney, Yates, Cohan, Goharian. CEDR: Contextualized Embeddings for Document Ranking. SIGIR 2019. Guo, Fan, Ai, Croft. A Deep Relevance Matching Model for Ad-hoc Retrieval. CIKM 2016.
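A toy sketch of the bucket-based scoring idea in the table above (in the spirit of the CEDR/DRMM-style matching cited on the slide): for each query term, count the document terms that fall into each similarity bucket, then combine the counts with bucket weights a, b, c. The similarity values and weights below are invented placeholders; in the actual models they come from contextualized embeddings and learned parameters.

    def bucket_counts(sims):
        """Count document terms per bucket: sim == 1.0, 0.6 < sim < 1.0, sim <= 0.6."""
        exact = sum(1 for s in sims if s == 1.0)
        close = sum(1 for s in sims if 0.6 < s < 1.0)
        loose = sum(1 for s in sims if s <= 0.6)
        return exact, close, loose

    def total_score(per_term_sims, weights):
        """Total score = sum over query terms of weighted bucket sizes, e.g. Score(Tesla) = 4*a + 3*b + 20*c."""
        a, b, c = weights
        total = 0.0
        for sims in per_term_sims.values():
            exact, close, loose = bucket_counts(sims)
            total += exact * a + close * b + loose * c
        return total

    # Invented similarities between each query term and the document's terms (counts match the table).
    per_term_sims = {
        "Tesla": [1.0] * 4 + [0.8] * 3 + [0.2] * 20,
        "net":   [1.0] * 1 + [0.7] * 8 + [0.3] * 40,
        "worth": [1.0] * 2 + [0.9] * 2 + [0.1] * 15,
    }
    print(total_score(per_term_sims, weights=(2.0, 1.0, 0.1)))  # a, b, c are placeholder weights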
  28. Neural IR: the good, the bad, and the ugly • The good • Big improvements in search quality • Key ingredient: putting words in context • Nice property: zero-shot ranking works • The bad • The ugly 28
  29. Zero-shot ranking Zero-shot: “this is my first attempt at a task” ➔ We typically think of search as zero-shot If Apple releases iDevice tomorrow, can immediately search for it ➔ In experiments, operationalized as “this is my first attempt at this dataset” 29 A Little Bit Is Worse Than None: Ranking with Limited Training Data. Zhang, Yates, Lin. SustaiNLP 2020.
  30. Zero-shot ranking. [Chart: a zero-shot neural method vs. traditional methods; training with up to 50% of in-domain data brings no improvement] A Little Bit Is Worse Than None: Ranking with Limited Training Data. Zhang, Yates, Lin. SustaiNLP 2020. 30
  31. Neural IR: the good, the bad, and the ugly • The good • The bad • Which approach is best? Depends on what we measure • Zero-shot effectiveness varies a lot • The ugly 31
  32. What should we measure? Many categories of queries are ‘solved’ by current systems: 121171: define etruscans (basic definition); 1113256: what is reba mcentire's net worth (simple factoid QA); 701453: what is a statutory deed (easy description). How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
  33. What should we measure? But others are more difficult: 1037798: who is robert gray (ambiguous); 10486: cost of interior concrete flooring (ambiguous and localized); 144862: did prohibition increased crime (complex reasoning); 507445: symptoms of different types of brain bleeds (concerns multiple entities). How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
  34. Hard + Interesting: 144862: did prohibition increased crime (complex reasoning); 507445: symptoms of different types of brain bleeds (concerns multiple entities). How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
  35. What makes a query difficult & interesting? Criteria • Non-Factoid: Not answerable by a single short answer • Beyond single passage: More than a simple definition • Answerable: Answerable solely from the provided context • Text-focused: Non-text answers or calculations should be excluded • Mostly well-formed: Does not contain clear spelling or language errors • Possibly Complex: Multiple entities, comparison, complex reasoning, or multiple answers [Annotation pipeline: TREC Deep Learning Collection queries and runs, entity linking, query annotations, and Web search engine annotations]
  36. The query type matters
    Query type  | TREC Deep Learning Collection | DL-HARD Collection
    Factoid     | 41%                           | 8%
    List        | 10%                           | 31%
    Long Answer | 18%                           | 27%
    Web Search  | 24%                           | 35%
    How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
  37. The query type matters
    Query type  | TREC Deep Learning Collection | DL-HARD Collection
    Factoid     | 41%                           | 8%
    List        | 10%                           | 31%
    Long Answer | 18%                           | 27%
    Web Search  | 24%                           | 35%
    Swaps in system rankings: ● rank 1 → rank 4 ● rank 13 → rank 5 ● bottom → bottom + 1 ● etc.
    How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. Mackie, Dalton, Yates. SIGIR 2021.
  38. Neural IR: the good, the bad, and the ugly • The good • The bad • Which approach is best? Depends on what we measure • Zero-shot effectiveness varies a lot • The ugly 38
  39. Zero-shot: reranking > dense retrieval 39 Reranking approaches compare individual query and document terms
  40. Zero-shot: reranking > dense retrieval Instead of comparing terms, entire query/doc represented by single vector 40
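A minimal numpy sketch of the architectural difference described on the two slides above: a reranking approach compares individual query and document term vectors (here aggregated as the best match per query term), while dense retrieval represents the entire query and document as single vectors compared with one dot product. The pooling and aggregation choices are illustrative assumptions, not the exact models from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    query_terms = rng.normal(size=(3, 8))   # 3 query term vectors (illustrative)
    doc_terms = rng.normal(size=(12, 8))    # 12 document term vectors

    # Reranking-style score: compare every query term with every document term, then aggregate.
    term_sims = query_terms @ doc_terms.T            # (3, 12) term-by-term similarities
    rerank_score = term_sims.max(axis=1).sum()       # best match per query term, summed

    # Dense-retrieval-style score: one vector per query and per document (mean pooling here),
    # compared with a single dot product.
    query_vec = query_terms.mean(axis=0)
    doc_vec = doc_terms.mean(axis=0)
    dense_score = float(query_vec @ doc_vec)

    print(rerank_score, dense_score)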
  41. Neural IR: the good, the bad, and the ugly • The good • The bad • The ugly • General similarity: prone to capturing bias and prone to using shortcuts (which do not generalize) • Little control over what knowledge is stored inside the neural model 41
  42. What’s inside a dense representation? Query: what is the gender of that doctor? Term weights: gender 112, doctor 96, who 71, women 67, dr 60, medical 55, her 52, pregnancy 43, …, birth 28, baby 24, number 7. [Diagram: the query “Tesla’s net worth” is encoded as a dense query vector (0.0, 0.2, 0.3, 2.5, 1.7, 0.8), which a Converter maps to term weights: tesla 32, worth 23, net 20, price 18, car 11, …] 42 Ongoing work with Thong Nguyen, IRLab, UvA.
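The converter in the ongoing work above is not specified on this slide. As a loose illustration of the idea of inspecting a dense representation, the sketch below projects a dense query vector onto a small vocabulary and reads off the largest term weights; the projection matrix, vocabulary, and resulting weights are all made up for illustration.

    import numpy as np

    rng = np.random.default_rng(42)
    vocab = ["tesla", "worth", "net", "price", "car", "gender", "doctor"]

    query_vec = np.array([0.0, 0.2, 0.3, 2.5, 1.7, 0.8])        # dense query vector (values from the slide)
    projection = rng.normal(size=(len(query_vec), len(vocab)))  # stand-in for a learned converter

    term_weights = query_vec @ projection                       # one weight per vocabulary term
    for term, weight in sorted(zip(vocab, term_weights), key=lambda x: -x[1]):
        print(f"{term:8s} {weight:6.2f}")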
  43. Neural IR: the good, the bad, and the ugly • The good • The bad • The ugly • General similarity: prone to capturing bias and prone to using shortcuts (which do not generalize) • Little control over what knowledge is stored inside the neural model 43
  44. 44 Language Models as Knowledge Bases? Petroni, Rocktaeschel, Lewis, Bakhtin, Wu, Miller, Riedel. EMNLP 2019. Language Models As or For Knowledge Bases. Razniewski, Yates, Kassner, Weikum. DL4KG 2021. What’s inside a large language model? Probe an LLM’s word predictions to see how it behaves ➔ Alan Turing was born in [MASK] Advanced LLMs are often correct, but fundamental issues remain • Epistemic: Probabilities ≠ knowledge • Birthplace(Biden) – Scranton (83%), New Castle (0.2%) • Birth country (Bengio) – France (3.5%), Canada (3.2%) • Falls back to correlations • Are all Greeks born in Athens? • Are all Panagiotis’ Greek? • Unaware of limits • Microsoft was acquired by … Novell (84%) • Sun Microsystems was acquired by … Oracle (84%)
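The probing setup described above can be sketched with the same masked-language-model machinery as earlier: feed the model a cloze-style statement and inspect its top predictions for the blank. This again assumes the Hugging Face transformers library and bert-base-uncased; the slide itself only describes the probe, and a larger model would give different predictions.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Alan Turing was born in [MASK].", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    top = torch.topk(probs, k=5)  # the five most probable fillers and their probabilities
    for p, idx in zip(top.values, top.indices):
        print(f"{tokenizer.convert_ids_to_tokens(idx.item()):12s} {p.item():.3f}")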
  45. Conclusion • The good: neural methods have significantly improved ranking quality • The bad: so far, no fast & effective "off the shelf" approach for general use • The ugly: is the bad caused by fundamental limitations of this paradigm? 45 Podcast on Neural IR Available: Spotify, iTunes, … https://arxiv.org/abs/2010.06467 https://bit.ly/tr4tr-book
  46. Conclusion • The good: neural methods have significantly improved ranking quality • The bad: so far, no fast & effective "off the shelf" approach for general use • The ugly: is the bad caused by fundamental limitations of this paradigm? 46 Podcast on Neural IR Available: Spotify, iTunes, … https://arxiv.org/abs/2010.06467 https://bit.ly/tr4tr-book Thanks!