Relevance Improvements at Cengage
          Ivan Provalov, Cengage Learning
      ivan.provalov@cengage.com, 10/19/2011
Outline
   Relevance improvements at Cengage Learning
   My background
   Cengage Learning
   Relevance issues
   Relevance tuning methodology
   Relevance tuning for English content
   Non-English content search
   Conclusions
My Background
 Ivan Provalov
  • Information Architect, Search
  • New Media: Technology and Development
  • Cengage Learning
 Background
  • Software development, architecture
  • Agile development
  • Information Retrieval, NLP
 Michigan IR Meetup
  • Interested in presenting IR topic online?


Cengage Learning
• $2B annual revenue
• One of the world’s largest textbook publishers
  and provider of research databases to libraries
• Academic and Professional
   • Brands: Gale, Brooks/Cole, Course Technology,
     Delmar, Heinle, South-Western, and Wadsworth
   • National Geographic School Publishing
• International locations
   • US, Mexico, Brazil, UK, Spain, India, Singapore,
     Australia, China
• Aggregated periodical products
   • 15,000 publications
Search Platform Profile
       Lucene 3.0.2
       Supporting approximately 100 products
       Content rights management
       Custom products
       210 million documents
       150 index fields per shard
       21 million terms per shard’s full text index
       250 queries per second



Mostly Search Guys (MSG)
 picture
Relevance Issues
 Ranking
  • Recency, publication ranking
  • Document length
  • Unbalanced shards
 Zero results queries
 Content quality
  • Near duplicate documents
  • Digitization
 Language
  • Tokenization
  • Detection
Relevance Tuning Methodology
    TREC
    Metrics: MAP, MRR, Recall
    Internal studies of top results
    Outsource to universities (WSU)
    Relevance feedback web application
    Usage logs review
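
The three metrics named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the evaluation code used at Cengage. The names `runs` (ranked document ids per query) and `qrels` (judged-relevant ids per query) and all document ids are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision@k over every rank k that holds a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    # Dividing by all relevant documents penalizes the ones never retrieved.
    return total / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / k
    return 0.0


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each metric over all judged queries (MAP, MRR, mean recall)."""
    totals = {"MAP": 0.0, "MRR": 0.0, "Recall": 0.0}
    for query_id, relevant in qrels.items():
        ranked = runs.get(query_id, [])
        totals["MAP"] += average_precision(ranked, relevant)
        totals["MRR"] += reciprocal_rank(ranked, relevant)
        totals["Recall"] += recall(ranked, relevant)
    n = len(qrels)
    return {name: (value / n if n else 0.0) for name, value in totals.items()}


if __name__ == "__main__":
    # Hypothetical judgments and rankings for two queries.
    qrels = {"q1": {"d1", "d4"}, "q2": {"d9"}}
    runs = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d2"]}
    print(evaluate(runs, qrels))
```

Running two analyzer configurations over the same `qrels` and comparing the resulting dictionaries gives the kind of side-by-side numbers reported on the language slides.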




Relevance Evaluation Form




Relevance Report by Query Type

Query Type                    Option A Option B Option C
long tail - misspelled           1.225     1.271    2.269
long tail - multi-term           1.502     1.885    1.929
top 100 - frequent expression    2.574     2.013    2.595
top 100 - person                 2.689     1.904    2.778
top 100 - place                  1.863     1.315    2.438
top 100 - single-term            2.753     1.775    2.792
top 100 - work                   2.408     2.198    2.514
Grand Total                      2.377     1.822    2.481



Usage Data Reporting




Relevance Improvements
 Pre-processing
  • Search assist
 Search time
  • Term query expansions
  • Phrase query expansions
  • Score modifications
 Post-processing
  • Results clustering




Search Assist
   Search terms recommendation
   Predictive spelling correction for queries
   Commonly occurring phrases in the content
   Over 6 million terms in keyword dictionary
   Limited by relevant content sets
Term Query Expansions
 Stemming
  • Porter stemmer
  • Dynamic
  • Stem families
     {water -> waters, watering, watered}

 Spelling
  • Dictionary size – 125K
  • Skipped when other phrase-based
    expansions are used
     {pharmaceeutical -> pharmaceutical}

 Pseudo relevance feedback
  • More like this
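
The stem-family and spelling expansions above can be sketched as follows. This is only an illustration under stated assumptions: `toy_stem` is a deliberately crude suffix stripper standing in for the Porter stemmer, spelling correction uses Python's standard `difflib` rather than whatever the production system used, and the vocabulary is hypothetical.

```python
import difflib
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Toy suffix list. This is NOT the Porter algorithm, only a stand-in for it.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_families(vocabulary: Iterable[str]) -> Dict[str, Set[str]]:
    """Group index terms that share a stem, e.g. water -> {water, waters, ...}."""
    families: Dict[str, Set[str]] = defaultdict(set)
    for word in vocabulary:
        families[toy_stem(word)].add(word)
    return dict(families)


def expand_term(term: str, families: Dict[str, Set[str]]) -> List[str]:
    """Return every term in the query term's stem family (the term itself if none)."""
    return sorted(families.get(toy_stem(term), {term}))


def correct_spelling(term: str, dictionary: List[str], cutoff: float = 0.8) -> str:
    """Return the closest dictionary entry, or the term unchanged if nothing is close."""
    if term in dictionary:
        return term
    matches = difflib.get_close_matches(term, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else term


if __name__ == "__main__":
    vocabulary = ["water", "waters", "watering", "watered", "pharmaceutical"]
    families = build_stem_families(vocabulary)
    print(expand_term("water", families))
    print(correct_spelling("pharmaceeutical", vocabulary))
```

Building the families from the index vocabulary, rather than stemming at index time, keeps the expansion dynamic: it can be switched on or off per query.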
Phrase Query Expansions
 Related subject synonyms
  • Controlled vocabulary
  • Machine aided indexing, manual indexing
     {death penalty -> “capital punishment”}

 Phrase extractions
  • Based on POS pattern (e.g. JJ_NN_NNS)
     {afro-american traditions, religion and
       beliefs -> afro american traditions}
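
Both phrase expansions above can be sketched as follows. The synonym table holds only the slide's own example. The part-of-speech tags in the usage example are assigned by hand for illustration, since the slides do not say which tagger was used, and the `OR` query syntax is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

# Controlled-vocabulary lookup: the only entry is the example from the slide.
SUBJECT_SYNONYMS: Dict[str, List[str]] = {
    "death penalty": ["capital punishment"],
}


def expand_with_synonyms(query: str, synonyms: Dict[str, List[str]]) -> str:
    """OR the quoted query together with any related subject phrases."""
    normalized = " ".join(query.lower().split())
    clauses = ['"%s"' % normalized]
    clauses += ['"%s"' % phrase for phrase in synonyms.get(normalized, [])]
    return " OR ".join(clauses)


def extract_phrases(
    tagged: Sequence[Tuple[str, str]], pattern: Sequence[str]
) -> List[str]:
    """Return every run of tokens whose POS tags match the pattern exactly."""
    size = len(pattern)
    wanted = tuple(pattern)
    phrases = []
    for start in range(len(tagged) - size + 1):
        window = tagged[start : start + size]
        if tuple(tag for _, tag in window) == wanted:
            phrases.append(" ".join(word for word, _ in window))
    return phrases


if __name__ == "__main__":
    print(expand_with_synonyms("death penalty", SUBJECT_SYNONYMS))

    # Hand-tagged tokens (an assumption, not real tagger output).
    tagged_query = [
        ("afro", "JJ"), ("american", "NN"), ("traditions", "NNS"),
        ("religion", "NN"), ("and", "CC"), ("beliefs", "NNS"),
    ]
    print(extract_phrases(tagged_query, ("JJ", "NN", "NNS")))
```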




Score Modifications
 Recency boosting
  • Function Query
  • Tuning – date range, boost
 Publication boosting
  •   Function Query
  •   Publication list with importance ranks
  •   Clickstream mining
  •   Peer reviewed
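
The two boosts above can be sketched as plain arithmetic. The reciprocal decay below has the general shape a / (m*x + b) that is often used for date boosts in function queries. The constants, the publication names, and the way the factors are combined are all hypothetical, since the slides do not give the tuned values.

```python
from datetime import date
from typing import Dict

# Hypothetical importance weights. A real list might come from editorial
# ranks or clickstream mining, as the slide suggests.
PUBLICATION_BOOST: Dict[str, float] = {
    "Journal A": 1.5,
    "Magazine B": 1.1,
}


def recency_boost(published: date, today: date, scale_days: float = 365.0) -> float:
    """Reciprocal decay: 1.0 for a document published today, 0.5 at scale_days old."""
    age_days = max((today - published).days, 0)
    return 1.0 / (age_days / scale_days + 1.0)


def boosted_score(
    base_score: float,
    published: date,
    today: date,
    publication: str,
    peer_reviewed: bool,
    peer_review_boost: float = 1.2,
) -> float:
    """Combine the text-match score with recency and publication signals."""
    score = base_score * (1.0 + recency_boost(published, today))
    score *= PUBLICATION_BOOST.get(publication, 1.0)
    if peer_reviewed:
        score *= peer_review_boost
    return score


if __name__ == "__main__":
    today = date(2011, 10, 19)
    print(boosted_score(2.0, date(2011, 10, 1), today, "Journal A", True))
    print(boosted_score(2.0, date(2001, 10, 1), today, "Unknown", False))
```

Adding 1.0 to the recency term keeps old documents from being zeroed out: the text-match score always contributes, and recency only adjusts it.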




Results Clustering
 Subject index
 Dynamic: Carrot2




Spanish

   Datasets
   Stemmer
   Stopwords
   Encoding

           Lucene Standard Analyzer   Lucene Spanish Analyzer
MAP        0.326                      0.401
MRR        0.781                      0.8
Recall     0.649                      0.716
Arabic

   Datasets
   Letter tokenizer
   Normalization filter
   Stemmer

            Lucene Standard Analyzer   Lucene Arabic Analyzer
MAP         0.192                      0.296
MRR         0.516                      0.615
Recall      0.619                      0.77
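
A normalization filter of the kind listed above can be sketched as follows. The rules are an assumption modeled on common Arabic orthographic normalization (folding alef variants, teh marbuta, and dotless yeh, and dropping tatweel and short-vowel diacritics). The slides do not spell out the exact rules that were evaluated.

```python
# Code points are written as escapes so the listing reads the same in any editor.
ALEF = "\u0627"

_NORMALIZATION = {
    0x0622: ALEF,      # alef with madda above -> bare alef
    0x0623: ALEF,      # alef with hamza above -> bare alef
    0x0625: ALEF,      # alef with hamza below -> bare alef
    0x0629: "\u0647",  # teh marbuta -> heh
    0x0649: "\u064A",  # dotless yeh (alef maksura) -> yeh
    0x0640: None,      # tatweel (stretching character) is removed
}
# Short-vowel and related diacritics, U+064B through U+0652, are removed.
_NORMALIZATION.update({code_point: None for code_point in range(0x064B, 0x0653)})


def normalize_arabic(text: str) -> str:
    """Fold orthographic variants so differently written forms index alike."""
    return text.translate(_NORMALIZATION)


if __name__ == "__main__":
    # The same name written with and without the hamza on the initial alef.
    with_hamza = "\u0623\u062D\u0645\u062F"
    without_hamza = "\u0627\u062D\u0645\u062F"
    print(normalize_arabic(with_hamza) == without_hamza)
```

Applying the same function at index time and at query time is what lets a query typed without diacritics match text that carries them.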
Chinese
   60k+ characters with meaning
   No whitespace
   Sorting
   Encoding
   Hit highlighting

            Analyzer                 MAP
            ChineseAnalyzer          0.384
            CJKAnalyzer              0.416
            SmartChineseAnalyzer     0.412
            IKAnalyzer               0.409
            Paoding                  0.444
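
Because Chinese text has no whitespace, the analyzers in the table differ mainly in how they cut text into tokens. In Lucene 3.x, ChineseAnalyzer is documented as emitting one token per character and CJKAnalyzer as emitting overlapping character bigrams. The sketch below only imitates those two strategies in plain Python. It is not Lucene code, and the dictionary-based segmenters (SmartChineseAnalyzer, IKAnalyzer, Paoding) are not modeled.

```python
from typing import List


def unigram_tokens(text: str) -> List[str]:
    """One token per character, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text: str) -> List[str]:
    """Overlapping two-character tokens. A single character is kept as is."""
    chars = unigram_tokens(text)
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


if __name__ == "__main__":
    sample = "信息检索"  # "information retrieval"
    print(unigram_tokens(sample))
    print(bigram_tokens(sample))
```

Bigrams keep adjacent characters together, so a query for a two-character word no longer matches documents that merely contain both characters somewhere, which may be one reason the bigram analyzer scores above the unigram one in the table.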


Conclusions
    Lucene relevance ranking
    Simple techniques (stemming, PRF, recency)
    Language specific analyzers
    TREC collections
    Relevancy priority in search system
     development




Contact
 Ivan Provalov
  • Ivan.Provalov@cengage.com
  • http://www.linkedin.com/in/provalov
 IR Meetup
  • http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group




Acknowledgements
 Cengage Learning: J. Zhou, P. Tunney, J.
  Nader, D. Koszewnik, J. McKinley, A. Cabansag,
  P. Pfeiffer, E. Kiel, D. May, B. Grunow, M. Green
 Lucid Imagination: R. Muir
 University of Michigan: B. King
 Wayne State University: B. Hawashin, Y. Wan,
  H. Anghelescu
 Michigan State University: B. Katt
 HTC: R. Laungani, K. Krishnamurthy, R.
  Vetrivelu
 University of Mich. Library: T. Burton-West
References
 Relevance
  http://en.wikipedia.org/wiki/Relevance_(information_retrieval)
 Lucene in Action, 2nd Edition, Michael McCandless, Erik
  Hatcher, and Otis Gospodnetić
 Introduction to IR http://nlp.stanford.edu/IR-book/information-retrieval-book.html
 TREC http://trec.nist.gov/tracks.html
 IBM TREC 2007 http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
 L. Larkey, et al, Light Stemming for Arabic Information
  Retrieval. Kluwer/Springer's series on Text, Speech, and
  Language Technology. IR-422, 2005.
 Cengage Learning at the TREC 2010 Session Track, B. King,
  I. Provalov, http://trec.nist.gov/pubs/trec19/papers/gale-cengage.rev.SESSION.pdf

