Relevance Improvements at Cengage
          Ivan Provalov, Cengage Learning
      ivan.provalov@cengage.com, 10/19/2011
Outline
   Relevance improvements at Cengage Learning
   My background
   Cengage Learning
   Relevance issues
   Relevance tuning methodology
   Relevance tuning for English content
   Non-English content search
   Conclusions
My Background
 ▪ Ivan Provalov
  • Information Architect, Search
  • New Media: Technology and Development
  • Cengage Learning
 ▪ Background
  • Software development, architecture
  • Agile development
  • Information Retrieval, NLP
 ▪ Michigan IR Meetup
  • Interested in presenting an IR topic online?


Cengage Learning
• $2B annual revenue
• One of the world’s largest textbook publishers
  and provider of research databases to libraries
• Academic and Professional
   • Brands: Gale, Brooks/Cole, Course Technology,
     Delmar, Heinle, South-Western, and Wadsworth
   • National Geographic School Publishing
• International locations
   • US, Mexico, Brazil, UK, Spain, India, Singapore,
     Australia, China
• Aggregated periodical products
   • 15,000 publications
Search Platform Profile
       Lucene 3.0.2
       Supporting approximately 100 products
       Content rights management
       Custom products
       210 million documents
       150 index fields per shard
       21 million terms per shard’s full text index
       250 queries per second



Mostly Search Guys (MSG)
Relevance Issues
 ▪ Ranking
  • Recency, publication ranking
  • Document length
  • Unbalanced shards
 ▪ Zero results queries
 ▪ Content quality
  • Near duplicate documents
  • Digitization
 ▪ Language
  • Tokenization
  • Detection
Relevance Tuning Methodology
    TREC
    Metrics: MAP, MRR, Recall
    Internal studies of top results
    Outsource to universities (WSU)
    Relevance feedback web application
    Usage logs review
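
The three metrics named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the evaluation code used at Cengage; the function names and the `runs`/`qrels` dictionaries are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant docs that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each per-query metric over all queries in `runs`."""
    n = len(runs)
    totals = {"MAP": 0.0, "MRR": 0.0, "Recall": 0.0}
    for query_id, ranked in runs.items():
        relevant = qrels.get(query_id, set())
        totals["MAP"] += average_precision(ranked, relevant)
        totals["MRR"] += reciprocal_rank(ranked, relevant)
        totals["Recall"] += recall(ranked, relevant)
    return {name: value / n for name, value in totals.items()}
```

Here `runs` maps a query id to the ranked document ids returned by the engine, and `qrels` maps a query id to the set of documents judged relevant, as in a TREC-style collection.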




Relevance Evaluation Form




Relevance Report by Query Type

Query Type                    Option A Option B Option C
long tail - misspelled           1.225     1.271    2.269
long tail - multi-term           1.502     1.885    1.929
top 100 - frequent expression    2.574     2.013    2.595
top 100 - person                 2.689     1.904    2.778
top 100 - place                  1.863     1.315    2.438
top 100 - single-term            2.753     1.775    2.792
top 100 - work                   2.408     2.198    2.514
Grand Total                      2.377     1.822    2.481



Usage Data Reporting




Relevance Improvements
 ▪ Pre-processing
  • Search assist
 ▪ Search time
  • Term query expansions
  • Phrase query expansions
  • Score modifications
 ▪ Post-processing
  • Results clustering




Search Assist
   Search terms recommendation
   Predictive spelling correction for queries
   Commonly occurring phrases in the content
   Over 6 million terms in keyword dictionary
   Limited by relevant content sets
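
A search-assist lookup of this kind can be sketched as a prefix search over a sorted keyword dictionary. The sketch below is an assumption about how such a feature might work, not the production implementation; `suggest` and the sample terms are hypothetical.

```python
import bisect
from typing import List


def suggest(prefix: str, sorted_terms: List[str], limit: int = 5) -> List[str]:
    """Return up to `limit` dictionary terms that start with `prefix`.

    `sorted_terms` must be sorted; binary search finds the first candidate.
    """
    suggestions: List[str] = []
    index = bisect.bisect_left(sorted_terms, prefix)
    while index < len(sorted_terms) and len(suggestions) < limit:
        term = sorted_terms[index]
        if not term.startswith(prefix):
            break  # sorted order: no later term can match the prefix
        suggestions.append(term)
        index += 1
    return suggestions


# Hypothetical sample dictionary; a real one would be restricted to the
# content sets the user is entitled to search.
TERMS = sorted(["capital gains", "capital punishment", "capitalism",
                "death penalty", "water"])
```

Restricting the dictionary per content set, as the slide notes, keeps suggestions from leading to zero-result queries.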
Term Query Expansions
 ▪ Stemming
  • Porter stemmer
  • Dynamic
  • Stem families
     {water -> waters, watering, watered}

 ▪ Spelling
  • Dictionary size – 125K
  • Skipped when other phrase-based expansions are used
     {pharmaceeutical -> pharmaceutical}

 ▪ Pseudo relevance feedback
  • More like this
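
The stem-family expansion above can be sketched as follows. The suffix rules are a toy stand-in for the Porter stemmer, and the vocabulary and function names are hypothetical; the sketch only illustrates grouping index terms by stem and expanding a query term at search time.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Toy suffix rules for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_families(vocabulary: Iterable[str]) -> Dict[str, Set[str]]:
    """Group vocabulary terms that share a stem."""
    families: Dict[str, Set[str]] = defaultdict(set)
    for term in vocabulary:
        families[toy_stem(term)].add(term)
    return dict(families)


def expand_term(term: str, families: Dict[str, Set[str]]) -> List[str]:
    """Return the query term followed by the other members of its family."""
    variants = families.get(toy_stem(term), set()) - {term}
    return [term] + sorted(variants)


# Hypothetical vocabulary taken from the slide's example.
FAMILIES = build_stem_families(["water", "waters", "watering", "watered", "wait"])
```

Expanding at query time leaves the index unstemmed, so exact-match queries still rank the literal form first.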
Phrase Query Expansions
 ▪ Related subject synonyms
  • Controlled vocabulary
  • Machine aided indexing, manual indexing
     {death penalty -> “capital punishment”}

 ▪ Phrase extractions
  • Based on POS pattern (e.g. JJ_NN_NNS)
     {afro-american traditions, religion and beliefs -> afro american traditions}
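
Phrase extraction by POS pattern can be sketched as a sliding-window match over tagged tokens. The tagger itself is out of scope here: the tags in the sample are assigned by hand and are an assumption, as are the function and variable names.

```python
from typing import List, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)


def extract_phrases(tagged: Sequence[TaggedToken],
                    patterns: Sequence[Tuple[str, ...]]) -> List[str]:
    """Return every token run whose tag sequence equals one of `patterns`."""
    phrases: List[str] = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if len(window) != len(pattern):
                continue  # not enough tokens left for this pattern
            if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
                phrases.append(" ".join(word for word, _ in window))
    return phrases


# Hand-tagged version of the slide's example query (tags are assumed).
QUERY = [("afro", "JJ"), ("american", "NN"), ("traditions", "NNS"),
         (",", ","), ("religion", "NN"), ("and", "CC"), ("beliefs", "NNS")]
PATTERNS = [("JJ", "NN", "NNS")]
```

With these inputs, `extract_phrases(QUERY, PATTERNS)` yields the single phrase from the slide, which could then be issued as a phrase query alongside the original terms.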




Score Modifications
 ▪ Recency boosting
  • Function Query
  • Tuning parameters: date range, boost
 ▪ Publication boosting
  • Function Query
  • Publication list with importance ranks
  • Clickstream mining
  • Peer reviewed
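
A recency boost of this kind can be sketched as a multiplier that decays with document age. The slide names Function Query as the mechanism but gives no formula, so the reciprocal decay below and its parameters (`scale_days`, `max_boost`) are assumptions chosen for illustration.

```python
from datetime import date


def recency_boost(pub_date: date, today: date,
                  scale_days: float = 365.0, max_boost: float = 2.0) -> float:
    """Multiplier in (1.0, max_boost] that decays as the document ages.

    A document published today gets `max_boost`; one that is `scale_days`
    old gets the midpoint between 1.0 and `max_boost`.
    """
    age_days = max((today - pub_date).days, 0)  # treat future dates as age 0
    return 1.0 + (max_boost - 1.0) * scale_days / (scale_days + age_days)


def boosted_score(base_score: float, pub_date: date, today: date,
                  publication_weight: float = 1.0) -> float:
    """Combine the text score with recency and a per-publication weight.

    `publication_weight` is a hypothetical value looked up from a ranked
    publication list; 1.0 means no publication boost.
    """
    return base_score * recency_boost(pub_date, today) * publication_weight
```

The two tuning knobs mirror the slide: `scale_days` controls the date range over which recency matters, and `max_boost` caps how far a new document can outrank an older one with the same text score.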




Results Clustering
 ▪ Subject index
 ▪ Dynamic: Carrot2




Spanish

   Datasets
   Stemmer
   Stopwords
   Encoding

Metric     Lucene Standard Analyzer   Lucene Spanish Analyzer
MAP        0.326                      0.401
MRR        0.781                      0.8
Recall     0.649                      0.716
Arabic

   Datasets
   Letter tokenizer
   Normalization filter
   Stemmer

Metric     Lucene Standard Analyzer   Lucene Arabic Analyzer
MAP        0.192                      0.296
MRR        0.516                      0.615
Recall     0.619                      0.77
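
The normalization step listed above can be sketched as a character-level rewrite applied before stemming. The rule set below (strip diacritics and tatweel, fold alef variants, alef maksura, and teh marbuta) is a common choice for Arabic retrieval and is given here as an assumption; it is not copied from the analyzer evaluated on this slide.

```python
import re

# Assumed rule set for illustration: short-vowel diacritics (U+064B..U+0652)
# and the tatweel elongation mark (U+0640) are removed outright.
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")

# Letter variants folded to a single canonical form.
_LETTER_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def normalize_arabic(text: str) -> str:
    """Remove diacritics and tatweel, then fold letter variants."""
    stripped = _DIACRITICS_AND_TATWEEL.sub("", text)
    return stripped.translate(_LETTER_MAP)
```

Applying the same function at index time and query time lets spelling variants of one word match each other, which is one plausible source of the recall gain in the table.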
Chinese
   60k+ characters with meaning
   No whitespace
   Sorting
   Encoding
   Hit highlighting

   Analyzer               MAP
   ChineseAnalyzer        0.384
   CJKAnalyzer            0.416
   SmartChineseAnalyzer   0.412
   IKAnalyzer             0.409
   Paoding                0.444
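
Because Chinese text has no whitespace, the analyzers in the table differ mainly in how they cut the character stream into tokens. The sketch below contrasts two simple strategies: one token per character, and overlapping character bigrams. It is a simplified illustration with hypothetical function names, not the analyzers' actual code; dictionary-based segmenters are not covered.

```python
from typing import List


def unigram_tokens(text: str) -> List[str]:
    """One token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text: str) -> List[str]:
    """Overlapping two-character tokens over the non-whitespace characters.

    A single remaining character is emitted on its own so that
    one-character inputs still produce a token.
    """
    chars = unigram_tokens(text)
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
```

For the four-character string "信息检索" (information retrieval), the unigram strategy gives four single-character tokens and the bigram strategy gives three overlapping pairs, two of which are real words.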


Conclusions
    Lucene relevance ranking
    Simple techniques (stemming, PRF, recency)
    Language specific analyzers
    TREC collections
    Relevancy priority in search system
     development




Contact
 ▪ Ivan Provalov
  • Ivan.Provalov@cengage.com
  • http://www.linkedin.com/in/provalov
 ▪ IR Meetup
  • http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group




Acknowledgements
 ▪ Cengage Learning: J. Zhou, P. Tunney, J. Nader, D. Koszewnik, J. McKinley, A. Cabansag, P. Pfeiffer, E. Kiel, D. May, B. Grunow, M. Green
 ▪ Lucid Imagination: R. Muir
 ▪ University of Michigan: B. King
 ▪ Wayne State University: B. Hawashin, Y. Wan, H. Anghelescu
 ▪ Michigan State University: B. Katt
 ▪ HTC: R. Laungani, K. Krishnamurthy, R. Vetrivelu
 ▪ University of Mich. Library: T. Burton-West
References
 ▪ Relevance http://en.wikipedia.org/wiki/Relevance_(information_retrieval)
 ▪ Lucene in Action, 2nd Edition, Michael McCandless, Erik Hatcher, and Otis Gospodnetić
 ▪ Introduction to IR http://nlp.stanford.edu/IR-book/information-retrieval-book.html
 ▪ TREC http://trec.nist.gov/tracks.html
 ▪ IBM TREC 2007 http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
 ▪ L. Larkey, et al, Light Stemming for Arabic Information Retrieval. Kluwer/Springer's series on Text, Speech, and Language Technology. IR-422, 2005.
 ▪ Cengage Learning at the TREC 2010 Session Track, B. King, I. Provalov, http://trec.nist.gov/pubs/trec19/papers/gale-cengage.rev.SESSION.pdf



 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Relevance Improvements at Cengage - Ivan Provalov

  • 1. Relevance Improvements at Cengage
      Ivan Provalov, Cengage Learning
      ivan.provalov@cengage.com, 10/19/2011
  • 2. Outline
      Relevance improvements at Cengage Learning
      My background
      Cengage Learning
      Relevance issues
      Relevance tuning methodology
      Relevance tuning for English content
      Non-English content search
      Conclusions
  • 3. My Background
      Ivan Provalov
        • Information Architect, Search
        • New Media: Technology and Development
        • Cengage Learning
      Background
        • Software development, architecture
        • Agile development
        • Information Retrieval, NLP
      Michigan IR Meetup
        • Interested in presenting IR topic online?
  • 4. Cengage Learning
      $2B annual revenue
      One of the world’s largest textbook publishers and provider of research databases to libraries
      Academic and Professional
        • Brands: Gale, Brooks/Cole, Course Technology, Delmar, Heinle, South-Western, and Wadsworth
        • National Geographic School Publishing
      International locations
        • US, Mexico, Brazil, UK, Spain, India, Singapore, Australia, China
      Aggregated periodical products
        • 15,000 publications
  • 5. Search Platform Profile
      Lucene 3.0.2
      Supporting approximately 100 products
      Content rights management
      Custom products
      210 million documents
      150 index fields per shard
      21 million terms per shard’s full text index
      250 queries per second
  • 8. Mostly Search Guys (MSG)  picture
  • 9. Relevance Issues
      Ranking
        • Recency, publication ranking
        • Document length
        • Unbalanced shards
      Zero results queries
      Content quality
        • Near duplicate documents
        • Digitization
      Language
        • Tokenization
        • Detection
  • 10. Relevance Tuning Methodology
      TREC
      Metrics: MAP, MRR, Recall
      Internal studies of top results
      Outsource to universities (WSU)
      Relevance feedback web application
      Usage logs review
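The three metrics named on this slide can be illustrated with a minimal Python sketch. The query IDs, document IDs, and relevance judgments below are made up for illustration, and the function names are hypothetical; this is not the evaluation harness used in the study.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k that hold a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / k
    return 0.0


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant documents that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Average each per-query metric over all queries to get MAP, MRR, Recall."""
    n = len(runs)
    if n == 0:
        return {"MAP": 0.0, "MRR": 0.0, "Recall": 0.0}
    totals = {"MAP": 0.0, "MRR": 0.0, "Recall": 0.0}
    for query_id, ranked in runs.items():
        relevant = qrels.get(query_id, set())
        totals["MAP"] += average_precision(ranked, relevant)
        totals["MRR"] += reciprocal_rank(ranked, relevant)
        totals["Recall"] += recall(ranked, relevant)
    return {name: value / n for name, value in totals.items()}


# Hypothetical toy data: two queries with ranked results and judged documents.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5", "d9"}}
metrics = evaluate(runs, qrels)
print(metrics)
```

MAP rewards placing all relevant documents early, MRR only looks at the first relevant hit, and Recall ignores order, which is why the three are usually reported together.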
  • 12. Relevance Report by Query Type
      Query Type                      Option A   Option B   Option C
      long tail - misspelled          1.225      1.271      2.269
      long tail - multi-term          1.502      1.885      1.929
      top 100 - frequent expression   2.574      2.013      2.595
      top 100 - person                2.689      1.904      2.778
      top 100 - place                 1.863      1.315      2.438
      top 100 - single-term           2.753      1.775      2.792
      top 100 - work                  2.408      2.198      2.514
      Grand Total                     2.377      1.822      2.481
  • 14. Relevance Improvements
      Pre-processing
        • Search assist
      Search time
        • Term query expansions
        • Phrase query expansions
        • Score modifications
      Post-processing
        • Results clustering
  • 15. Search Assist
      Search terms recommendation
      Predictive spelling correction for queries
      Commonly occurring phrases in the content
      Over 6 million terms in keyword dictionary
      Limited by relevant content sets
  • 16. Term Query Expansions
      Stemming
        • Porter stemmer
        • Dynamic
        • Stem families
        {water -> waters, watering, watered}
      Spelling
        • Dictionary size: 125K
        • Skipped when other phrase-based expansions are used
        {pharmaceeutical -> pharmaceutical}
      Pseudo relevance feedback
        • More like this
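The stem-family expansion on this slide can be illustrated with a minimal Python sketch. The suffix stripper is a deliberately crude stand-in for the Porter stemmer, and the vocabulary and function names are hypothetical; the sketch only shows how index terms sharing a stem can be grouped and OR-ed into the query at search time.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Toy suffix list: an assumption for illustration, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_families(vocabulary: Iterable[str]) -> Dict[str, Set[str]]:
    """Group index terms by their stem."""
    families: Dict[str, Set[str]] = defaultdict(set)
    for term in vocabulary:
        families[toy_stem(term)].add(term)
    return families


def expand_term(term: str, families: Dict[str, Set[str]]) -> List[str]:
    """Return the query term together with every index term sharing its stem."""
    family = families.get(toy_stem(term), set())
    return sorted(family | {term})


# Hypothetical index vocabulary.
vocabulary = ["water", "waters", "watering", "watered", "waterfall"]
families = build_stem_families(vocabulary)
expanded = expand_term("water", families)
query = " OR ".join(expanded)
print(query)
```

Expanding at query time, rather than indexing stems, leaves the index unchanged, so the expansion can be switched on or off per query.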
  • 17. Phrase Query Expansions
      Related subject synonyms
        • Controlled vocabulary
        • Machine aided indexing, manual indexing
        {death penalty -> “capital punishment”}
      Phrase extractions
        • Based on POS pattern (e.g. JJ_NN_NNS)
        {afro-american traditions, religion and beliefs -> afro american traditions}
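Phrase extraction by part-of-speech pattern can be sketched as a sliding-window match over tagged tokens. The sketch below assumes the query has already been tokenized and tagged by some POS tagger; the tags given to the example tokens are an assumption chosen to match the slide's JJ_NN_NNS pattern, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank style POS tag)


def extract_phrases(tagged: Sequence[TaggedToken], patterns: Sequence[str]) -> List[str]:
    """Return every run of tokens whose tag sequence equals one of the patterns.

    Patterns are written as tags joined with underscores, e.g. "JJ_NN_NNS".
    """
    tags = [tag for _, tag in tagged]
    phrases: List[str] = []
    for pattern in patterns:
        wanted = pattern.split("_")
        width = len(wanted)
        for start in range(len(tags) - width + 1):
            if tags[start:start + width] == wanted:
                words = [word for word, _ in tagged[start:start + width]]
                phrases.append(" ".join(words))
    return phrases


# Assumed tagger output for the slide's example query.
tagged_query = [
    ("afro", "JJ"), ("american", "NN"), ("traditions", "NNS"), (",", ","),
    ("religion", "NN"), ("and", "CC"), ("beliefs", "NNS"),
]
phrases = extract_phrases(tagged_query, ["JJ_NN_NNS"])
print(phrases)
```

An extracted phrase could then be added to the query as a quoted phrase clause so that documents containing the exact phrase rank higher.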
  • 18. Score Modifications
      Recency boosting
        • Function Query
        • Tuning parameters: date range, boost
      Publication boosting
        • Function Query
        • Publication list with importance ranks
        • Clickstream mining
        • Peer reviewed
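Recency boosting with the two tuning knobs named on this slide (date range and boost) might look like the following Python sketch. The linear decay, the parameter names `window_days` and `max_boost`, and their default values are assumptions for illustration; the slides do not specify the actual function used in the Lucene function query.

```python
from datetime import date, timedelta
from typing import List, Tuple


def recency_boost(pub_date: date, today: date,
                  window_days: int = 365, max_boost: float = 0.5) -> float:
    """Multiplier in [1, 1 + max_boost] that decays linearly over the window.

    Documents older than window_days get no boost (multiplier 1.0).
    """
    age_days = max(0, (today - pub_date).days)
    freshness = max(0.0, 1.0 - age_days / window_days)
    return 1.0 + max_boost * freshness


def rescore(hits: List[Tuple[str, float, date]], today: date,
            window_days: int = 365, max_boost: float = 0.5) -> List[Tuple[str, float]]:
    """Multiply each text-relevance score by its recency boost and re-sort."""
    boosted = [
        (doc_id, score * recency_boost(pub_date, today, window_days, max_boost))
        for doc_id, score, pub_date in hits
    ]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


# Hypothetical hits: (document id, text relevance score, publication date).
today = date(2011, 10, 19)
hits = [
    ("older", 1.0, today - timedelta(days=1000)),
    ("newer", 0.8, today),
]
ranking = rescore(hits, today)
print(ranking)
```

A multiplicative boost keeps the text score as the primary signal; widening the window or lowering the maximum boost makes recency matter less.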
  • 19. Results Clustering
      Subject index
      Dynamic: Carrot2
  • 20. Spanish
      Datasets
      Stemmer
      Stopwords
      Encoding

      Metric   Lucene Standard Analyzer   Lucene Spanish Analyzer
      MAP      0.326                      0.401
      MRR      0.781                      0.8
      Recall   0.649                      0.716
  • 21. Arabic
      Datasets
      Letter tokenizer
      Normalization filter
      Stemmer

      Metric   Lucene Standard Analyzer   Lucene Arabic Analyzer
      MAP      0.192                      0.296
      MRR      0.516                      0.615
      Recall   0.619                      0.77
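The normalization step listed on this slide can be illustrated with a small Python sketch. The rules below (folding alef variants, mapping teh marbuta to heh and dotless yeh to yeh, dropping diacritics and tatweel) are orthographic normalizations commonly used in Arabic retrieval and are similar in spirit to those of Lucene's Arabic normalizer; whether the evaluated pipeline applied exactly this rule set is an assumption.

```python
# Code points are taken from the Unicode Arabic block.
ALEF = "\u0627"
ALEF_VARIANTS = "\u0622\u0623\u0625"  # alef with madda, hamza above, hamza below
DOTLESS_YEH = "\u0649"                # alef maksura
YEH = "\u064A"
TEH_MARBUTA = "\u0629"
HEH = "\u0647"
TATWEEL = "\u0640"                    # stretching character
DIACRITICS = "".join(chr(cp) for cp in range(0x064B, 0x0653))  # fathatan .. sukun

# Translation table: map variants to a canonical letter, delete marks (None).
_TABLE = {ord(ch): ALEF for ch in ALEF_VARIANTS}
_TABLE[ord(DOTLESS_YEH)] = YEH
_TABLE[ord(TEH_MARBUTA)] = HEH
_TABLE[ord(TATWEEL)] = None
for mark in DIACRITICS:
    _TABLE[ord(mark)] = None


def normalize_arabic(text: str) -> str:
    """Apply the character-level normalization rules to a token or string."""
    return text.translate(_TABLE)


# Example: alef with hamza above plus a fatha mark, normalized to bare letters.
print(normalize_arabic("\u0623\u062D\u0645\u064E\u062F"))
```

Applying the same function to both indexed text and queries makes spelling variants of one word collapse to a single term, which is one plausible reason a language-specific analyzer improves recall.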
  • 22. Chinese
      60k+ characters with meaning
      No whitespace
      Sorting
      Encoding
      Hit highlighting

      Analyzer               MAP
      ChineseAnalyzer        0.384
      CJKAnalyzer            0.416
      SmartChineseAnalyzer   0.412
      IKAnalyzer             0.409
      Paoding                0.444
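Because Chinese text has no whitespace, the simplest analyzers fall back on character n-grams. In Lucene 3.x, ChineseAnalyzer emits one token per character while CJKAnalyzer emits overlapping character bigrams; the Python sketch below imitates only that core idea. The function names are hypothetical, and the real analyzers also handle Latin text, digits, and punctuation, which this sketch ignores.

```python
from typing import List


def unigram_tokens(text: str) -> List[str]:
    """One token per non-whitespace character (the ChineseAnalyzer idea)."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text: str) -> List[str]:
    """Overlapping two-character tokens (the CJKAnalyzer idea)."""
    chars = unigram_tokens(text)
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


sample = "中文检索"  # "Chinese-language retrieval"
print(unigram_tokens(sample))
print(bigram_tokens(sample))
```

Bigrams capture many two-character words without needing a dictionary, whereas SmartChineseAnalyzer, IKAnalyzer, and Paoding rely on dictionary or statistical word segmentation.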
  • 23. Conclusions
      Lucene relevance ranking
      Simple techniques (stemming, PRF, recency)
      Language specific analyzers
      TREC collections
      Relevancy priority in search system development
  • 24. Contact
      Ivan Provalov
        • Ivan.Provalov@cengage.com
        • http://www.linkedin.com/in/provalov
      IR Meetup
        • http://www.meetup.com/Michigan-Information-Retrieval-Enthusiasts-Group
  • 25. Acknowledgements
      Cengage Learning: J. Zhou, P. Tunney, J. Nader, D. Koszewnik, J. McKinley, A. Cabansag, P. Pfeiffer, E. Kiel, D. May, B. Grunow, M. Green
      Lucid Imagination: R. Muir
      University of Michigan: B. King
      Wayne State University: B. Hawashin, Y. Wan, H. Anghelescu
      Michigan State University: B. Katt
      HTC: R. Laungani, K. Krishnamurthy, R. Vetrivelu
      University of Mich. Library: T. Burton-West
  • 26. References
      Relevance, http://en.wikipedia.org/wiki/Relevance_(information_retrieval)
      Lucene in Action, 2nd Edition, Michael McCandless, Erik Hatcher, and Otis Gospodnetić
      Introduction to IR, http://nlp.stanford.edu/IR-book/information-retrieval-book.html
      TREC, http://trec.nist.gov/tracks.html
      IBM TREC 2007, http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
      L. Larkey, et al., Light Stemming for Arabic Information Retrieval. Kluwer/Springer's series on Text, Speech, and Language Technology. IR-422, 2005.
      Cengage Learning at the TREC 2010 Session Track, B. King, I. Provalov, http://trec.nist.gov/pubs/trec19/papers/gale-cengage.rev.SESSION.pdf