
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning meets Search!

This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction-based systems. Typically, such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit applied during the query, as well as the presence of inefficiencies in the system due to the decoupling of retrieval and ranking.

To address these issues the authors created ML-Scoring, an open source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model, trained in Spark, Weka, or R, that is loaded as a plugin into Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but it will also walk through practical examples, from loading a dataset into Elasticsearch, to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.



  1. 1. RecSys 2015 Tutorial Scalable Recommender Systems Where Machine Learning Meets Search!
  2. 2. Presenters Diana Hu Senior Data Scientist @sdianahu diana.hu@verizon.com Joaquin Delgado, PhD. Director of Engineering @joaquind joaquin.a.delgado@verizon.com
  3. 3. Disclaimer The content of this presentation consists of the authors’ personal statements and does not officially represent their employers’ views in any way. The included content is especially not intended to convey the views of OnCue or Verizon.
  4. 4. Index 1. Introduction 1. What to expect? 2. Scaling recommender systems is hard 2. Recommender System Problem as a Search Problem 1. Representing queries as recommendations 3. Introduction to Search and Information Retrieval 1. Scalability in search 2. Introduction to Elasticsearch 4. Overview of Machine Learning Techniques for Recommender Systems 1. Learning to rank 2. Scalability in machine learning 3. ML software frameworks 5. Re-writing the ranking function 1. Writing a new ranking/scoring function in Elasticsearch 2. Training a Spark model as an Elasticsearch plugin for a custom ranking/scoring function 6. References
  5. 5. 1. Introduction
  6. 6. What to expect from this tutorial? • The focus is on practical examples of how to implement scalable recommender systems using search and learning-to-rank (machine learning) techniques • What it is not • Deep dive into any specific areas (Search, RecSys, Learning to rank, or Machine learning) • Algorithmic survey • Comparative Analysis
  7. 7. Finding commonalities Ranking RecSys Discovery Information Retrieval Search Advertising
  8. 8. What is a recommendation? Beyond rating prediction
  9. 9. Paradigms of recommender systems • Reduce information load by estimating relevance • Ranking Approaches: • Collaborative filtering: “Tell me what is popular amongst my peers” • Content Based: “Show me more of what I liked” • Knowledge Based: “Tell me what fits my needs” • Hybrid
  10. 10. Model Type / Pros / Cons • Collaborative — Pros: no metadata engineering effort, serendipity of results, learns market segments. Cons: requires rating feedback, cold start for new users and new items. • Content-based — Pros: no community required, comparison between items possible. Cons: content descriptions necessary, cold start for new users, no serendipity. • Knowledge-based — Pros: deterministic, assured quality, no cold start, interactive user sessions. Cons: knowledge engineering effort to bootstrap, static, does not react to short-term trends.
  11. 11. Scaling recommender systems is hard! • Millions of users • Millions of items • Cold start for an ever-increasing catalog and newly added users • Imbalanced datasets – power-law distributions are quite common • Many algorithms have not been fully tested at “Internet Scale”
  12. 12. 2. Recommender System Problem as a Search Problem
  13. 13. Content-based methods inspired by IR • Rec Task: Given a user profile, find the best-matching items by their attributes • Similarity calculation: based on keyword overlap between user/items • Neighborhood method (i.e., nearest neighbor) • Query-based retrieval (i.e., Rocchio’s method) • Probabilistic methods (classical text classification) • Explicit decision models • Feature representation: based on content analysis • Vector space model • TF-IDF • Topic Modeling
  14. 14. Search queries as content-based recommendations • Exact matching (Boolean) • Relevant or not relevant (no ranking) • Ranking by similarity to query (Vector Space Model) • Text similarity: Bag of words, TF-IDF, Incidence Matrix • Ranking by importance (e.g. PageRank)
  15. 15. Content-based similarity measures • Simple match • Dice’s Coefficient • Jaccard’s Coefficient • Cosine Coefficient • Overlap Coefficient 3D Term Vector Space
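These coefficients are simple to compute from document term sets; a minimal Scala sketch (function and argument names are ours for illustration, not from the tutorial code):

  // Set-based coefficients over two documents' term sets
  def simpleMatch(a: Set[String], b: Set[String]): Int = (a intersect b).size
  def dice(a: Set[String], b: Set[String]): Double = 2.0 * (a intersect b).size / (a.size + b.size)
  def jaccard(a: Set[String], b: Set[String]): Double = (a intersect b).size.toDouble / (a union b).size
  def overlap(a: Set[String], b: Set[String]): Double = (a intersect b).size.toDouble / math.min(a.size, b.size)

  // Cosine over weighted term vectors (e.g. TF-IDF weights keyed by term)
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = (a.keySet intersect b.keySet).toSeq.map(t => a(t) * b(t)).sum
    def norm(m: Map[String, Double]) = math.sqrt(m.values.map(v => v * v).sum)
    dot / (norm(a) * norm(b))
  }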
  16. 16. Knowledge-based methods inspired by IR • Rec Task: Given explicit recommendation rules find the best matches between user’s requirements and item’s characteristics (i.e., which item should be recommended in which context?) • Similarity calculation: based on constraint satisfaction problem and distance similarity requirements<->attributes • Conjunctive queries • Similarity metrics for item retrieval • Feature representation: based on query representation • User defined preferences • Utility-based preferences • Conjoint analysis
  17. 17. Search queries as knowledge-based recommendations • Constraint satisfaction problem (CSP) is a tuple (V,D,C) • V – set of variables • D – set of finite domains for V • C – set of constraints of possible V permutations • Recommendation as CSP: (V,D,C) => (Vi U Vu, D, Cr U Ci U Cf U REQ) • Vu – user properties (possible user’s requirements) • Vi – item properties • Cr – compatibility constraints (possible Vc permutations) • Ci – Item constraints (conjunction fully defines an item) • Cf – filter conditions (define Vu<->Vi relationships) • REQ – user’s requirements
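To make the CSP framing concrete, here is a toy Scala sketch of conjunctive matching between user requirements (REQ) and item properties (Vi); all names are illustrative assumptions, not the tutorial's code:

  // Each item is fully defined by a conjunction of attribute/value pairs (Ci)
  case class Item(id: String, attrs: Map[String, String])

  // REQ is satisfied when every required attribute/value pair matches the item
  def satisfies(item: Item, req: Map[String, String]): Boolean =
    req.forall { case (k, v) => item.attrs.get(k).contains(v) }

  // A knowledge-based recommendation is then a conjunctive query over the catalog
  def recommend(catalog: Seq[Item], req: Map[String, String]): Seq[Item] =
    catalog.filter(satisfies(_, req))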
  18. 18. 3. Introduction to Search and Information Retrieval
  19. 19. Search Search is about finding specific things that are either known or assumed to exist. Discovery is about helping the user encounter what he/she didn’t even know exists. Both Search and Discovery can be achieved through a query-based data/information system. Predicate Logic and Declarative Languages Rock!
  20. 20. Examples of query-based systems • Focused on Search • Search engines • Database systems • Focused on Discovery • Recommender systems • Advertising systems
  21. 21. IR: The science behind search! Information Retrieval (IR) is query-based data retrieval + relevance ranking (scoring), usually applied to unstructured data (i.e., text documents and fields); often referred to as full-text or keyword search. Have you heard of Bag-of-Words? Vector Space Representation? What about TF-IDF?
  22. 22. IR Architecture (diagram): offline, documents pass through a representation function into an index; online, the input query passes through its own representation function, a similarity calculation matches the query representation against the doc representations in the index, and the matched hits are returned as retrieved documents, optionally refined via relevance feedback (* optional).
  23. 23. Retrieval Models • Boolean — query: Boolean expressions connected by AND, OR, NOT; documents: set of keywords, bag of words with binary term weights; retrieval: exact match, binary relevance, no ranking. • Vector Space Model — query: vector of desired terms with optional weights; documents: bag-of-words vectors weighted by the TF-IDF scheme; retrieval: similarity score, ranked output documents, relevance feedback support. • Probabilistic — retrieval: similarity with priors, document relevance, ranks documents in decreasing probability of relevance.
  24. 24. Ranking in the Vector Space Model
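The formula on this slide did not survive the transcript; the standard VSM ranking it refers to is cosine similarity between TF-IDF weighted query and document vectors (textbook notation, not copied from the slide):

  \mathrm{score}(q,d) = \cos(\vec q, \vec d) = \frac{\sum_t w_{t,q}\, w_{t,d}}{\sqrt{\sum_t w_{t,q}^2}\,\sqrt{\sum_t w_{t,d}^2}}, \qquad w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}

where N is the number of documents in the collection and df_t is the number of documents containing term t.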
  25. 25. Search Engines: the big hammer! • Search engines are largely used to solve non-IR search problems, and here is why: • Widely available • Fast and scalable distributed systems • Integrates well with existing data stores (SQL and NoSQL)
  26. 26. But are we using the right tool? • Search Engines were originally designed for IR. • More complex non-IR search/discovery tasks sometimes require a multi-phase, multi-system approach
  27. 27. Filter + Scoring: Two-Phase Approach (diagram: Filter → Rank)
  28. 28. Elasticsearch • What is Elasticsearch? • Elasticsearch is an open-source search engine • Elasticsearch is written in Java • Built on top of Apache Lucene™ • A distributed real-time document store where every field is indexed and searchable out of the box • A distributed search engine with real-time analytics • Has a plugin architecture that facilitates extending the core system • Written with NRT (near-real-time) and cloud support in mind • Easy index, shard, and replica creation on a live cluster • Has Optimistic Concurrency Control
  29. 29. Examples of scaling challenges • More than 50 million documents a day • Real-time search • Less than 200ms average query latency • Throughput of at least 1000 QPS • Multilingual indexing • Multilingual querying
  30. 30. Who uses ES? • Wikipedia • Uses Elasticsearch to provide full-text search with highlighted search snippets, and search-as-you-type and did-you-mean suggestions. • The Guardian • Uses Elasticsearch to combine visitor logs with social-network data to provide real-time feedback to its editors about the public’s response to new articles. • Stack Overflow • Combines full-text search with geo-location queries and uses more-like-this to find related questions and answers. • GitHub • Uses Elasticsearch to query 130 billion lines of code.
  31. 31. How ES scales? • Sharding and Replicas • Several indices (at least one index for each day of data) • Indices divided into multiple shards • Multiple replicas of a single shard • Real-time, synchronous replication • Near-real-time index refresh (1 to 30 seconds)
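Shard and replica counts are set per index at creation time (replicas can also be changed later); a hypothetical index-creation request, in the same ES DSL style used in the examples later in this deck:

  PUT /events-2015-09-16
  {
    "settings": {
      "number_of_shards": 6,
      "number_of_replicas": 2,
      "refresh_interval": "30s"
    }
  }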
  32. 32. Indexing the data
  33. 33. Querying ES (diagram: an application queries an ES index whose shards are distributed across Nodes 1–8)
  34. 34. Using Search Engines for RS • It’s not just about rating prediction and ranking • Business filtering logic • Age restrictions • Catalog navigation context (e.g., e-commerce) • Promotional materials • Low latency and scale • SLAs on response times including query, response, and presentation • Actual time for computing recommendations is just a small fraction of total allocated time
  35. 35. Stacking things up (diagram): Visualization / UI, Retrieval, Ranking, Query Generation and Contextual Pre-filtering, Contextual Post-filtering, Model Building, Index Building, Data/Events Collection, Data Analytics, and Experimentation, split across online and offline halves.
  36. 36. Ranking in Elasticsearch
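The graphic on this slide is not preserved in the transcript. For reference, the default relevance function that ML-Scoring later replaces is Lucene's "practical scoring function", a TF-IDF variant of roughly this shape:

  \mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)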
  37. 37. 4. Overview of Machine Learning Techniques for Recommender Systems
  38. 38. Machine Learning Machine Learning, in particular supervised learning, refers to techniques used to learn how to classify or score previously unseen objects based on a training dataset. Inference and Generalization are the Key!
  39. 39. Recommendations as data mining Amatriain, Xavier, et al. "Data mining methods for recommender systems." Recommender Systems Handbook. Springer US, 2011. 39-71.
  40. 40. Learning to rank • Formulate the problem as standard supervised learning • Training data can be cardinal or binary • Various approaches: • Pointwise: Typically approximated by regression • Pairwise: Approximated via binary classifier • Listwise: Directly optimize whole list (difficult!) • A trick with ES is to include raw scores returned by ES into the feature vector
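As a sketch of the pointwise approach combined with the ES-score trick, here is a 2015-era Spark MLlib snippet; the input collections and the feature layout (ES raw score plus item features) are our assumptions for illustration:

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // label: 1.0 = relevant (e.g. clicked/watched), 0.0 = not;
  // features: [esRawScore, popularity, price] -- the ES score is just another feature
  val training = labeledEvents.map { e =>
    LabeledPoint(e.label, Vectors.dense(e.esScore, e.popularity, e.price))
  }.cache()

  val model = LogisticRegressionWithSGD.train(training, 100)
  model.clearThreshold() // rank by raw probability instead of hard 0/1 predictions

  // At query time, re-rank the candidate list returned by ES
  val ranked = candidates.sortBy(c => -model.predict(Vectors.dense(c.esScore, c.popularity, c.price)))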
  41. 41. Learning to rank with ES (diagram): input contextual features form an ES query against the ES index; the potential matches are scored by a trained ranking model, produced by an ML framework from a gold dataset, yielding ranked results as output.
  42. 42. Web scale ML challenges • Massive amount of examples • Billions of features • Big models don’t fit in a single machine’s memory • Variety of algorithms that need to be scaled up A Note of Caution….
  43. 43. “Invariably, simple models and a lot of data trump more elaborate models based on less data.” Alon Halevy, Peter Norvig, and Fernando Pereira, Google http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/35179.pdf
  44. 44. Scalability in Machine Learning • Distributed systems – Fault tolerance, Throughput vs. latency • Parallelization Strategies – Hashing, trees • Processing – Map reduce variants, MPI, graph parallel • Databases – Key/Value Stores, NoSQL
  45. 45. What is Spark? A fast, expressive cluster computing system (diagram: Spark Core underpins BlinkDB for approximate queries, Spark SQL for structured data, MLlib for machine learning, Spark Streaming for real-time processing, and GraphX for graph analytics).
  46. 46. What is Spark? • Work on distributed collections like local ones • RDD: • Immutable • Parallel transforms • Resilient and configurable persistence • Operations • Transforms: Lazy operations (map, filter, join,…) • Actions: Return/write results (collect, save, count,…)
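A tiny Scala example of the lazy-transform vs. action distinction described above (the file path and field layout are hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("rdd-demo"))

  val lines = sc.textFile("hdfs:///data/ratings.csv")  // RDD: immutable, partitioned
  val highRatings = lines.map(_.split(","))            // transform: lazy
    .filter(fields => fields(2).toDouble >= 4.0)       // transform: lazy
    .cache()                                           // resilient, configurable persistence

  val n = highRatings.count()                          // action: triggers the computation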
  47. 47. ML Software Framework: Spark MLlib • Subproject with ML primitives • Building blocks (as a framework vs. library) • Large scale statistics • Classification • Regression • Clustering • Matrix factorization • Optimization • Frequent pattern mining • Dimensionality reduction
  48. 48. What is ML-Scoring? • Creates an Elasticsearch (ES) document index of instances • Trains a supervised ML model from a dataset of instances + labels • Generates an Elasticsearch plugin that uses the trained ML model to score documents at query time • An Open Source POC!
  49. 49. Remember the elephant? (diagram, as on slide 35): Visualization / UI, Retrieval, Ranking, Query Generation and Contextual Pre-filtering, Contextual Post-filtering, Model Building, Index Building, Data/Events Collection, Data Analytics, and Experimentation across online and offline halves.
  50. 50. Simplifying the Stack! (diagram): Visualization / UI, Query Generation and Contextual Pre-filtering, Model Building, Index Building, Data/Events Collection, and Data Analytics, with Retrieval, Contextual Post-filtering, and Ranking now grouped together, split across online and offline halves, plus Experimentation.
  51. 51. Elasticsearch ML-Scoring Architecture (diagram): instances + labels feed a trainer + indexer, which produces an instances index and a serialized ML model consumed by the ML-Scoring plugin.
  52. 52. 5. Re-writing the ranking function
  53. 53. Using ML-Scoring • Creating an ES Index • Boolean queries • More-Like-This queries • Built-in scoring functions • Scoring script • Scoring plugin • ML-Score evaluator using Spark • ML-Score query
  54. 54. Creating an Index in ES
  POST /my_movie_catalog/movies/_bulk
  { "index": { "_id": 1 }}
  { "genre": "Documentary", "productID": "XHDK-A-1293-#fJ3", "title": "Olympic Sports", "content": "Olympic greatness…", "price": 20 }
  { "index": { "_id": 2 }}
  { "genre": "Sports", "productID": "KDKE-B-9947-#kL5", "title": "NY Yankees: Winning the World Series", "content": "There is no better team than the NY…", "price": 20 }
  { "index": { "_id": 3 }}
  { "genre": "Action", "productID": "JODL-X-1937-#pV7", "title": "Rambo III", "content": "Sylvester Stallone is evermore…", "price": 18 }
  { "index": { "_id": 4 }}
  { "genre": "Children", "productID": "QQPX-R-3956-#aD8", "title": "Fairy Tale", "content": "Once upon a time…", "price": 30 }
  55. 55. Boolean queries • SQL representation
  SELECT movie FROM movies
  WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3")
    AND (price != 30)
  • ES DSL
  GET /my_movie_catalog/movies/_search
  { "query": {
      "filtered": {
        "filter": {
          "bool": {
            "should": [
              { "term": { "price": 20 }},
              { "term": { "productID": "XHDK-A-1293-#fJ3" }}
            ],
            "must_not": { "term": { "price": 30 }}
  …
  56. 56. Content based similarity queries (MLT) { "more_like_this" : { "fields" : ["title", "description"], "like_text" : "Once upon a time", "min_term_freq" : 1, "max_query_terms" : 12 } } • The More Like This Query (MLT Query) finds documents that are "like" a given set of documents. In order to do so, MLT selects a set of representative terms of these input documents, forms a query using these terms, executes the query and returns the results.
  57. 57. Similar to a given document
  { "more_like_this": {
      "fields": ["title", "description"],
      "docs": [
        { "_index": "imdb", "_type": "movies", "_id": "1" },
        { "_index": "imdb", "_type": "movies", "_id": "2" }],
      "min_term_freq": 1,
      "max_query_terms": 12
  } }
  58. 58. Built-in functions • Suppose we want to boost movies by popularity (baseline of many RS)
  59. 59. Popularity-based boosting GET /my_movie_catalog/movies/post/_search { "query": { "function_score": { "query": { "multi_match": { "query": "popularity", "fields": [ "title", "content" ] } }, "field_value_factor": { "field": "votes", "modifier": "log1p" } } } }
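With the log1p modifier, the boost applied above is log(1 + votes), and by default function_score multiplies it into the text-relevance score, i.e. roughly:

  new_score = old_score × log(1 + number_of_votes)

which keeps heavily voted movies from overwhelming text relevance.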
  60. 60. Geo-Location • Suppose we want to build a location-aware recommender system
  61. 61. Decay functions • Supported decay functions • Linear • Gauss • Exp • Also supported • random_score
  GET /_search
  { "query": {
      "function_score": {
        "functions": [
          { "gauss": {
              "location": {
                "origin": { "lat": …, "lon": … },
                "offset": "2km",
                "scale": "3km"
              } } },
          { "gauss": {
              "price": {
                "origin": "50",
                "offset": "50",
                "scale": "20"
              } },
            "weight": 2
  …
  62. 62. ES scoring script • Trickier pricing- and margin-based scoring
  if (price < threshold) {
    profit = price * margin                   // full-price sale
  } else {
    profit = price * (1 - discount) * margin  // discounted sale
  }
  return profit / target                      // normalize by the profit target
  63. 63. ES Scoring Script
  GET /_search
  { "function_score": {
      "functions": [
        { ...location clause... },
        { ...price clause... },
        { "script_score": {
            "params": { "threshold": 80, "discount": 0.1, "target": 10 },
            "script": "price = doc['price'].value; margin = doc['margin'].value;
                       if (price < threshold) { return price * margin / target };
                       return price * (1 - discount) * margin / target;"
        }}
  …
  64. 64. Limitations of ranking using the ES practical scoring function • Stateless computation • Meant primarily for text search • Hard to represent context and history • Limited complexity (simple math functions only) • Nevertheless, the original score should not be discarded as it may come in handy!
  65. 65. Scoring plugin in ES
  public class PredictorPlugin extends AbstractPlugin {
    @Override
    public String name() { return getClass().getName(); }

    @Override
    public String description() { return "Simple plugin to predict values."; }

    // Register the custom score script so ES can invoke it by name
    public void onModule(ScriptModule module) {
      module.registerScript(
        PredictorScoreScript.SCRIPT_NAME,
        PredictorScoreScript.Factory.class);
    }
  }
  66. 66. ML-Scoring evaluator using Spark
  class SparkPredictorEngine[M](val readPath: String, val spHelp: SparkModelHelpers[M])
      extends PredictorEngine {

    private var _model: ModelData[M] = ModelData[M]()

    override def getPrediction(values: Collection[IndexValue]) = {
      if (_model.clf.nonEmpty) {
        // Convert the indexed field values into a feature vector and apply the model
        val v = ReadUtil.cIndVal2Vector(values, _model.mapper)
        _model.clf.get.predict(v)
      } else {
        throw new PredictionException("Empty model")
      }
    }

    def readModel() = _model = spHelp.readSparkModel(readPath)

    def getModel: ModelData[M] = _model
  }
  67. 67. ML-Scoring query { "query": { "function_score": { "query": { "match_all": {} }, "functions": [ { "script_score": { "script": "search-predictor", "lang": "native", "params": {} } } ], "boost_mode": "replace" } } }
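Setting "boost_mode": "replace" discards Elasticsearch's own relevance score entirely and uses the plugin's model prediction as the final ranking score; keeping the default (multiply) would instead blend the two.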
  68. 68. https://github.com/sdhu/elasticsearch-prediction
  69. 69. Potential issues • Performance • It may be a problem if the search space is very large and/or the computation too intensive • Operations • Code running on a key infrastructure • Versioning and binary compatibility
  70. 70. Summary • Importance of the whole picture – RS seen through the lens of the whole elephant • RS research is a new field in comparison to IR • Scalability is hard! Why not learn from all of RS’s cousins: • Search • Distributed systems • Databases • Machine learning • Content analysis • … • Bridging the gap between research and engineering is an ongoing effort
  71. 71. References • Baeza-Yates, R., & Ribeiro-Neto, B. 2011. Modern information retrieval. New York: ACM Press. • Chirita, P. A., Firan, C. S., & Nejdl, W. 2007. Personalized query expansion for the web. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 7-14). ACM. • Croft, W. B., Metzler, D., & Strohman, T. 2010. Search engines: Information retrieval in practice. Reading: Addison-Wesley. • Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence. Computational linguistics, 19(1), 61-74. • Elastic, Elasticsearch: RESTful, Distributed Search & Analytics. 2015. https://www.elastic.co/products/elasticsearch. • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. 2009. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter, 11(1), 10-18. • Ihaka, R., & Gentleman, R. 1996. R: a language for data analysis and graphics. Journal of computational and graphical statistics, 5(3), 299-314.
  72. 72. References • Kantor, P. B., Rokach, L., Ricci, F., & Shapira, B. 2011. Recommender systems handbook. Springer. • Manning, C. D., Raghavan, P., & Schütze, H. 2008. Introduction to information retrieval. Cambridge: Cambridge University Press. • Qiu, F., & Cho, J. 2006. Automatic identification of user interest for personalized search. In Proceedings of the 15th international conference on World Wide Web (pp. 727-736). ACM. • Sun, J. T., Zeng, H. J., Liu, H., Lu, Y., & Chen, Z. 2005. CubeSVD: a novel approach to personalized web search. In Proceedings of the 14th international conference on World Wide Web (pp. 382-390). ACM. • Xing, B., & Lin, Z. 2006. The impact of search engine optimization on online advertising market. In Proceedings of the 8th international conference on Electronic commerce: The new e-commerce: innovations for conquering current barriers, obstacles and limitations to conducting successful business on the internet (pp. 519-529). ACM. • Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (Vol. 10, p. 10).
  73. 73. Additional Credits • Doug Kang • Data Scientist, Verizon OnCue • Federico Ponte • System Engineer from Mahisoft • Yessika Labrador • Data Engineer from Mahisoft

Editor's notes

  • Real-life recommender systems: how they work in industry, outside of academic settings
  • Rating prediction
    Ranking
    Similarity
    Filtering
    UI/Presentation

    cannot escape our own perspectives
    Which is why, if we want to do good work—and particularly if we want to innovate—we need to have ‘other.’
    why the only way to see our biases is through other people.
    -trunk - snake
    -tusk - sword
    -stomach - whale
    -tail - reed
    -leg - tree
    the elephant in the room is truth
  • KN: items such as apartments and cars are not purchased very often, therefore rating-based systems often do not perform well. For example, only financial services that support the investment period should be recommended
  • Diana
  • Similarity calculation aka similarity-based retrieval
    Rocchio’s relevance feedback in the 1960s
  • Challenge to deal with unsatisfiable REQ and empty result sets
    User-defined – explicit
    Utility scoring rules – predefined
    Conjoint – analysis of past interactions

    Geo sim
  • Database Systems
    Data retrieval with no scoring
    AI Systems
    Games: e.g. Chess – possible next moves with associated move quality score
    E-Commerce Systems
    Sorting can be based on relevance score, proximity (geo-location), price, etc.
    Recommender Systems
    Taste scoring based on content and/or user similarity
    Advertising Systems
    Two-way search (matching) and optimization problem with scores modeled as bids
  • Pro Boolean:
    Popular retrieval model because:
    Easy to understand for simple queries.
    Clean formalism.
    Boolean models can be extended to include ranking.
    Reasonably efficient implementations possible for normal queries
    Cons Boolean:
    Very rigid: AND means all; OR means any.
    Difficult to express complex user requests.
    Difficult to control the number of documents retrieved.
    All matched documents will be returned.
    Difficult to rank output.
    All matched documents logically satisfy the query.
    Difficult to perform relevance feedback.
    If a document is identified by the user as relevant or irrelevant, how should the query be modified?

    Pro VSP
    Similarity based on occurrence frequencies of keywords in query and document.
    Automatic relevance feedback can be supported:
    Relevant documents “added” to query.
    Irrelevant documents “subtracted” from query.
    Cons VSP:
    How to determine important words in a document?
    Word sense?
    Word n-grams (and phrases, idioms, …) → terms (old school)
    How to determine the degree of importance of a term within a document and within the entire collection?
    How to determine the degree of similarity between a document and the query?
    In the case of the web, what is the collection and what are the effects of links, formatting information, etc.?



  • The inner product is unbounded.

    Favors long documents with a large number of unique terms.

    Measures how many terms matched but not how many terms are not matched.

    Explain what an index is
  • Search engines are widely used to solve non-IR search problems and here is why:
    Available: popular open-source search engines (e.g., Apache Solr and Elasticsearch) based on a mature search library (Apache Lucene)
    Fast and Scalable: distributed (inverted) indexing and retrieval via scatter/gather techniques
    Integrates well with existing data stores (SQL and No-SQL)
  • They implement text-based scoring, including fuzzy match and some variation of TF-IDF or Okapi BM25, etc.
    Filter using querying a search engine
    Rank results based on a pre-generated ML predictive model
  • Search engines are not ACID databases. By nature they are not transactional but eventually consistent.
  • There is more to Recsys than algorithms and ranking
    - Retrieval
    - User Interface & Feedback
    - Data
    - AB Testing
    - Systems & Architectures
  • Machine Learning can allow learning a user model or profile of a particular user based on:
    Sample interaction
    Rated examples
    This model or profile can then be used to:
    Recommend items
    Filter information
    Predict behavior
  • Popularity is the obvious baseline
    Ratings prediction is a clear secondary data input
    that allows for personalization
    1. Pointwise
    - Ranking function minimizes loss function defined on
    individual relevance judgment
    - Ranking score based on regression or classification
    - Ordinal regression, Logistic regression, SVM, GBDT, …
    2. Pairwise
    -Loss function is defined on pair-wise preferences
    -Goal: minimize number of inversions in ranking
    -Ranking problem is then transformed into the binary
    classification problem
    -RankSVM, RankBoost, RankNet, FRank, …
    3.Listwise
    - Indirect Loss Function
    − RankCosine: similarity between ranking list
    and ground truth as loss function
    − ListNet: KL-divergence as loss function by
    defining a probability distribution
    − Problem: optimization of listwise loss function
    may not optimize IR metrics
    - Directly optimizing IR measures (difficult since they
    are not differentiable)
  • You write a single program, similar to DryadLINQ
    Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops
    Variables in the driver program can be used in parallel ops; accumulators useful for sending information back, cached vars are an optimization
    Mention cached vars useful for some workloads that won’t be shown here
    Mention it’s all designed to be easy to distribute in a fault-tolerant fashion
  • Basic statistics
    summary statistics
    correlations
    stratified sampling
    hypothesis testing
    random data generation
    Classification and regression
    linear models (SVMs, logistic regression, linear regression)
    naive Bayes
    decision trees
    ensembles of trees (Random Forests and Gradient-Boosted Trees)
    isotonic regression
    Collaborative filtering
    alternating least squares (ALS)
    Clustering
    k-means
    Gaussian mixture
    power iteration clustering (PIC)
    latent Dirichlet allocation (LDA)
    streaming k-means
    Dimensionality reduction
    singular value decomposition (SVD)
    principal component analysis (PCA)
    Feature extraction and transformation
    Frequent pattern mining
    FP-growth
    association rules
    PrefixSpan
    Evaluation metrics
    PMML model export
    Optimization (developer)
    stochastic gradient descent
    limited-memory BFGS (L-BFGS)
  • There is more to Recsys than algorithms and ranking
    - Retrieval
    - User Interface & Feedback
    - Data
    - AB Testing
    - Systems & Architectures
  • There is more to Recsys than algorithms and ranking
    - Retrieval
    - User Interface & Feedback
    - Data
    - AB Testing
    - Systems & Architectures
  • Performance
    It may be a problem if the search space is very large and/or the computation too intensive
    Operations
    Code running on a key infrastructure
    People are more hesitant to touch an infrastructure/DB component such as Elasticsearch. Similar concerns exist surrounding DB stored procedures.
    No way to sandbox a native plugin
    Requires strong automated regression and performance testing
    How to handle versioning and binary compatibility
    Potential deployment issues
    Upgrades to Elasticsearch, the plugin code, and/or the models may present challenges
