SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
Near Real-time Webpage Recs!
“One at a time”!
Using Content Features
Presented at Text-By-The-Bay, SF 
Ashok Venkatesan
Sr. Research Engineer, StumbleUpon
•  Overview About us
•  Recommendations (Recs) What, How and More
•  Content Understanding Overview, Content Categorization
•  Serendipitous Recs Methodology
•  Future
Agenda
OVERVIEW	
  
StumbleUpon – Choose Topics, Discover Content
Bookmark, Organize and Share
RECOMMENDATIONS	
  
Recommendations – Matching User With Content
1. Understand User
 2. Understand Content
 3. Recommend
 4. Get Feedback
TELEVISON
MUSIC
TRENDING
FRIENDS
LIKEMINDED
USERS
EXPERTS
ANIMALS
DOGS
PHOTOGRAPHY
MOVIES
ARTS
HUMOR
•  Understand User
–  User Quality
–  Latent Interests
•  Understand Content
–  Content Quality 
–  {Spam, Dupes} Detection
–  Spotting Dead + Parked Pages
–  Handling different content types
•  Recommendations
–  Rec Quality
–  Managing item supply/demand + churn
–  Keep learning
•  Business Related
–  One rec at a time
–  Don’t show already seen content
–  Constantly changing product/use-cases
Problems and Challenges
Architecture I
Ingestion Queue	
Discovery Queue	
Content
Analysis
MySQL	
Recommendation
Engine	
1. INGESTION
Cold Start
Model	
HBase	
 ES	
New
Content	
Event
Consumers	
3. OFFLINE COMPUTATIONS
2. CHECK QUALITY
4. RECS
5. ONLINE COMPUTATION
Rec Models	
Rec Models	
Rec Models	
Event Queue
Architecture II
Rec Models	
Rec Models	
Rec Models	
Rec Models	
Filter
 Sort
Method
Recommendation Strategy
Cache
Mixer
Event
Consumers	
Recommendation Engine
1.  Ingestion Entry point for items; Feature
extraction
2.  Initial Recs
–  Cold start Optimize for maximum expected
positive ratings and satisfy item demand
3.  Head Recs
–  Trending Popular in the short run
–  Timeless Popular in the long run
4.  Tail Recs
–  Collaborative Filtering Nearest neighbors
based on user signals
–  Serendipitous Recs Unexpected but relevant
Item Lifecycle in a Recommender System
•  What?
–  Recommend the “unexpected but useful”
–  “Go beyond relevance” and look for “interestingness”
•  Why?
–  “Helps avoid tunnel vision”
–  Allows exploration, enables true discovery
•  Challenges
–  Figuring out how to serve good content that is not random, is
unexpected but useful…Hmm
–  Measuring/Controlling serendipity
Serendipitous Recs
Bordino I. et al., Penguins in sweaters, or serendipitous entity search on user-generated content, CIKM 2013
Serendipitous Surfing
CONTENT	
  UNDERSTANDING	
  
•  Some relevant features
–  Topics
–  Keywords
–  Language
–  Mime Type
–  Content Type
–  Number of Ads
–  Number of Links
–  Responsive Design
–  Overlays/Popups and more…
Characterizing Content
•  What
–  Discover underlying topic structure
–  Annotate documents
–  Index/Search documents
•  Applications
–  Content Categorization
–  Dimensionality Reduction
–  User modeling
Topic Models
Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4
(2012): 77-84.
•  Requirements
–  Categorize documents to a predefined topic taxonomy
–  Cluster documents into broader groups
•  Applications
–  Search and recommendations
–  Evaluate user’s categorization/re-categorization of documents
–  Discovering related interests
Content Categorization
Feature Extraction
Wikipedia
Check
Stem	
Detect
Language 
Parse
Build n-grams
Cleanup
Remove
Boilerplate
Categorize
Milne, David, and Ian H. Witten. "An open-source toolkit for mining Wikipedia."
Artificial Intelligence 194 (2013): 222-239.
•  Naïve Bayes Classifier!
–  Words are generated from one mixture!
–  Words are independently distributed given
topic!
Content Categorization: Discovering topics
zd
wdn
Nd M
•  Supervised and generative (fully probabilistic)
•  Efficient in both training and classification
•  Easy to Implement an online version
•  Dependent on a static vocabulary (retrain if vocabulary
changes)
•  Very restricted means of identifying semantic groups from
a mixture model 
Properties
Content Categorization: Discovering Semantic Groups
3Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal of machine Learning research 3 (2003): 993-1022.
image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/
•  Unsupervised (Classic LDA) and generative
•  Well suited for domain adaptation (taxonomy shift)
•  Allows making topic clusters as loose/tight as
needed
–  controls the peak-ness of the document-topic
distributions
–  controls the peak-ness of the topic-word
distributions
•  Can be extended to discover relations,
hierarchies, etc.,
Properties
α
η
•  Periodically evaluate the model
•  Perplexity
–  Measure of how surprised the model is on an average when having to guess
between k equally probable choices.
–  The average log probability of the trained model having seen the test samples
•  Use human judgment from word intrusion and topic
intrusion tasks

•  Good topic associations can be initialized from previous
trainings or from separate topic clustering
Evaluation + Relearning
2Entropy
= 2
− p log p∑
Topic Mixtures
SERENDIPITOUS	
  RECS	
  
•  Dimensionality Reduction
–  Build LDA model using “Head” URLs
–  Use the model to classify “Tail” URLs in Latent Topic Space
•  Document Graph
–  Compute pairwise similarity between documents with topic
overlaps Cosine Similarity, Weighted Jaccard
–  Build a graph where documents make up the nodes and the
similarity score make up the edge weights.
•  Page Rank
–  Run topic sensitive page rank over the document graph.
–  Spot influential documents per topic and index for fast retrieval
Methodology
Classic Pagerank
Image courtesy: http://parkcu.com/blog/pagerank/
Page, Lawrence, et al. "The PageRank citation ranking: Bringing order to the web." (1999).
Topic Sensitive or Personalized Pagerank
Image courtesy: http://parkcu.com/blog/pagerank/
Haveliwala, Taher H. "Topic-sensitive pagerank." Proceedings of the 11th
international conference on World Wide Web. ACM, 2002.
•  A/B Testing
– Measure the difference in user behavior
(implicit/explicit signals and retention):
•  “A Recommended item” vs. “Randomly picked item
from the set”
•  “Serendipity free stumbling session” vs. “Sessions
with serendipitous recommendations”
Evaluation
•  More online computations
•  Model/Feature Improvements
– E.g. authorship, published date, device
optimizations etc.,
•  Dupe detection improvements
•  Better Recs! J
Ongoing Work
Stack
THANKS	
  

Contenu connexe

Tendances (10)

COMM 1180 Level 2 - MET (Neigh) March 2013
COMM 1180 Level 2 - MET (Neigh) March 2013COMM 1180 Level 2 - MET (Neigh) March 2013
COMM 1180 Level 2 - MET (Neigh) March 2013
 
Terminology Management in DITA
Terminology Management in DITATerminology Management in DITA
Terminology Management in DITA
 
COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013
COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013
COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013
 
Toelf reading strategies
Toelf reading strategiesToelf reading strategies
Toelf reading strategies
 
Fri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineeringFri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineering
 
Social development library training
Social development library trainingSocial development library training
Social development library training
 
Databases mtcp2
Databases mtcp2Databases mtcp2
Databases mtcp2
 
Writing Seminar Dowland
Writing Seminar DowlandWriting Seminar Dowland
Writing Seminar Dowland
 
Distributed machine learning examples
Distributed machine learning examplesDistributed machine learning examples
Distributed machine learning examples
 
Cutting Edge Technology used in ePADD
Cutting Edge Technologyused in ePADDCutting Edge Technologyused in ePADD
Cutting Edge Technology used in ePADD
 

Similaire à Near Real-time Web-Page Recs Using Content Features

Data analysis – qualitative data presentation 2
Data analysis – qualitative data   presentation 2Data analysis – qualitative data   presentation 2
Data analysis – qualitative data presentation 2
Azura Zaki
 
Lecture 9.28.10
Lecture 9.28.10Lecture 9.28.10
Lecture 9.28.10
VMRoberts
 
Webscale Discovery and Information Literacy
Webscale Discovery and Information LiteracyWebscale Discovery and Information Literacy
Webscale Discovery and Information Literacy
Charleston Conference
 
Graduate Student Seminar: Writing an Academic Paper
Graduate Student Seminar: Writing an Academic PaperGraduate Student Seminar: Writing an Academic Paper
Graduate Student Seminar: Writing an Academic Paper
dalwritingcentre
 

Similaire à Near Real-time Web-Page Recs Using Content Features (20)

Encyclopedias
EncyclopediasEncyclopedias
Encyclopedias
 
Designing e-Learning Objects
Designing e-Learning ObjectsDesigning e-Learning Objects
Designing e-Learning Objects
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
 
Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...
Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...
Presentation to 2014 University of Guelph Accessibility Conference Perspectiv...
 
Data analysis – qualitative data presentation 2
Data analysis – qualitative data   presentation 2Data analysis – qualitative data   presentation 2
Data analysis – qualitative data presentation 2
 
Lecture 9.28.10
Lecture 9.28.10Lecture 9.28.10
Lecture 9.28.10
 
Dissertation literature search
Dissertation literature searchDissertation literature search
Dissertation literature search
 
Implimenting and Mitigating Change with all of this Newfangled Technology
Implimenting and Mitigating Change with all of this Newfangled TechnologyImplimenting and Mitigating Change with all of this Newfangled Technology
Implimenting and Mitigating Change with all of this Newfangled Technology
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
Webscale Discovery and Information Literacy
Webscale Discovery and Information LiteracyWebscale Discovery and Information Literacy
Webscale Discovery and Information Literacy
 
Webscale discovery and information literacy
Webscale discovery and information literacyWebscale discovery and information literacy
Webscale discovery and information literacy
 
Using exemplars to discuss variable source quality in computer science
Using exemplars to discuss variable source quality in computer scienceUsing exemplars to discuss variable source quality in computer science
Using exemplars to discuss variable source quality in computer science
 
Design to Refine: Developing a tunable information architecture
Design to Refine: Developing a tunable information architectureDesign to Refine: Developing a tunable information architecture
Design to Refine: Developing a tunable information architecture
 
Corp3400 econ3530 ente3532 2015
Corp3400 econ3530 ente3532 2015Corp3400 econ3530 ente3532 2015
Corp3400 econ3530 ente3532 2015
 
Written expression 4 presentation
Written expression 4 presentationWritten expression 4 presentation
Written expression 4 presentation
 
Written expression 4 presentation
Written expression 4 presentationWritten expression 4 presentation
Written expression 4 presentation
 
Enhancing Information Retrieval by Personalization Techniques
Enhancing Information Retrieval by Personalization TechniquesEnhancing Information Retrieval by Personalization Techniques
Enhancing Information Retrieval by Personalization Techniques
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
Graduate Student Seminar: Writing an Academic Paper
Graduate Student Seminar: Writing an Academic PaperGraduate Student Seminar: Writing an Academic Paper
Graduate Student Seminar: Writing an Academic Paper
 
Knowledge engineering and the Web
Knowledge engineering and the WebKnowledge engineering and the Web
Knowledge engineering and the Web
 

Dernier

Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
dharasingh5698
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
MsecMca
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 

Dernier (20)

Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 

Near Real-time Web-Page Recs Using Content Features

  • 1. Near Real-time Webpage Recs! “One at a time”! Using Content Features Presented at Text-By-The-Bay, SF Ashok Venkatesan Sr. Research Engineer, StumbleUpon
  • 2. •  Overview About us •  Recommendations (Recs) What, How and More •  Content Understanding Overview, Content Categorization •  Serendipitous Recs Methodology •  Future Agenda
  • 4. StumbleUpon – Choose Topics, Discover Content
  • 7. Recommendations – Matching User With Content 1. Understand User 2. Understand Content 3. Recommend 4. Get Feedback TELEVISON MUSIC TRENDING FRIENDS LIKEMINDED USERS EXPERTS ANIMALS DOGS PHOTOGRAPHY MOVIES ARTS HUMOR
  • 8. •  Understand User –  User Quality –  Latent Interests •  Understand Content –  Content Quality –  {Spam, Dupes} Detection –  Spotting Dead + Parked Pages –  Handling different content types •  Recommendations –  Rec Quality –  Managing item supply/demand + churn –  Keep learning •  Business Related –  One rec at a time –  Don’t show already seen content –  Constantly changing product/use-cases Problems and Challenges
  • 9. Architecture I Ingestion Queue Discovery Queue Content Analysis MySQL Recommendation Engine 1. INGESTION Cold Start Model HBase ES New Content Event Consumers 3. OFFLINE COMPUTATIONS 2. CHECK QUALITY 4. RECS 5. ONLINE COMPUTATION Rec Models Rec Models Rec Models Event Queue
  • 10. Architecture II Rec Models Rec Models Rec Models Rec Models Filter Sort Method Recommendation Strategy Cache Mixer Event Consumers Recommendation Engine
  • 11. 1.  Ingestion Entry point for items; Feature extraction 2.  Initial Recs –  Cold start Optimize for maximum expected positive ratings and satisfy item demand 3.  Head Recs –  Trending Popular in the short run –  Timeless Popular in the long run 4.  Tail Recs –  Collaborative Filtering Nearest neighbors based on user signals –  Serendipitous Recs Unexpected but relevant Item Lifecycle in a Recommender System
  • 12. •  What? –  Recommend the “unexpected but useful” –  “Go beyond relevance” and look for “interestingness” •  Why? –  “Helps avoid tunnel vision” –  Allows exploration, enables true discovery •  Challenges –  Figuring out how to serve good content that is not random, is unexpected but useful…Hmm –  Measuring/Controlling serendipity Serendipitous Recs Bordino I. et al., Penguins in sweaters, or serendipitous entity search on user-generated content, CIKM 2013
  • 15. •  Some relevant features –  Topics –  Keywords –  Language –  Mime Type –  Content Type –  Number of Ads –  Number of Links –  Responsive Design –  Overlays/Popups and more… Characterizing Content
  • 16. •  What –  Discover underlying topic structure –  Annotate documents –  Index/Search documents •  Applications –  Content Categorization –  Dimensionality Reduction –  User modeling Topic Models Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
  • 17. •  Requirements –  Categorize documents to a predefined topic taxonomy –  Cluster documents into broader groups •  Applications –  Search and recommendations –  Evaluate user’s categorization/re-categorization of documents –  Discovering related interests Content Categorization
  • 18. Feature Extraction Wikipedia Check Stem Detect Language Parse Build n-grams Cleanup Remove Boilerplate Categorize Milne, David, and Ian H. Witten. "An open-source toolkit for mining Wikipedia." Artificial Intelligence 194 (2013): 222-239.
  • 19. •  Naïve Bayes Classifier! –  Words are generated from one mixture! –  Words are independently distributed given topic! Content Categorization: Discovering topics zd wdn Nd M
  • 20. •  Supervised and generative (fully probabilistic) •  Efficient in both training and classification •  Easy to Implement an online version •  Dependent on a static vocabulary (retrain if vocabulary changes) •  Very restricted means of identifying semantic groups from a mixture model Properties
  • 21. Content Categorization: Discovering Semantic Groups 3Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal of machine Learning research 3 (2003): 993-1022. image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/
  • 22. •  Unsupervised (Classic LDA) and generative •  Well suited for domain adaptation (taxonomy shift) •  Allows making topic clusters as loose/tight as needed –  controls the peak-ness of the document-topic distributions –  controls the peak-ness of the topic-word distributions •  Can be extended to discover relations, hierarchies, etc., Properties α η
  • 23. •  Periodically evaluate the model •  Perplexity –  Measure of how surprised the model is on an average when having to guess between k equally probable choices. –  The average log probability of the trained model having seen the test samples •  Use human judgment from word intrusion and topic intrusion tasks •  Good topic associations can be initialized from previous trainings or from separate topic clustering Evaluation + Relearning 2Entropy = 2 − p log p∑
  • 26. •  Dimensionality Reduction –  Build LDA model using “Head” URLs –  Use the model to classify “Tail” URLs in Latent Topic Space •  Document Graph –  Compute pairwise similarity between documents with topic overlaps Cosine Similarity, Weighted Jaccard –  Build a graph where documents make up the nodes and the similarity score make up the edge weights. •  Page Rank –  Run topic sensitive page rank over the document graph. –  Spot influential documents per topic and index for fast retrieval Methodology
  • 27. Classic Pagerank Image courtesy: http://parkcu.com/blog/pagerank/ Page, Lawrence, et al. "The PageRank citation ranking: Bringing order to the web." (1999).
  • 28. Topic Sensitive or Personalized Pagerank Image courtesy: http://parkcu.com/blog/pagerank/ Haveliwala, Taher H. "Topic-sensitive pagerank." Proceedings of the 11th international conference on World Wide Web. ACM, 2002.
  • 29. •  A/B Testing – Measure the difference in user behavior (implicit/explicit signals and retention): •  “A Recommended item” vs. “Randomly picked item from the set” •  “Serendipity free stumbling session” vs. “Sessions with serendipitous recommendations” Evaluation
  • 30. •  More online computations •  Model/Feature Improvements – E.g. authorship, published date, device optimizations etc., •  Dupe detection improvements •  Better Recs! J Ongoing Work
  • 31. Stack