Movie topics: Efficient features for movie recommendation systems

User-written movie reviews carry a substantial amount of movie-related information, such as descriptions of location, time period, genre, and characters. Using natural language processing and topic modeling, it is possible to extract these features from reviews and find movies with similar features.

  1. Efficient Features for Movie Recommendation Systems
     Project presentation, Suvir Bhargav
  2. Outline
     ● Motivation and why movie reviews
     ● Problem statement
     ● How? The overall system
     ● Text preprocessing approaches
     ● Postprocessing: movie topics from a reviews corpus
     ● Similarity
     ● Experimental setup and results
  3. Motivation
     Thanks to Sean Lind, source: http://www.silveroakcasino.com/blog/posts/netflix/what-to-watch-on-netflix.html
  4. Motivation
     ● Movie genres are not enough.
     ● Classify movies by
       ○ keywords
       ○ moods
       ○ IMDb ratings
       ○ micro genres
  5. Micro genres
     Source: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/
  6. Why movie reviews?
     Source: a sample user-written movie review from IMDb
  7. Problem statement
     ● Extract features from user reviews of movies.
     ● Use the extracted features to find similar movies.
  8. The overall system
     Movie reviews corpus
     ● Preprocessing
       ○ tokenization, stopword removal, lemmatization
     ● Post processing
       ○ topic modeling: movie topics from a reviews corpus
     ● Similarity measure
       ○ return movies with similar topic distributions
  9. Text preprocessing
     Tokenization, stopword removal, lemmatization.
     Simple information extraction.
     Figure credit: NLTK book.
  10. Post processing
      Document representation: Vector Space Model (VSM)
      Picture credit: pyevolve
  11. Post processing: generative model
      Source: David Blei's slide
  12. Post processing: LDA
      For each document in the collection, the words can be generated in a two-stage process:
      1) Randomly choose a distribution over topics.
      2) For each word in the document:
         a) Randomly choose a topic from the distribution over topics chosen in step 1.
         b) Randomly choose a word from the corresponding distribution over the vocabulary.
      Documents exhibit multiple topics.
  13. Movie topics from a reviews corpus
  14. Similarity measure
      ● Cosine similarity
      ● KL divergence
      ● Hellinger distance
  15. Similarity measure
      Cosine similarity
  16. Similarity measure
      Hellinger distance
  17. The overall system: implementation
      Movie reviews corpus
      ● Preprocessing
        ○ nltk and gensim's simple preprocessing
      ● Post processing
        ○ gensim Python wrapper to MALLET
        ○ index the topic distributions of the query movie, q, and of the 1k-movie corpus, C
      ● Similarity measure
        ○ Python numpy implementation
        ○ apply the distance metric to the indexed q and C
        ○ sort and pick the top 5 movies
  18. Experimental setup
      Movie reviews corpus of 1k movies
      Reviews data source: IMDb
  19. Experimental setup
      Evaluation criteria
  20. Conclusion
      ● Movie topics as efficient features for recommendation systems (RS)
        ○ represent movies by underlying semantic patterns
        ○ useful for capturing movie genre and mood
        ○ but not as good at capturing plot
        ○ user-written movie reviews are useful movie metadata
      ● The developed prototype
        ○ easy to add more movie metadata
        ○ Python allows scalability
        ○ using topics as explanations needs further tuning
  21. Future directions
      ● Movie review preprocessing
        ○ bigrams, trigrams
        ○ create multi-word movie keywords or language constructions
      ● Building complex topic models
        ○ hierarchical LDA
        ○ author-topic model
          ■ include authorship information
          ■ similarity between authors
  22. Thank you
      Questions?
      Image source: http://www.brinvy.biz/177215/batman-catching-a-ride-on-supermans-back-funny-hd-wallpaper-x.html
  23. Extra slides
      List of extra slides and notes
      ● Original LDA paper
      ● Introduction to probabilistic topic modeling
      ● A. Huang's "Similarity measures for text document clustering"
      ● Another good LDA description
      ● Integrating out multinomial parameters in LDA
      ● Language construction in micro genres
  24. LDA
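
The preprocessing and vector space model steps on slides 8 to 10 can be illustrated with a minimal sketch. It uses only the Python standard library rather than the nltk and gensim tooling named on slide 17, and it skips lemmatization, which needs a lexical resource. The stopword list, the sample reviews, and all function names are assumptions made for illustration, not part of the original project.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; nltk ships a much fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "it", "this", "with"}


def preprocess(review):
    """Lowercase the text, tokenize on runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({t for tokens in token_lists for t in tokens})
    return {term: i for i, term in enumerate(terms)}


def to_count_vector(tokens, vocab):
    """Represent one document as a term-count vector (vector space model)."""
    counts = Counter(tokens)
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts.get(term, 0) for term in vocab]


# Hypothetical review snippets, invented for this example.
reviews = [
    "A dark and gritty crime thriller set in the city",
    "This thriller is dark, and the city is the star of it",
]
docs = [preprocess(r) for r in reviews]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

for tokens, vector in zip(docs, vectors):
    print(tokens, vector)
```

Each review becomes one row of counts over a shared vocabulary, which is the kind of input a topic model consumes.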

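The two-stage generative process described on slide 12 can be sketched as follows. This only simulates how LDA assumes a document is produced; it does not do inference, which the project delegates to MALLET through gensim. The two toy topics, the vocabulary, and the symmetric Dirichlet parameter `alpha` are assumptions chosen for illustration.

```python
import random


def generate_document(topic_word_probs, vocabulary, n_words, alpha, rng):
    """Simulate the LDA generative story for a single document."""
    n_topics = len(topic_word_probs)

    # Step 1: randomly choose a distribution over topics.
    # A symmetric Dirichlet draw is obtained by normalizing Gamma samples.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(gammas)
    theta = [g / total for g in gammas]

    words = []
    for _ in range(n_words):
        # Step 2a: randomly choose a topic from the distribution in step 1.
        topic = rng.choices(range(n_topics), weights=theta, k=1)[0]
        # Step 2b: randomly choose a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=topic_word_probs[topic], k=1)[0]
        words.append(word)
    return theta, words


# Hypothetical vocabulary and topics, invented for this example.
vocabulary = ["heist", "detective", "murder", "wedding", "romance", "kiss"]
topic_word_probs = [
    [0.35, 0.35, 0.30, 0.00, 0.00, 0.00],  # a "crime"-like topic
    [0.00, 0.00, 0.00, 0.30, 0.40, 0.30],  # a "romance"-like topic
]

rng = random.Random(0)  # fixed seed so repeated runs match
theta, words = generate_document(topic_word_probs, vocabulary, 8, 0.5, rng)
print("topic proportions:", theta)
print("generated words:", words)
```

Because every word draws its own topic from `theta`, a single document can mix words from several topics, which is what "documents exhibit multiple topics" refers to.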
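
The three similarity measures from slide 14 and the "sort and pick the top 5" step from slide 17 can be sketched as below. The slides mention a numpy implementation; this version uses only the standard library so that it runs anywhere. The movie titles and topic distributions are hypothetical, and ranking by Hellinger distance is an assumption, since the slides do not say which measure was used for the final ranking.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; asymmetric, eps avoids log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def hellinger_distance(p, q):
    """Symmetric distance between distributions, bounded in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def top_k_similar(query, corpus, k=5):
    """Return the k titles whose topic distributions are closest to the query."""
    scored = [(hellinger_distance(query, topics), title)
              for title, topics in corpus.items()]
    return [title for _, title in sorted(scored)[:k]]


# Hypothetical topic distributions over three topics.
corpus = {
    "Movie A": [0.8, 0.1, 0.1],
    "Movie B": [0.1, 0.8, 0.1],
    "Movie C": [0.7, 0.2, 0.1],
}
query = [0.75, 0.15, 0.10]

print(top_k_similar(query, corpus, k=2))
```

Cosine similarity grows with closeness while KL divergence and Hellinger distance shrink, so the sort direction has to match the measure. Hellinger distance is symmetric and bounded, which makes it convenient for ranking, whereas KL divergence depends on the argument order.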