Machine Learning &
Big Data @ Spotify
Andy Sloane
@a1k0n
http://a1k0n.net
Madison Big Data Meetup
Jan 27, 2015

Machine learning @ Spotify - Madison Big Data Meetup


I cover how music discovery works at Spotify -- not only the models and algorithms, but the engineering challenges of making them practical.

Published in: Data & Analytics


  1. Machine Learning & Big Data @ Spotify
     Andy Sloane, @a1k0n, http://a1k0n.net
     Madison Big Data Meetup, Jan 27, 2015
  2. Big data?
     - 60M Monthly Active Users (MAU)
     - 50M tracks in our catalog
     - ...but many are identical copies from different releases (e.g. US and UK releases of the same album)
     - ...and only 4M unique songs have been listened to >500 times
  3. Big data?
     - Raw material: application logs, delivered via Apache Kafka
     - "Wake Me Up" by Avicii has been played 330M times, by ~6M different users
     - "EndSong": 500GB / day
     - ...but aggregated per-user play counts for a whole year fit in ~60GB ("medium data")
  4. Hadoop @ Spotify
     - 900 nodes (all in London datacenter)
     - 34 TB RAM total
     - ~16000 typical concurrent tasks (mappers/reducers)
     - 2GB RAM per mapper/reducer slot
  5. What do we need ML for?
     - Recommendations
     - Related Artists
     - Radio
  6. Recommendations
  7. The Discover page
     - 4M tracks x 60M active users, rebuilt daily
  8. The Discover page
     - Okay, but how do we come up with recommendations?
     - Collaborative filtering!
  9. Collaborative filtering
  10. Collaborative filtering
     - Great, but how does that actually work?
     - Each time a user plays something, add it to a matrix
     - Compute similarity, somehow, between items based on who played what
  11. Collaborative filtering
     - So compute some distance between every pair of rows and columns
     - That's just $O\left(\frac{(60\mathrm{M})^2}{2}\right) = O(1.8 \times 10^{15})$ operations... O_O
     - We need a better way...
     - (BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum) I've tried it but don't have results to report here yet :(
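To make the all-pairs cost concrete, here is a small arithmetic sketch. It assumes "every pair" means unordered pairs of the 60M users, which is one reading of the slide; the function name is illustrative.

```python
def unordered_pairs(n):
    """Number of distinct unordered pairs among n rows: n * (n - 1) / 2."""
    return n * (n - 1) // 2

users = 60_000_000  # 60M monthly active users, from the slides
pairs = unordered_pairs(users)
print(f"{pairs:.1e} user pairs")  # prints 1.8e+15 user pairs
```

Each of those pairs would still need a distance computation over the full play history, so the real cost is higher than the pair count alone.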
  12. Collaborative filtering: latent factor models
     - Instead, we use a "small" representation for each user & item: $f$-dimensional vectors (here, $f = 2$)
     - ...and approximate the big matrix with it.
  13. Why vectors?
     - Very compact representation of musical style or user's taste
     - Only like 40-200 elements (2 shown above for illustration)
  14. Why vectors?
     - Dot product between items = similarity between items
     - Dot product between user and item vectors = good/bad recommendation
     - Worked example:

           user     item    product
             2   x    4   =    8
            -4   x    0   =    0
             2   x   -2   =   -4
            -1   x    5   =   -5
                      sum =   -1
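The worked example above can be reproduced with numpy. This is only a sketch of the scoring step; the variable names are illustrative and the four-element vectors are the ones from the slide.

```python
import numpy as np

# The four-dimensional user and item vectors from the worked example.
user = np.array([2, -4, 2, -1])
item = np.array([4, 0, -2, 5])

# Element-wise products, then their sum: the dot product.
products = user * item
score = int(user @ item)

print(products.tolist())  # [8, 0, -4, -5]
print(score)              # -1
```

A negative score like this one would mark the item as a poor recommendation for this user.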
  15. Recommendations via dot products
  16. Another example of tracks in two dimensions
  17. Implicit Matrix Factorization
     - Hu, Koren, Volinsky: Collaborative Filtering for Implicit Feedback Datasets
     - Tries to predict whether user $u$ listens to item $i$:

       $$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X Y^T$$

     - $Y$ is all item vectors, $X$ is all user vectors
     - "Implicit" because users don't tell us what they like, we only observe what they do/don't listen to
  18. Implicit Matrix Factorization
     - Goal: make $x_u \cdot y_i$ close to 1 for things each user has listened to, 0 for everything else.
     - $x_u$: user $u$'s vector
     - $y_i$: item $i$'s vector
     - $p_{ui}$: 1 if user $u$ played item $i$, 0 otherwise
     - $c_{ui}$: "confidence", ad-hoc weight based on number of times user $u$ played item $i$; e.g., $1 + \alpha \cdot \mathrm{count}$
     - $\lambda$: regularization penalty to avoid overfitting
     - Minimize:

       $$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$
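As a rough illustration of the objective, the following sketch evaluates it on a small dense play-count matrix. It assumes the confidence is `1 + alpha * count`; the function name and the dense layout are illustrative only, since at real scale the matrix would never be materialized like this.

```python
import numpy as np

def implicit_mf_loss(counts, X, Y, alpha, lam):
    """Weighted squared error plus L2 penalty, on a dense users x items count matrix."""
    P = (counts > 0).astype(float)      # p_ui: 1 if played, 0 otherwise
    C = 1.0 + alpha * counts            # c_ui: assumed confidence 1 + alpha * count
    err = P - X @ Y.T                   # p_ui - x_u . y_i for every (u, i)
    penalty = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return float(np.sum(C * err ** 2) + penalty)

counts = np.array([[2.0, 0.0],
                   [0.0, 1.0]])
X = np.zeros((2, 3))                    # two users, f = 3
Y = np.zeros((2, 3))                    # two items, f = 3
loss_at_zero = implicit_mf_loss(counts, X, Y, alpha=1.0, lam=0.1)
```

With all-zero vectors the loss is just the summed confidence of the played cells, which gives a quick sanity check before any training.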
  19. Alternating Least Squares
     - Solution: alternate solving for all users $x_u$:

       $$x_u = \left( Y^T Y + Y^T (C^u - I) Y + \lambda I \right)^{-1} Y^T C^u p_{u\cdot}$$

     - and all items $y_i$:

       $$y_i = \left( X^T X + X^T (C^i - I) X + \lambda I \right)^{-1} X^T C^i p_{\cdot i}$$

     - $Y^T Y$ = $f \times f$ matrix, sum of outer products of all items
     - $Y^T (C^u - I) Y$ = same, except only items the user played
     - $Y^T C^u p_{u\cdot}$ = weighted $f$-dimensional sum of items the user played
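The user update can be sketched in numpy as follows. This is a single-machine illustration under the assumption that confidence is `1 + alpha * count`; the function and argument names are hypothetical, not taken from the production code.

```python
import numpy as np

def solve_user(Y, counts_u, alpha, lam):
    """One ALS update for a single user, touching only the items that user played."""
    f = Y.shape[1]
    played = np.nonzero(counts_u)[0]
    Yp = Y[played]                               # vectors of played items
    conf = 1.0 + alpha * counts_u[played]        # c_ui for played items (assumed form)
    YtY = Y.T @ Y                                # shared by all users; precompute in practice
    # Y^T (C^u - I) Y: only played items contribute, since c_ui - 1 = 0 elsewhere.
    A = YtY + (Yp.T * (conf - 1.0)) @ Yp + lam * np.eye(f)
    # Y^T C^u p_u: confidence-weighted sum of played item vectors.
    b = Yp.T @ conf
    return np.linalg.solve(A, b)
```

The item update is the same function with the roles of users and items swapped, which is why the work per iteration stays linear in the number of observed plays.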
  20. Alternating Least Squares
     - Key point: each iteration is linear in size of input, even though we are solving for all users x all items, and needs only $f^2$ memory to solve
     - No learning rates, just a few tunable parameters ($f$, $\lambda$, $\alpha$)
     - All you do is add stuff up, solve an $f \times f$ matrix problem, and repeat!
     - We use $f = 40$ dimensional vectors for recommendations
     - Matrix/vector math using numpy in Python, breeze in Scala
  21. Alternating Least Squares: adding lots of stuff up
     - Problem: any user (60M) can play any item (4M), thus we may need to add any user's vector to any item's vector
     - If we put user vectors in memory, it takes a lot of RAM!
     - Worst case: 60M users * 40 dimensions * sizeof(float) = 9.6GB of user vectors
     - ...too big to fit in a mapper slot on our cluster
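The memory estimate works out as below. The sketch assumes a 4-byte float and treats the 2GB mapper slot from the Hadoop slide as 2 * 10^9 bytes.

```python
users = 60_000_000
dims = 40
bytes_per_float = 4              # assumption: 32-bit floats

total_bytes = users * dims * bytes_per_float
slot_bytes = 2 * 10**9           # assumption: 2GB slot, decimal gigabytes

print(total_bytes / 1e9, "GB")   # 9.6 GB
print(total_bytes / slot_bytes)  # 4.8
```

So the full set of user vectors is nearly five times the size of a single slot, before counting any other working memory.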
  22. Adding lots of stuff up
     - Solution: split the data into a $K \times L$ matrix
     - Most recent run made a 14 x 112 grid
  23. One map shard
     - Input is a bunch of (user, item, count) tuples
     - user is the same modulo K for all users
     - item is the same modulo L for all items
     - e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ...
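A minimal sketch of that sharding rule, assuming users and items are identified by integers. The helper names `shard_key` and `partition` are illustrative, not part of the real job.

```python
from collections import defaultdict

def shard_key(user, item, K, L):
    """Grid cell for a tuple: users agree modulo K, items agree modulo L."""
    return (user % K, item % L)

def partition(tuples, K, L):
    """Group (user, item, count) tuples by grid cell."""
    shards = defaultdict(list)
    for user, item, count in tuples:
        shards[shard_key(user, item, K, L)].append((user, item, count))
    return shards

plays = [(1, 10, 3), (5, 14, 1), (2, 10, 7), (9, 11, 2)]
shards = partition(plays, K=4, L=4)
```

Each cell then only needs the user vectors for one residue class modulo K and the item vectors for one residue class modulo L, which is what brings the per-task memory down.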
  24. Adding stuff up
     - Add up vectors from every (user, item, count) data point
     - Then flip users ↔ items and repeat!

         import numpy as np

         def mapper(self, line):  # Luigi-style python job
             user, item, count = parse(line)
             conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha*count
             # add up user vectors from previous iteration
             x = self.user_vectors[user]
             term1 = conf * x                      # f-dimensional vector
             term2 = np.outer(x, x) * (conf - 1)   # f x f matrix
             yield item, (term1, term2)            # tuple, since the shapes differ

         def reducer(self, item, terms):
             # sum the vector parts and the matrix parts separately
             term1 = np.zeros(self.dim)
             term2 = np.zeros((self.dim, self.dim))
             for t1, t2 in terms:
                 term1 += t1
                 term2 += t2
             # self.YTY: precomputed Gram matrix of the vectors held fixed in this pass
             item_vector = np.linalg.solve(
                 self.YTY + term2 + self.l2penalty * np.identity(self.dim),
                 term1)
             yield item, item_vector
  25. Alternating Least Squares
     - Implemented in Java Map-Reduce framework which runs other models, too
     - After about 20 iterations, we converge
     - Each iteration takes about 20 minutes, so about 7-8 hours total
     - Recomputed from scratch weekly
     - User vectors recomputed daily, keeping items fixed
     - So we have vectors, now what?
  26. Finding Recommendations
     - 60M users x 4M recommendable items
     - For each user, how do we find the best items given their vector?
     - Brute force is O(60M x 4M x 40) = O(9 peta-operations)!
     - Instead, use an approximation based on locality sensitive hashing (LSH)
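For reference, the brute-force baseline looks like the sketch below: score every item against the user vector and keep the best `k`. The function name is illustrative, and the operation count simply multiplies out the figures from the slide.

```python
import numpy as np

def brute_force_top_k(user_vec, item_vecs, k):
    """Score every item by dot product and return the indices of the k best."""
    scores = item_vecs @ user_vec
    return np.argsort(-scores)[:k].tolist()

# Total multiply-adds for all users, using the slide's figures.
total_ops = 60_000_000 * 4_000_000 * 40

item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
best = brute_force_top_k(np.array([1.0, 0.0]), item_vecs, k=2)
print(best)  # [2, 0]
```

The result is exact, but the cost grows with users x items x dimensions, which is what the approximate approach avoids.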
  27. Approximate Nearest Neighbors / Locality-Sensitive Hashing
     - Annoy: github.com/spotify/annoy
  28. Annoy: github.com/spotify/annoy
     - Pre-built read-only database of item vectors
     - Internally, recursively splits the space with random hyperplanes
     - Nearby points are likely on the same side of a random split
     - Builds several random trees (a forest) for better approximation
     - Given an $f$-dimensional query vector, finds similar items in database
     - Index loads via mmap, so all processes on the same machine share RAM
     - Queries are very, very fast, but approximate
     - Python implementation available, Java forthcoming
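The idea of a forest of random-hyperplane trees can be sketched in plain numpy. This is a simplified illustration of the technique, not Annoy's actual implementation or API: it splits on random hyperplanes through the origin, keeps everything in memory, and all function names are hypothetical.

```python
import numpy as np

def build_tree(vectors, ids, rng, leaf_size=8):
    """Recursively split ids by which side of a random hyperplane each vector falls on."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    normal = rng.normal(size=vectors.shape[1])   # random hyperplane through the origin
    side = vectors[ids] @ normal > 0
    left, right = ids[side], ids[~side]
    if len(left) == 0 or len(right) == 0:        # degenerate split: stop here
        return ("leaf", ids)
    return ("split", normal,
            build_tree(vectors, left, rng, leaf_size),
            build_tree(vectors, right, rng, leaf_size))

def candidates(tree, query):
    """Walk down to the leaf that lies on the query's side of every split."""
    while tree[0] == "split":
        _, normal, left, right = tree
        tree = left if query @ normal > 0 else right
    return tree[1]

def query_forest(forest, vectors, query, k):
    """Union the leaves found in every tree, then rank those candidates exactly."""
    cand = np.unique(np.concatenate([candidates(t, query) for t in forest]))
    scores = vectors[cand] @ query
    return cand[np.argsort(-scores)[:k]].tolist()

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 5))
forest = [build_tree(vectors, np.arange(200), rng) for _ in range(10)]
neighbors = query_forest(forest, vectors, vectors[17], k=3)
```

Only the handful of candidates in the matching leaves are scored, so a query is cheap; using more trees raises the chance that a true neighbor lands in at least one of those leaves.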
  29. Generating recommendations
     - Annoy index for all items is only 1.2GB
     - I have one on my laptop... Live demo!
     - Could serve up nearest neighbors at load time, but we precompute Discover on Hadoop
  30. Generating recommendations in parallel
     - Send annoy index in distributed cache, load it via mmap in map-reduce process
     - Reducer loads vectors + user stats, looks up ANN, generates recommendations.
  31. Related Artists
  32. Related Artists
     - Great for music discovery
     - Essential for finding believable reasons for latent factor-based recommendations
     - When generating recommendations, run through a list of related artists to find potential reasons
  33. Similar items use cosine distance
     - Cosine is similar to dot product; just add a normalization step
     - Helps "factor out" popularity from similarity
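A small sketch of the difference between the two measures. It assumes, for illustration only, that a popular item ends up with a longer vector than a niche item pointing in the same direction; the vectors are made up.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product after normalizing both vectors to unit length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 1.0])
niche = np.array([1.0, 1.0])      # hypothetical niche item
popular = np.array([10.0, 10.0])  # hypothetical popular item, same direction

dot_niche, dot_popular = float(query @ niche), float(query @ popular)
cos_niche, cos_popular = cosine_similarity(query, niche), cosine_similarity(query, popular)
```

The raw dot products differ by a factor of ten, while the cosine values are equal, since normalization discards vector length and keeps only direction.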
  34. Related Artists: how we build it
     - Similar to user recommendations, but with more models, not necessarily collaborative filtering based:
       - Implicit Matrix Factorization (shown previously)
       - "Vector-Exp", similar model but probabilistic in nature, trained with gradient descent
       - Google word2vec on playlists
       - Echo Nest "cultural similarity", based on scraping web pages about music!
     - Query ANNs to generate candidates
     - Score candidates from all models, combine and rank
     - Pre-build table of 20 nearest artists to each artist
  35. Radio
  36. Radio
     - ML-wise, exactly the same as Related Artists!
     - For each track, generate candidates with ANN from each model
     - Score w/ all models, rank with ensemble
     - Store top 250 nearest neighbors in a database (Cassandra)
     - User plays radio → load 250 tracks and shuffle
     - Thumbs up → load more tracks from the thumbed-up song
     - Thumbs down → remove that song / re-weight tracks
  37. Upcoming work
     - Deep learning based item similarity
     - http://benanne.github.io/2014/08/05/spotify-cnns.html
  38. Upcoming work
     - Audio fingerprint based content deduplication
     - ~1500 Echo Nest Musical Fingerprints per track
     - Min-Hash based matching to accelerate all-pairs similarity
     - Fast connected components using Hash-to-Min algorithm: $O(\log d)$ mapreduce steps
     - http://arxiv.org/pdf/1203.5387.pdf
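Min-Hash matching can be sketched as follows. This is a generic textbook version, assuming each track is represented as a set of integer fingerprint codes; the hash family, constants, and function names are illustrative and say nothing about how the planned system would be built.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the illustrative hash family

def make_hashes(n, seed=0):
    """n random (a, b) pairs defining hash functions x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash(fingerprints, hashes):
    """Signature: the minimum hash value of the set under each hash function."""
    return [min((a * x + b) % PRIME for x in fingerprints) for a, b in hashes]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of set overlap."""
    return sum(1 for p, q in zip(sig_a, sig_b) if p == q) / len(sig_a)

hashes = make_hashes(64)
track_a = {101, 202, 303, 404, 505}
track_b = {101, 202, 303, 404, 505}   # hypothetical duplicate release
same = estimated_jaccard(minhash(track_a, hashes), minhash(track_b, hashes))
```

Tracks whose signatures agree in most positions become candidate duplicates, so only those pairs need a full comparison instead of all pairs.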
  39. Thanks!
     - I can be reached here: Andy Sloane
     - Email: andy@a1k0n.net
     - Twitter: @a1k0n
     - http://a1k0n.net
     - Special thanks to Erik Bernhardsson, whose slides I plagiarized mercilessly
