Big data and machine learning @ Spotify

Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9

The talk tries to connect the theoretical machine learning class with industry examples.

Published in: Data & Analytics

  1. Big Data and Machine Learning @ Spotify. Oscar Carlsson, Data Engineer, lad@spotify.com. Friday 6/3 2015
  2. Me ● D-student starting 2009 ● Graduated last year from CSALL (student in this class 2013) ● Master thesis at Spotify ● Data Engineer at Spotify in Gothenburg
  3. Outline ● What is data at Spotify? ● Big data and processing it ● Using data at Spotify ● Machine Learning
  4. In the Machine Learning class: Supervised learning: data (X), labels (Y). Unsupervised learning: data (X).
  5. What is data at Spotify? Songs Track Metadata User generated Users Playlists Cover arts Listens Country, email etc Tracks of playlist Album Clicks Add/Removes Genres, Mood etc Page views ● 30 million songs ● 60 million monthly active users ● 58 markets ● 15 million subscribers ● 1.5 billion playlists
  6. Outline ● What is data at Spotify? ● Big data and processing it ● Using data at Spotify ● Machine Learning
  7. Big Data and processing it ● 20 TB compressed data / day ○ 200 TB generated and stored / day (replication) ● Our business is highly dependent on these logs ○ We pay artists depending on plays; plays = logs. Too much to store on a single computer. We need a cluster to process it! This is typically what is called “Big Data”.
  8. Big Data and processing it ● Distributed computing and storage ○ Hadoop ■ MapReduce ○ Cassandra ● Hadoop cluster ○ 1100 nodes ○ ~8000 jobs/day
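
Since slides 7 and 8 tie artist payments to play logs processed with MapReduce, a toy single-machine sketch of the map, shuffle and reduce phases may help. The tab-separated log format and all function names here are assumptions made for illustration; they are not Spotify's actual log schema or job code, and a real Hadoop job would run these phases across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_play(log_line):
    """Map step: emit (artist, 1) for one play.

    Assumes a hypothetical 'user<TAB>track<TAB>artist' log line.
    """
    _user, _track, artist = log_line.split("\t")
    yield artist, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_plays(artist, counts):
    """Reduce step: sum the play counts of one artist."""
    return artist, sum(counts)


def count_plays(log_lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_play(line) for line in log_lines)
    return dict(reduce_plays(k, v) for k, v in shuffle(mapped).items())


# Made-up log lines for illustration.
logs = ["u1\tt1\tartist_a", "u2\tt1\tartist_a", "u1\tt2\tartist_b"]
plays = count_plays(logs)
print(plays)
```

The map and reduce functions only see one record or one key at a time, which is what lets a framework spread them over a cluster.
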
  9. Outline ● What is data at Spotify? ● Big data and processing it ● Using data at Spotify ● Machine Learning
  10. Using data at Spotify. Every part of the company is interested in our data ● Product ○ Are people using X? Should we focus on features such as Y? ● Insights ○ What music is trending? Which artists are popular where? ● Performance ○ How is latency in country Y? Did this reduce stutter in country X?
  11. Using data at Spotify ● Data-driven decision making ○ Like... every decision. ○ Analysts / Data scientists ● A/B test everything! ● A/B testing: ○ Statistical hypothesis testing ○ Simple randomized experiment with >= 2 variants (A, B)
  12. Using data at Spotify: A/B testing (“The shuffle button”). Objective: Decrease time from loading playlist to first play. Hypothesis: The bigger the button, the faster users find it. Test setup: ● A - variant 1 ○ 2% of US and SE MAU ● B - variant 2 ○ 2% of US and SE MAU ● Control - normal ○ Rest of users in US and SE
  13. Using data at Spotify: A/B testing. CONTROL A B
  14. Analytics: A/B testing. Metric: Share of users whose first play takes > 500 ms (500 ms is made up). Let's roll out A to all users and throw away B!
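
Slide 11 describes A/B testing as statistical hypothesis testing, and slide 14 uses a share of users as the metric. One common way to compare two such shares is a two-proportion z-test. The sketch below is an illustration under that assumption; the slides do not say which test Spotify uses, and the user counts are invented.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    hits_* is the number of users over the latency threshold,
    n_* the number of users in that group.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: users whose first play took longer than the threshold.
z, p_value = two_proportion_z_test(hits_a=1100, n_a=10_000,
                                   hits_b=1200, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A negative z here means group A has the smaller share of slow first plays; whether that is enough to roll A out depends on the significance level chosen before the experiment.
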
  15. Outline ● What is data at Spotify? ● Big data and processing it ● Using data at Spotify ● Machine Learning
  16. Outline ● Machine Learning ○ User analysis ○ Artist disambiguation ○ Recommender systems
  17. “A music session somehow represents a moment for the user. Can we find these moments and describe them?”
  18. Machine Learning: Cluster user music sessions ● Take a subset of user listening data with new genre data ○ Combine listens into sessions ■ Consecutive plays, no 15 min pause ○ Session = [genres] ● Clustering algorithms to find similar sessions ○ K-means / Hierarchical clustering ● Describe the clusters using logistic regression
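
The sessionization rule on slide 18 (consecutive plays with no 15 minute pause, session = list of genres) can be sketched as follows. The input format and the choice to measure the pause between play start times are assumptions; the slide does not specify either.

```python
from datetime import datetime, timedelta

MAX_PAUSE = timedelta(minutes=15)


def split_into_sessions(plays, max_pause=MAX_PAUSE):
    """Group one user's plays into sessions.

    plays: list of (start_time, genre) tuples (hypothetical format).
    A new session starts whenever the gap to the previous play
    exceeds max_pause. Returns a list of genre lists.
    """
    sessions = []
    last_time = None
    for start_time, genre in sorted(plays):
        if last_time is None or start_time - last_time > max_pause:
            sessions.append([])
        sessions[-1].append(genre)
        last_time = start_time
    return sessions


t0 = datetime(2015, 3, 6, 9, 0)
plays = [
    (t0, "rock"),
    (t0 + timedelta(minutes=4), "rock"),
    (t0 + timedelta(minutes=30), "jazz"),  # 26 min gap: new session
    (t0 + timedelta(minutes=33), "jazz"),
]
sessions = split_into_sessions(plays)
print(sessions)
```

Each resulting genre list can then be turned into a fixed-length vector (for example genre shares) before clustering.
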
  19. Machine Learning: Cluster user music sessions. K-Means. Per cluster classification
  20. Machine Learning: Cluster user music sessions. Per cluster logistic regression. w: weight vector. Each w_i can be interpreted as the effect of the x_i variable. x_i = genres
  21. Machine Learning: Cluster user music sessions. Clusters described by logistic regression: name of x_i at largest w_i
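
Slides 20 and 21 describe a cluster by fitting a logistic regression for "session belongs to this cluster" and reading off the genre with the largest weight. Here is a minimal sketch of that idea with plain gradient descent. The genre-share features, cluster labels, learning rate and function names are all made up for illustration and are not taken from the thesis work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, target in zip(X, y):
            pred = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            error = pred - target
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def describe_cluster(w, feature_names):
    """Name the cluster after the feature with the largest weight."""
    return feature_names[max(range(len(w)), key=lambda i: w[i])]


genres = ["rock", "jazz", "pop"]
# Each row: share of a session's plays per genre (invented data).
X = [[0.9, 0.0, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1],
     [0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]
# 1 if the clustering step put the session in the cluster being described.
in_cluster = [1, 1, 0, 0, 0, 0]

w, b = fit_logistic(X, in_cluster)
label = describe_cluster(w, genres)
print(label)
```

One such model would be fitted per cluster, each with its own 0/1 labels.
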
  22. Machine Learning: Cluster user music sessions
  23. Machine Learning: Cluster user music sessions
  24. Machine Learning: Artist disambiguation. Cleaning up the artist pages
  25. Machine Learning: Artist disambiguation
  26. Machine Learning: Artist disambiguation. Let's listen to those tracks! Is it really the same Fredrik?
  27. Machine Learning: Artist disambiguation
  28. Machine Learning: Artist disambiguation ● Rank artists by probability of being ambiguous ● Apply clustering on each “ambiguous” artist's albums/tracks ○ Using features such as country, release year, label/licensor etc. ○ Distinct clusters could be different artists ● Nicely present this for manual curation
  29. Machine Learning: Recommender system. The discover page
  30. Machine Learning: Recommender system. Collaborative filtering
  31. Machine Learning: Recommender system. Collaborative filtering ● Build a matrix of user plays ● Compute similarity between items
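
The two collaborative filtering steps on slide 31 (build a matrix of user plays, compute similarity between items) can be sketched on a tiny matrix. Cosine similarity is assumed as the similarity measure here because the deck mentions it a few slides later; the play counts are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Rows are users, columns are tracks; entries are play counts (made up).
plays = [
    [5, 3, 0],
    [4, 0, 0],
    [0, 0, 2],
    [1, 1, 4],
]

# An item vector is one column: how often each user played that track.
item_vectors = [list(column) for column in zip(*plays)]

sim_01 = cosine(item_vectors[0], item_vectors[1])
sim_02 = cosine(item_vectors[0], item_vectors[2])
print(f"track0~track1: {sim_01:.2f}, track0~track2: {sim_02:.2f}")
```

Tracks played by the same users end up with similar columns, so track 0 comes out closer to track 1 than to track 2 in this toy data.
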
  32. Machine Learning: Recommender system. 4 million tracks x 60 million users → Pairwise similarity infeasible. Approximate the matrix with NMF
  33. Machine Learning: Recommender system. Matrix factorization (latent factor models)
  34. Machine Learning: Recommender system. Small vectors: cosine similarity and dot product are efficient
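
Slides 32 to 34 replace the huge play matrix with small latent vectors obtained by non-negative matrix factorization (NMF). The sketch below uses the classic multiplicative update rules on a toy matrix, in pure Python. It only illustrates the idea; the slides do not say which NMF algorithm or library Spotify uses, and the matrix, rank and iteration count are arbitrary.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(V, W, H):
    """Squared Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh))


def nmf(V, k, iterations=200, eps=1e-9, seed=0):
    """Factor V (users x tracks) into W (users x k) and H (k x tracks)."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Toy play matrix: two groups of users with disjoint taste (made up).
V = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
W0, H0 = nmf(V, k=2, iterations=0)   # random starting point
W, H = nmf(V, k=2)                   # after 200 updates

# Score of track 0 for user 0: dot product of two length-k vectors.
track_vectors = transpose(H)
score = sum(w * h for w, h in zip(W[0], track_vectors[0]))
print(squared_error(V, W0, H0), squared_error(V, W, H), score)
```

After factorization each user and each track is a vector of length k, so scoring a candidate costs k multiplications instead of a pass over millions of matrix entries.
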
  35. Machine Learning: Recommender system. Finding recommendations: Approximate nearest neighbour (ANN), code: https://github.com/spotify/annoy. Related artists & Radio: Similar to user recommendations, more models and not all CF-based. Multiple models: Score candidates from all models, combine and rank!
  36. Machine Learning: Recommender system. I just went through this quickly; read more details on the Spotify rec sys here: Doing this on MapReduce; Comparing with Netflix; Music Rec @ MLConf 2014
  37. Final words on the future of ML at Spotify ● More content-based ML ○ Fingerprinting: Echo Nest ○ Content-based music recommendation using convolutional neural networks ● Personalize everything ○ Emails ○ Ads ○ User profiling ● ML on other parts of the product than the rec sys
  38. Summary ● Multiple data sources -> multiple angles ● Data drives decisions with A/B testing ● User analysis ○ Cluster and describe with classifier ● Artist disambiguation ○ Cluster and give to manual curators ● Recommender systems ○ Collaborative filtering
  39. ... and potentially you could help us? ● We supervise thesis workers ○ Artist disambiguation/deduplication ○ Cluster user music sessions ○ Context-based recommender systems ○ Personalized ads / Personalized emails ● We have internships! www.spotify.com/jobs
  40. Oscar Carlsson, lad@spotify.com, LinkedIn. Thank you for listening!
