Playlist Recommendations @ Spotify
Nikhil Tibrewal
@nikhil_tibrewal
Who am I?
Nikhil Tibrewal (Nick-hill)
● Data Engineer on Lambda squad (Spotify’s primary ML team)
● Graduated from Carnegie Mellon University in Dec 2013
● B.Sc. in Computer Science + additional major in Econ
● Been part of Spotify band for ~1.5 years
● Worked on a range of projects, primarily Playlist Recommendations
Spotify in numbers
● Started in 2006, 58 markets
● 75M+ active users, 20M+ paying
● 30M+ songs, 20K new per day
● 1.5+ billion playlists
● 1 TB logs per day
Recommendations so far on Spotify
● Discover tab
● Radio
● Related Artists
● Discover Weekly
● Playlist recs on “Now” Strip
[Screenshot label: For Ellie Goulding]
[Screenshots: the “Now” Strip, showing a human-curated playlist and a recommended playlist]
But…
How are playlist recs generated?
Quick Overview!
● Recommend only human-curated playlists (1000+)
○ Well-designed cover images
○ Thorough descriptions
○ Title reflects content
[Slide images: examples labeled “Good” and “Bad”]
Quick Overview!
● Recommendations pipeline: Candidate Generation
○ Generate N dimensional track vectors from collaborative filtering
○ Vectorize playlists:
■ Playlist vector derived from track vectors in playlist
○ Use Annoy to store playlist vectors in N dimensional space
Annoy (Approximate Nearest Neighbors Oh Yeah), created at Spotify: https://github.com/spotify/annoy
○ Vectorize user taste as well:
■ User vector derived from user listening history
○ User and playlist vectors in same space!
○ Query for nearest playlists to user from Annoy tree
annoyTree.getNearest(seedVector, K)
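A minimal sketch of the candidate generation steps above, under stated assumptions. The slides only say that playlist and user vectors are "derived from" track vectors, so the plain averaging here is an assumption, as are the toy 2-dimensional vectors and every name in the code. The search is an exact brute-force scan by cosine similarity, standing in for the approximate lookup that Annoy provides.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise (assumed derivation)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def get_nearest(seed_vector, playlist_vectors, k):
    """Exact top-k playlists by cosine similarity (Annoy would approximate this)."""
    ranked = sorted(
        playlist_vectors,
        key=lambda pid: cosine(seed_vector, playlist_vectors[pid]),
        reverse=True,
    )
    return ranked[:k]

# Toy track vectors, made up for illustration.
track_vectors = {
    "rock_1": [1.0, 0.1], "rock_2": [0.9, 0.2],
    "jazz_1": [0.1, 1.0], "jazz_2": [0.2, 0.9],
}
playlists = {
    "rock_classics": ["rock_1", "rock_2"],
    "late_night_jazz": ["jazz_1", "jazz_2"],
    "mixed_bag": ["rock_1", "jazz_1"],
}

# Playlists and the user end up in the same vector space as the tracks.
playlist_vectors = {
    pid: mean_vector([track_vectors[t] for t in tracks])
    for pid, tracks in playlists.items()
}
user_history = ["rock_1", "rock_2", "rock_1"]
user_vector = mean_vector([track_vectors[t] for t in user_history])

nearest = get_nearest(user_vector, playlist_vectors, 2)
print(nearest)
```

With a real index, the brute-force scan would be replaced by a query against a prebuilt Annoy tree (in Annoy's Python binding, `get_nns_by_vector`), which trades a little accuracy for much faster lookups.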
Quick Overview!
● Recommendations pipeline: Ranking Model
○ Use genre information, demographic data, and playlist popularity data to further rank recommendations
■ John: 21, USA, likes rock
■ Should get rock playlist recs that are popular in the USA and among 21-year-olds
○ Apply post-processing steps that shuffle results and add variety to avoid repetition
90% of DAUs have recs!
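The ranking step could look roughly like the sketch below. The slides name the signals (genre, demographics, popularity) but not the model, so the weighted sum, the weights, the popularity table keyed by country and age, and the per-genre cap used for variety are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    playlist_id: str
    similarity: float  # score from candidate generation
    genre: str
    popularity: dict   # hypothetical: (country, age) -> popularity in [0, 1]

def rank(candidates, country, age, liked_genres,
         w_sim=1.0, w_genre=0.5, w_pop=0.5):
    """Order candidates by a weighted sum of the three signals (assumed model)."""
    def score(c):
        genre_match = 1.0 if c.genre in liked_genres else 0.0
        pop = c.popularity.get((country, age), 0.0)
        return w_sim * c.similarity + w_genre * genre_match + w_pop * pop
    return sorted(candidates, key=score, reverse=True)

def diversify(ranked, max_per_genre):
    """Post-processing: keep at most max_per_genre playlists of any one genre."""
    kept, per_genre = [], {}
    for c in ranked:
        if per_genre.get(c.genre, 0) < max_per_genre:
            kept.append(c)
            per_genre[c.genre] = per_genre.get(c.genre, 0) + 1
    return kept

# John: 21, USA, likes rock.
candidates = [
    Candidate("rock_usa", 0.80, "rock", {("US", 21): 0.9}),
    Candidate("rock_se", 0.85, "rock", {("SE", 21): 0.9}),
    Candidate("jazz_usa", 0.82, "jazz", {("US", 21): 0.4}),
    Candidate("rock_niche", 0.70, "rock", {("US", 21): 0.1}),
]
ranked = rank(candidates, "US", 21, {"rock"})
final = [c.playlist_id for c in diversify(ranked, max_per_genre=2)]
print(final)
```

The rock playlist that is popular with 21-year-olds in the USA rises to the top even though another candidate is slightly closer in vector space, and the genre cap lets a non-rock playlist through for variety.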
Quick Overview!
● Infrastructure
○ Luigi to manage workflow (also built at Spotify)
○ Entire pipeline written in Scalding
○ 1200+ node Hadoop cluster to run jobs
○ Cassandra (~dozen nodes for playlist recs)
○ Java backend micro-services serving recs
Quick Overview!
"Scalding is comprised of a DSL (domain-specific language) that makes MapReduce computations look like Scala’s collection API and is a wrapper for Cascading to make it easy to define jobs, test and data sources on an HDFS" (http://cascading.io/customer/twitter/)
Scalding w.r.t. Playlist Recs
● Used Python back in the day
○ Inputs and outputs were tab separated
○ Complexity UP => Difficulty to maintain UP
○ Hard to write tests
● Scalding provided compile time error checks
○ Catch errors early
○ Define schemas (e.g. Avro)
● Can use Parquet + Avro for input/output
○ Easy to write and read data
○ Records with a lot of fields!
○ Lesson: Parquet hurts performance w/ fat columns (nested data structs)
Scalding w.r.t. Playlist Recs
● Data quality
○ Hadoop counter wrappers in extended Scalding library code
○ Verify counters within reasonable ranges
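The counter check described above might be sketched as follows. This is plain Python rather than the extended Scalding library the slides refer to, and the counter names and ranges are invented for illustration.

```python
def check_counters(counters, expected_ranges):
    """Return a list of violations; an empty list means every counter is in range."""
    problems = []
    for name, (low, high) in expected_ranges.items():
        value = counters.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems

# Hypothetical counters collected at the end of a job run.
job_counters = {
    "playlists_read": 1200,
    "playlists_without_tracks": 3,
    "users_scored": 0,
}
# Hypothetical "reasonable ranges" for those counters.
expected = {
    "playlists_read": (1000, 5000),
    "playlists_without_tracks": (0, 10),
    "users_scored": (1, 10**9),
}

violations = check_counters(job_counters, expected)
print(violations)
```

A pipeline could fail the run, or hold back publishing the output, whenever the list of violations is non-empty.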
Scalding w.r.t. Playlist Recs
● Pipeline tolerance
○ Job failures are normal, and annoying with big jobs
○ Scalding checkpoints
○ Lesson: checkpoint itself is a map-reduce job and has the same caveats
○ Still very helpful!
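The idea behind checkpointing can be illustrated with a small local sketch: persist the output of an expensive step, and on a rerun read it back instead of recomputing. This is not Scalding's checkpoint API; the helper name, the JSON file format, and the paths are assumptions for the example.

```python
import json
import os
import tempfile

def checkpoint(path, compute):
    """Return the saved result at path if it exists, else compute and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    # Rename into place so a crash mid-write never leaves a partial checkpoint.
    os.replace(tmp_path, path)
    return result

calls = []

def expensive_step():
    """Stand-in for a long-running pipeline stage."""
    calls.append(1)
    return {"rock_classics": [0.95, 0.15]}

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "playlist_vectors.json")

first = checkpoint(path, expensive_step)   # computes and writes
second = checkpoint(path, expensive_step)  # reads the saved copy
print(len(calls))
```

Writing the checkpoint is extra work that can itself be slow or fail, which mirrors the lesson on the slide that a checkpoint is a job with the same caveats as any other.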
Scalding w.r.t. Playlist Recs
● Job runtimes
○ Common solutions: more reducers and code optimizations
○ Speculative execution for larger jobs
○ Caveat: can take up unnecessary resources
Scalding w.r.t. Playlist Recs
● Memory issues
○ Used Sparkey indices in Python (developed at Spotify, now open source)
■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts”
■ Replicated to all mappers
○ Complex jobs in Scalding => higher memory config for jobs with Sparkey
○ Lesson: trade memory resources for MAYBE a little more time with joins
bigPipe.join(exSparkeyPipe)
https://github.com/spotify/sparkey
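The trade-off in the lesson above can be shown with a small in-memory sketch. The first function mimics the lookup approach, where each worker holds the whole key/value index in memory; the second mimics a join, where both inputs are grouped by key and paired up, so no worker needs the full index. Function names and data are hypothetical, and neither function uses Sparkey or Scalding.

```python
from collections import defaultdict

def lookup_enrich(big_records, replicated_index):
    """Lookup approach: the entire index must fit in each worker's memory."""
    return [
        (key, value, replicated_index[key])
        for key, value in big_records
        if key in replicated_index
    ]

def join_enrich(big_records, small_records):
    """Join approach: group both inputs by key, then pair values per key."""
    groups = defaultdict(lambda: ([], []))
    for key, value in big_records:
        groups[key][0].append(value)
    for key, extra in small_records:
        groups[key][1].append(extra)
    out = []
    for key in sorted(groups):
        left, right = groups[key]
        out.extend((key, value, extra) for value in left for extra in right)
    return out

# Hypothetical data: (track, user) plays, and a track -> genre index.
plays = [("t1", "u1"), ("t2", "u1"), ("t1", "u2"), ("t9", "u3")]
track_genre = {"t1": "rock", "t2": "jazz"}

via_lookup = lookup_enrich(plays, track_genre)
via_join = join_enrich(plays, list(track_genre.items()))
print(sorted(via_lookup) == sorted(via_join))
```

Both produce the same enriched records. The join pays for grouping and shuffling the data, which is the "MAYBE a little more time" on the slide, in exchange for not holding the index in memory on every mapper.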
Scalding w.r.t. Playlist Recs
● Driven
○ “A sophisticated tool that collects telemetry data from running Scalding / Cascading jobs on a cluster and presenting them in an intriguing User Interface.”
○ http://cascading.io/
Scalding w.r.t. Playlist Recs
● Other awesome benefits
○ Active community + big players
○ Data pipeline flows naturally follow the functional paradigm - essentially writing Scala code
Scalding w.r.t. Playlist Recs
Productivity without sacrificing performance!
Status: Completed
Spotify is hiring!
Nikhil Tibrewal
@nikhil_tibrewal
