Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Scala Data
Pipelines @
Spotify
Neville Li
@sinisa_lyh
Who am I?
‣ SpotifyNYCsince2011
‣ FormerlyYahoo!Search
‣ Musicrecommendations
‣ Datainfrastructure
‣ Scalasince2013
Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 bill...
Music recommendation @ Spotify
• Discover Weekly
• Radio
• RelatedArtists
• Discover Page
Recommendation systems
A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
CombineFn<K,V> reduceFn)
Crunch: CombineFns ar...
Monoid!
enables map side reduce
Actually it’s a semigroup
One more teaser
Linear equation inAlternate Least Square (ALS) Matrix factorization
xu = (YTY + YT(Cu − I)Y)−1YTCup(u)
vec...
Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java ex...
First 10 months
……
Activity over time
Guess how many jobs
written by yours truly?
Performance vs. Agility
https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Let’sdiveinto
something
technical
To join or not to join?
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)]...
Hash join
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track...
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, ...
CoGroup
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: TypedPipe[(String, String)] = _ // (track, ...
Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = _ // (gid, user)
val tgp: SparkeyManager = ...
Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers

e.g. mapGroup, mapValueStrea...
Distributed cache
• Fasterwith off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere withYAR...
Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
Notenough
math?
Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similartracks
• 40 dimension feature vectors for...
ANNOY - cheat by approximation
• Approximate Nearest Neighbor OhYeah
• Random projections and binarytree search
• Build in...
ANN Benchmark
https://github.com/erikbern/ann-benchmarks
Filtering candidates
• Users don’t like seeing artist/album/tracks they already know
• But may forget what they listened l...
Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSVof artist/album/track ids
• Bloom filters
Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score’...
Bloom filters
• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negative
...
Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capa...
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users hav...
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users hav...
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users hav...
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users hav...
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users hav...
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smal...
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smal...
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Monoid! enables map side reduce
Prochain SlideShare
Chargement dans…5
×

Monoid! enables map side reduce Scala Data Pipelines @ Spotify

35 013 vues

Publié le

Monoid!
enables map side reduce
Actually it’s a semigroup

Publié dans : Logiciels
  • Nice !! Download 100 % Free Ebooks, PPts, Study Notes, Novels, etc @ https://www.ThesisScientist.com
       Répondre 
    Voulez-vous vraiment ?  Oui  Non
    Votre message apparaîtra ici
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Répondre 
    Voulez-vous vraiment ?  Oui  Non
    Votre message apparaîtra ici
  • great presentation, awesome success story for Scala + Spark + judicious use of algebraic structures .. Functional Programming FTW ..
       Répondre 
    Voulez-vous vraiment ?  Oui  Non
    Votre message apparaîtra ici
  • Great presentation - just caught it on Daily Muse!
       Répondre 
    Voulez-vous vraiment ?  Oui  Non
    Votre message apparaîtra ici

×