
Data at Spotify

Data infrastructure at Spotify


  1. 1. Data at Spotify •  Danielle Jabin, Data Engineer, A/B Testing •  June 12, 2014
  2. 2. I’m Danielle Jabin •  Data Engineer in the Stockholm office •  A/B testing infrastructure •  California born & raised •  If I can survive a Swedish winter, so can you! •  Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
  3. 3. Over 40 million active users (as of June 9, 2014)
  4. 4. Access to more than 20 million songs (as of June 9, 2014)
  5. 5. Big Data •  40 million Monthly Active Users •  20+ million tracks •  1.5 TB of compressed data from users per day •  64 TB of data generated in Hadoop each day (including replication factor of 3) As of June 9, 2014  
  6. 6. So how much data is that?
  7. 7. Let’s compare: 64 TB •  293,203,072 books (200 pages or 240,000 characters) •  16,777,216 MP3 files (with 4MB average file size) •  22,369,600 images (with 3MB average file size)
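The comparison on slide 7 can be reproduced with a little arithmetic. The sketch below assumes binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes), one byte per character, and whole items counted per terabyte before multiplying by 64. These assumptions are inferred from the numbers on the slide, not stated on it, and the helper name `items_in` is made up for illustration.

```python
TB = 2 ** 40  # bytes per terabyte (binary units, an assumption)
MB = 2 ** 20  # bytes per megabyte (binary units, an assumption)


def items_in(total_tb, item_bytes):
    """Count whole items that fit in each terabyte, then scale to total_tb."""
    per_tb = TB // item_bytes
    return per_tb * total_tb


books = items_in(64, 240_000)   # 240,000 characters at one byte each
mp3s = items_in(64, 4 * MB)     # 4 MB average file size
images = items_in(64, 3 * MB)   # 3 MB average file size

# Slide 5 says the 64 TB includes a replication factor of 3,
# so the unreplicated volume is roughly a third of that.
unreplicated_tb = round(64 / 3, 1)

print(f"{books:,} books, {mp3s:,} MP3 files, {images:,} images")
print(f"about {unreplicated_tb} TB per day before replication")
```

Under these assumptions the three counts match the figures on the slide.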
  8. 8. That’s a lot of selfies
  9. 9. How do we use this data?
  10. 10. Use Cases •  Reporting •  Business Analytics •  Operational Analytics •  Product Features
  11. 11. Reporting •  Reporting to labels, licensors, partners, and advertisers •  We support our partners
  12. 12. Business Analytics •  Analyzing growth, user behavior, sign-up funnels, etc. •  Company KPIs •  NPS analysis
  13. 13. Operational Metrics •  Root cause analysis •  Latency analysis •  Better capacity planning (servers, people, bandwidth)
  14. 14. Product Features •  Discover and Radio •  Top lists •  Personalized recommendations •  A/B Testing
  15. 15. How do we collect this data?
  16. 16. The three pillars of our Data Infrastructure: •  Kafka: Collection •  Hadoop: Processing •  Databases: Analytics/Visualization
  17. 17. This is Dave. Data Engineer at Spotify by day…
  18. 18. …chiptune DJ Demoscene Time Machine by night.
  19. 19. Let’s listen to Dave’s song
  20. 20. Kafka •  High volume pub-sub system •  “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
  21. 21. Kafka •  Robust and scalable solution for collection of logs •  Fast data transfer •  Low CPU overhead •  Built-in partitioning, replication, and fault-tolerance •  Consumers can pull data at different rates •  Able to handle extremely high volumes
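The publish/subscribe model from the two Kafka slides, including consumers pulling at different rates, can be illustrated with a toy in-memory broker. This is only a sketch of the idea: `ToyBroker`, its methods, and the topic and consumer names are hypothetical and are not Kafka's API, and it has no partitioning, replication, or fault tolerance.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a pub-sub log (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic -> append-only message list
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self._topics[topic].append(message)

    def poll(self, consumer, topic, max_messages=1):
        """Consumer side: pull up to max_messages from this consumer's own offset."""
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch


broker = ToyBroker()
for track in ["song-a", "song-b", "song-c"]:
    broker.publish("plays", track)

# Two consumers read the same topic independently and at different rates.
fast = broker.poll("reporting", "plays", max_messages=3)
slow = broker.poll("recommendations", "plays", max_messages=1)
print(fast)
print(slow)
```

Because each consumer keeps its own offset, a slow consumer does not hold back a fast one, and it can catch up later by polling again.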
  22. 22. Other people listened too!
  23. 23. Hadoop •  Process and store massive amounts of unstructured data across a distributed cluster •  One cluster, grown from 37 nodes to 690 nodes today •  28 PB of storage •  The largest Hadoop cluster in Europe
  24. 24. Hadoop •  Entering the land of optimizations •  Data retention policy •  Move to JVM-based languages •  MapReduce languages •  Moving to Crunch, JVM-based, for speed and scalability •  Python with Hadoop Streaming, Java, Hive, Pig, Scala •  Sprunch: Crunch wrapper for Scala, open sourced by Spotify •  Luigi: scheduler written in Python, open sourced by Spotify •  Simple and easy way to chain jobs
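As an illustration of the "Python with Hadoop Streaming" bullet, here is a minimal map/reduce sketch that counts plays per track. The log format (`user_id<TAB>track_id`) and all names are assumptions made for the example, not Spotify's actual schema. The sort-by-key step that Hadoop performs between the map and reduce phases is simulated locally with `sorted()`.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'track_id<TAB>1' for each well-formed play event line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed lines
        _user_id, track_id = fields
        yield f"{track_id}\t1"


def reducer(sorted_lines):
    """Sum counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for track_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield track_id, sum(int(count) for _, count in group)


# Hypothetical play log; the last line is deliberately malformed.
log = ["u1\tsong-a", "u2\tsong-b", "u3\tsong-a", "malformed line"]

shuffled = sorted(mapper(log))    # local stand-in for Hadoop's shuffle and sort
counts = dict(reducer(shuffled))
print(counts)
```

In an actual streaming job the mapper and reducer would typically be separate scripts that read from standard input and write to standard output; they are plain functions here so the sketch runs on its own.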
  25. 25. What if we want to know more?
  26. 26. Databases •  Aggregates from Hadoop put into PostgreSQL or Cassandra •  Sqoop •  Core data can be used and manipulated for various needs •  Ad hoc queries •  Dashboards
  27. 27. Databases •  Aggregates from Hadoop put into PostgreSQL or Cassandra •  Sqoop •  Ad hoc queries •  Dashboards
  28. 28. Databases •  Aggregates from Hadoop put into PostgreSQL or Cassandra •  Sqoop •  Ad hoc queries •  Dashboards
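The Databases slides describe loading aggregates from Hadoop into a relational store and then running ad hoc queries on them. The sketch below shows that pattern with Python's built-in `sqlite3` as a stand-in for PostgreSQL, so it runs without a server. The table name, columns, and figures are invented for illustration.

```python
import sqlite3

# Hypothetical daily aggregates, shaped like the output of a batch job.
aggregates = [
    ("2014-06-08", "SE", 120),
    ("2014-06-08", "US", 300),
    ("2014-06-09", "SE", 150),
    ("2014-06-09", "US", 310),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for PostgreSQL
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", aggregates)

# Ad hoc query: total plays per country across the loaded days.
totals = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
print(totals)
conn.close()
```

Queries against small pre-aggregated tables like this return quickly, which is what makes them a reasonable basis for dashboards.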
  29. 29. Questions?
  30. 30. A/B testing questions? Find me!
  31. 31. Thank you!
