Shortening the
Feedback Loop
How Spotify’s Big Data Ecosystem Has
Evolved to Leverage Actionable Insights
Josh Baer (jbx@spotify.com)
Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously, building out Spotify’s 2500 node
Hadoop cluster
@l_phant
• Spotify Launches
• Access to a gigantic catalog
of music
• Click to play instantaneous!
In 2008
Behind the Scenes:
Days to Insights
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
DAYS TO
INSIGHTS
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
To leverage actionable
insights, we need a
faster feedback loop!
• Music Streaming Service
• Launched in 2008
• Premium and Free Tiers
• Available in 60 Countries
What is Spotify?
Over 100 Million Active Users
Over 30 Million Songs
Over 1 Billion Plays Per Day
And we have Data
Hadoop at Spotify
• ~2,500 Nodes
• >100 PB Capacity
• >100 TB Memory accessible by jobs
• 20K Jobs/Day
Apache Kafka at Spotify
• 500 Kafka-related machines
• 40TB/day from logs
Real-Time at Spotify
• Storm Topologies fed via Kafka
• Mostly used for hack ideas or proofs of concept
Migrating to
the Cloud
In the Beginning…
• Spotify was almost completely on-premises/bare metal
• Grew to 2,500 node Hadoop cluster and over 10K
total machines in production at four globally
distributed data centers
• “Flirted” with cloud providers at various times
In 2014
• Maybe we should try this cloud thing for real
Why Move to the Cloud?
• Cloud providers have matured, decreasing in cost
and increasing in reliability and variety of services
offered
• Owning and operating physical machines is not a
competitive advantage for Spotify
Why Google’s Cloud?
• We believe Google’s industry-leading background
in Big Data technologies will give us a data
processing advantage
Google
Cloud Data
Building Blocks
BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
• Dremel (query execution engine)
• Colossus (distributed storage)
• Borg (distributed compute)
• Jupiter (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
BigQuery vs. Hive
• Example Queries:
• What are the top 10 songs by popularity in Spain
during October 2016?
• How many hours did users in Spain spend
listening to Spotify during October?
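The two example questions can be sketched as SQL. The snippet below runs them against an in-memory SQLite database purely so the example is runnable; the `plays` table, its columns, and the toy rows are hypothetical and not Spotify's schema, and "popularity" is assumed here to mean play count.

```python
import sqlite3

# Hypothetical schema: one row per play event (not Spotify's real tables).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (track TEXT, country TEXT, played_on TEXT, ms_played INTEGER)"
)
rows = [
    ("Safari", "ES", "2016-10-01", 200000),
    ("Safari", "ES", "2016-10-02", 180000),
    ("Closer", "ES", "2016-10-02", 240000),
    ("Closer", "SE", "2016-10-03", 240000),  # other country, filtered out
    ("Safari", "ES", "2016-11-01", 200000),  # other month, filtered out
]
conn.executemany("INSERT INTO plays VALUES (?, ?, ?, ?)", rows)

# Shared filter: plays in Spain during October 2016.
SPAIN_OCTOBER = (
    "country = 'ES' AND played_on >= '2016-10-01' AND played_on < '2016-11-01'"
)

# Question 1: top 10 tracks, with popularity assumed to be play count.
top_tracks = conn.execute(
    f"SELECT track, COUNT(*) AS play_count FROM plays WHERE {SPAIN_OCTOBER} "
    "GROUP BY track ORDER BY play_count DESC, track LIMIT 10"
).fetchall()

# Question 2: total listening time, converted from milliseconds to hours.
hours_listened = conn.execute(
    f"SELECT SUM(ms_played) / 3600000.0 FROM plays WHERE {SPAIN_OCTOBER}"
).fetchone()[0]

print(top_tracks)
print(hours_listened)
```

Both questions reduce to a filter plus one aggregation, which is the shape of ad-hoc query these slides compare across Hive and BigQuery.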
BigQuery vs. Hive
• What are the top 10 songs by popularity in Spain during October 2016?
• Hive
• 2647s (44min, 7sec)
• 15.5TB processed
• BigQuery
• 108s (1min, 48sec)
• 1.50TB processed
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
Top 10 Tracks in Spain during October 2016
Rank  Artist(s)                                  Track Name
1     J Balvin                                   Safari
2     DJ Snake                                   Let Me Love You
3     Ricky Martin                               Vente Pa' Ca
4     Sebastian Yatra                            Traicionera
5     Zion & Lennox (feat. J Balvin)             Otra Vez
6     Carlos Vives, Shakira                      La Bicicleta
7     The Chainsmokers                           Closer
8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
9     Sia                                        The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios
BigQuery vs. Hive
• How much time did users in Spain spend listening to Spotify during October?
• Hive
• 969s (16min, 9sec)
• 15.5TB processed
• BigQuery
• 33s
• 780 GB processed
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
Nearly 10,000 Years!
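To put the two comparison slides side by side, the ratios can be worked out directly from the quoted figures. This is only arithmetic on the numbers shown above; it assumes 780 GB = 0.780 TB (decimal units), and the slides' own caveat still applies: Hive was unoptimized and this is not a thorough benchmark.

```python
# Figures copied from the two BigQuery vs. Hive slides.
benchmarks = {
    "top 10 songs": {"hive_s": 2647, "bq_s": 108, "hive_tb": 15.5, "bq_tb": 1.50},
    # Assumption: 780 GB expressed as 0.780 TB (decimal units).
    "listening time": {"hive_s": 969, "bq_s": 33, "hive_tb": 15.5, "bq_tb": 0.780},
}


def ratios(b):
    """Return (wall-clock speedup, reduction in data processed) for one query."""
    return b["hive_s"] / b["bq_s"], b["hive_tb"] / b["bq_tb"]


for name, b in benchmarks.items():
    speedup, scan_reduction = ratios(b)
    print(f"{name}: {speedup:.1f}x faster, {scan_reduction:.1f}x less data processed")
```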
BigQuery at Spotify
• Interactive and ad-hoc querying immediately
started to transfer to BQ once the data was
available on the cloud
• Pace of learning increases as friction to question
decreases
Cloud Pub/Sub
• At-least-once, globally distributed message queue
• For high-volume publish/subscribe workloads with a
low topic count (<10,000)
• Like Kafka, but without needing to operate servers
and supporting services (ZooKeeper)
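At-least-once delivery means a subscriber may see the same message more than once, so consumers generally need to be idempotent. The sketch below illustrates that idea in plain Python; it does not use the Cloud Pub/Sub client library, and `Message`, `message_id`, and `IdempotentConsumer` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: str  # hypothetical unique ID attached to each message
    payload: str


@dataclass
class IdempotentConsumer:
    seen: set = field(default_factory=set)
    processed: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        """Process msg once; return False if it was a redelivery."""
        if msg.message_id in self.seen:
            return False
        self.seen.add(msg.message_id)
        self.processed.append(msg.payload)
        return True


consumer = IdempotentConsumer()
# Message "a" is delivered twice, as at-least-once delivery permits.
deliveries = [Message("a", "play"), Message("b", "skip"), Message("a", "play")]
results = [consumer.handle(m) for m in deliveries]
print(results, consumer.processed)
```

The unbounded `seen` set is a simplification; a long-running consumer would need to expire old IDs or rely on a downstream system that deduplicates.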
Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 Latency of ingestions into ES: 500ms
• Ingestion from globally distributed non-GCP
datacenters is painless
• Managed Service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and
MillWheel
• Programming model open-sourced as Apache Beam
(currently incubating)
Cloud Dataflow
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs
and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being
written in Scio/Dataflow
Cloud Dataflow (Batch) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and
Storm workloads at Spotify
• Optimizes for consistency, which can complicate
real-time workloads
Cloud Dataflow (Streaming) at Spotify
Spotify + Google Cloud Timeline
2015 2016
Beginning of Google
Cloud evaluation
BigQuery begins
to replace Hive
Cloud Pub/Sub begins
to replace Kafka
Dataflow (streaming)
begins to replace Storm
Dataflow (batch)
replacing MapReduce
Note: Dates are approximations
Putting It All
Together
The Problem
• We want to detect within minutes if we’ve
introduced a bug in a client release that affects
important event logging behavior
Before…
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
DAYS TO
INSIGHTS
Getting Data from Clients to Pub/Sub
• Built Pulsar, a simple service aggregating data from
Access Points and feeding it into Cloud Pub/Sub
• Replaces the Kafka real-time event feed
Pulsar
Dataflow
• Subscribes to important event Pub/Sub topics
• Aggregates events into minute windows
• Always running, no need to schedule or wait for
results
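Aggregating events into minute windows can be sketched as a fixed (tumbling) window count. The example below is plain Python over a finite list, not the Dataflow or Beam API; the `(epoch_seconds, event_type)` event shape is an assumption, and a real streaming job also has to deal with late and out-of-order data, which this sketch ignores.

```python
from collections import Counter

WINDOW_SECONDS = 60


def window_start(ts: int) -> int:
    """Map an epoch-seconds timestamp to the start of its one-minute window."""
    return ts - (ts % WINDOW_SECONDS)


def count_per_minute(events):
    """Count events per (window start, event type).

    events: iterable of (epoch_seconds, event_type) tuples (hypothetical shape).
    """
    counts = Counter()
    for ts, event_type in events:
        counts[(window_start(ts), event_type)] += 1
    return counts


events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = count_per_minute(events)
for key in sorted(counts):
    print(key, counts[key])
```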
BigQuery
• Receives aggregates from Dataflow
• Allows for ad-hoc inspection or slicing on different
dimensions
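One way to picture slicing the per-minute aggregates on a dimension is to group them by client version and compare event types. The sketch below is illustrative only: the row layout, the `play` and `play_end` event names, and the 0.9 threshold are assumptions, not details from the talk.

```python
from collections import defaultdict

# Hypothetical aggregate rows: (minute, client_version, event_type, count).
rows = [
    (0, "1.0", "play", 1000), (0, "1.0", "play_end", 990),
    (0, "1.1", "play", 500), (0, "1.1", "play_end", 120),
]


def ratio_by_version(rows, numerator="play_end", denominator="play"):
    """Ratio of numerator events to denominator events per client version."""
    totals = defaultdict(lambda: defaultdict(int))
    for _minute, version, event_type, count in rows:
        totals[version][event_type] += count
    return {
        version: t[numerator] / t[denominator]
        for version, t in totals.items()
        if t[denominator] > 0
    }


def suspicious_versions(rows, min_ratio=0.9):
    """Versions whose ratio falls below an (arbitrary) threshold."""
    return sorted(v for v, r in ratio_by_version(rows).items() if r < min_ratio)


print(ratio_by_version(rows))
print(suspicious_versions(rows))
```

In this toy data, version 1.1 logs far fewer `play_end` events per `play` than version 1.0, which is the kind of logging regression the pipeline is meant to surface within minutes.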
Tableau
• Data Visualization Tool that integrates nicely with
BigQuery
• Pulls data from BigQuery periodically and caches for
quick inspection
Milliseconds
to transfer
Milliseconds
to process
Seconds to
Query
SECONDS TO
INSIGHTS
Faster Insights
to Client
Behavior
Problem
As a developer, I want to be able to instantly explore
data being logged by the clients.
Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
Benefits
• Able to understand what’s being sent by the client
as it happens
• Exploring events, visualizing distribution (e.g. does
this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
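The "does this field actually get populated" check can be sketched as a fill-rate calculation over a sample of events. This is plain Python over dictionaries rather than an Elasticsearch or Kibana query; the event shape and field names are hypothetical.

```python
from collections import Counter


def fill_rates(events, fields):
    """Fraction of events in which each field is present and non-empty."""
    filled = Counter()
    for event in events:
        for name in fields:
            if event.get(name) not in (None, ""):
                filled[name] += 1
    total = len(events)
    return {name: (filled[name] / total if total else 0.0) for name in fields}


# Hypothetical sample of client events.
sample = [
    {"event": "play", "track_id": "t1", "device": "ios"},
    {"event": "play", "track_id": "t2", "device": ""},       # empty device
    {"event": "play", "track_id": "t3"},                     # device missing
    {"event": "play", "track_id": None, "device": "android"},
]
rates = fill_rates(sample, ["track_id", "device"])
print(rates)
```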
Other Uses
Ad Targeting
• Real-time genre targeting
• Session insights — explicit filter
Real-time Recommendations
Live Results for X-Factor
• X-Factor: music competition
• Songs available on Spotify
immediately after show airs
• Listener behavior determines the
order of contestants on the playlist
Review
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
DAYS TO
INSIGHTS
To leverage actionable
insights, we need a
faster feedback loop!
Putting it all together
Milliseconds
to transfer
Milliseconds
to process
Seconds to
Query
SECONDS TO
INSIGHTS
The Value of a Fast Feedback Loop
• Detecting problems early in data avoids long backfills or
long term data loss
• Instant insights on newly developed features allow
teams to iterate quicker and take risks
• Providing a quicker ad-hoc querying engine allows teams
to ask more questions and learn faster
Use Anything and Everything
• Spotify has leveraged Google Cloud tools, such as
Pub/Sub, Dataflow and BigQuery
• Open source and other cloud providers offer many
alternatives to this stack
• Open source tools (Elasticsearch/Kibana) and proprietary
solutions (Tableau) have also been useful additions
Where Are We Going?
• The real-time mission is in the early stages at
Spotify
Stream Processing First
• The sun never sets on Spotify, so why impose
boundaries on our datasets?
• What’s the shortest distance between two points?
Zero!
• Can we reduce the feedback cycle to zero?
We’re Hiring!
Engineers, Managers, Product Owners
needed in NYC and Stockholm
https://www.spotify.com/jobs
Thanks, Big Data Spain!