Shortening the
Feedback Loop
How Spotify’s Big Data Ecosystem Has
Evolved to Leverage Actionable Insights
Josh Baer (jbx@spotify.com)
Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously, building out Spotify’s 2,500-node
Hadoop cluster
@l_phant
• Spotify Launches
• Access to a gigantic catalog
of music
• Click to play instantaneous!
In 2008
Behind the Scenes:
Days to Insights
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
DAYS TO
INSIGHTS
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
To leverage actionable
insights, we need a
faster feedback loop!
• Music Streaming Service
• Launched in 2008
• Premium and Free Tiers
• Available in 60 Countries
What is Spotify?
Over 100 Million Active Users
Over 30 Million Songs
Over 1 Billion Plays Per Day
And we have Data
Hadoop at Spotify
• ~2,500 Nodes
• >100 PB Capacity
• >100 TB Memory accessible by jobs
• 20K Jobs/Day
Apache Kafka at Spotify
• 500 Kafka-related machines
• 40TB/day from logs
Real-Time at Spotify
• Storm topologies fed via Kafka
• Mostly used for hack ideas or proofs of concept
Migrating to
the Cloud
In the Beginning…
• Spotify was almost completely on-premise/bare metal
• Grew to a 2,500-node Hadoop cluster and over 10K
total machines in production at four globally
distributed data centers
• “Flirted” with cloud providers at various times
In 2014
• Maybe we should try this cloud thing for real
Why Move to the Cloud?
• Cloud providers have matured, decreasing in cost
and increasing in reliability and in the variety of
services offered
• Owning and operating physical machines is not a
competitive advantage for Spotify
Why Google’s Cloud?
• We believe Google’s industry-leading background
in Big Data technologies will give us a data
processing advantage
Google
Cloud Data
Building Blocks
BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
• Dremel (query execution engine)
• Colossus (distributed storage)
• Borg (distributed compute)
• Jupiter (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
BigQuery vs. Hive
• Example Queries:
• What are the top 10 songs by popularity in Spain
during October 2016?
• How many hours did users in Spain spend
listening to Spotify during October?
BigQuery vs. Hive
• What are the top 10 songs by popularity in Spain during October 2016?
• Hive
• 2647s (44min, 7sec)
• 15.5TB processed
• BigQuery
• 108s (1min, 48sec)
• 1.50TB processed
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark
Top 10 Tracks in Spain during October 2016
Rank | Artist(s) | Track Name
1 | J Balvin | Safari
2 | DJ Snake | Let Me Love You
3 | Ricky Martin | Vente Pa' Ca
4 | Sebastian Yatra | Traicionera
5 | Zion & Lennox (feat. J Balvin) | Otra Vez
6 | Carlos Vives, Shakira | La Bicicleta
7 | The Chainsmokers | Closer
8 | Major Lazer (feat. Justin Bieber & MØ) | Cold Water
9 | Sia | The Greatest
10 | IAmChino (feat. Pitbull, Yandel & Chacal) | Ay Mi Dios
BigQuery vs. Hive
• How much time did users in Spain spend listening to Spotify during October?
• Hive
• 969s (16min, 9 sec)
• 15.5TB processed
• BigQuery
• 33s
• 780 GB processed
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark
Nearly 10,000 Years!
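The two Hive vs. BigQuery comparisons above can be reduced to rough ratios. This is a minimal sketch that uses only the figures quoted on the slides; the dictionary keys and the `ratio` helper are illustrative names, 780 GB is taken as 0.780 TB, and, as the slides note, the numbers are not a thorough benchmark.

```python
# Figures as quoted on the slides (Hive unoptimized; not a thorough benchmark).
queries = {
    "top_10_songs_spain": {"hive_s": 2647, "bq_s": 108, "hive_tb": 15.5, "bq_tb": 1.50},
    "listening_time_spain": {"hive_s": 969, "bq_s": 33, "hive_tb": 15.5, "bq_tb": 0.780},
}


def ratio(hive_value, bigquery_value):
    """How many times larger the Hive figure is than the BigQuery figure."""
    return hive_value / bigquery_value


# Wall-clock speedup and reduction in data scanned, rounded to one decimal.
time_speedup = {
    name: round(ratio(q["hive_s"], q["bq_s"]), 1) for name, q in queries.items()
}
data_reduction = {
    name: round(ratio(q["hive_tb"], q["bq_tb"]), 1) for name, q in queries.items()
}

for name in queries:
    print(f"{name}: {time_speedup[name]}x faster, {data_reduction[name]}x less data scanned")
```

On these figures BigQuery finished roughly 25 to 30 times sooner while scanning roughly 10 to 20 times less data.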
BigQuery at Spotify
• Interactive and ad-hoc querying immediately
started to transfer to BigQuery once the data was
available on the cloud
• Pace of learning increases as friction to question
decreases
Cloud Pub/Sub
• At-least-once, globally distributed message queue
• For high-volume publish/subscribe workloads with a
low topic count (<10,000)
• Like Kafka, but without needing to operate servers
and supporting services (ZooKeeper)
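At-least-once delivery means a subscriber may see the same message more than once, so consumers are usually written to be idempotent. The sketch below illustrates that idea in plain Python; it does not use the Cloud Pub/Sub client library, and `Message`, `DedupingConsumer` and the sample payloads are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape: an ID assigned by the queue plus a payload.
    message_id: str
    payload: str


class DedupingConsumer:
    """Applies each message_id at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self._seen = set()   # IDs already handled (unbounded here for simplicity)
        self.processed = []  # payloads that were actually applied

    def handle(self, msg):
        if msg.message_id in self._seen:
            return False  # duplicate delivery: skip it
        self._seen.add(msg.message_id)
        self.processed.append(msg.payload)
        return True


# The same message ("m1") is delivered twice, as at-least-once delivery allows.
deliveries = [Message("m1", "play"), Message("m2", "pause"), Message("m1", "play")]
consumer = DedupingConsumer()
results = [consumer.handle(m) for m in deliveries]
print(results, consumer.processed)
```

A real consumer would likely bound the set of seen IDs, for example with a time-based expiry, rather than keep it forever.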
Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 Latency of ingestions into ES: 500ms
• Ingestion from globally distributed non-GCP
datacenters is painless
• Managed Service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and
MillWheel
• Programming model open-sourced as Apache Beam
(currently incubating)
Cloud Dataflow
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs
and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being
written in Scio/Dataflow
Cloud Dataflow (Batch) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and
Storm workloads at Spotify
• Optimizes for consistency, which can complicate
real-time workloads
Cloud Dataflow (Streaming) at Spotify
Spotify + Google Cloud Timeline
2015 2016
Beginning of Google
Cloud evaluation
BigQuery begins
to replace Hive
Cloud Pub/Sub begins
to replace Kafka
Dataflow (streaming)
begins to replace Storm
Dataflow (batch)
replacing Map/Reduce
Note: Dates are approximations
Putting It All
Together
The Problem
• We want to detect within minutes if we’ve
introduced a bug in a client release that affects
important event logging behavior
Before…
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
DAYS TO
INSIGHTS
Getting Data from Clients to Pub/Sub
• Built Pulsar, a simple service aggregating data from
Access Points and feeding it into Cloud Pub/Sub
• Replaces the Kafka real-time event feed
Pulsar
Dataflow
• Subscribes to important event Pub/Sub topics
• Aggregates events into minute windows
• Always running, no need to schedule or wait for
results
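Aggregating events into minute windows can be pictured as assigning each event to the start of its 60-second window and counting per window. The sketch below shows that in plain Python; it is not the Dataflow or Beam API, and the event tuples, `window_start` and `aggregate` are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 60  # fixed (tumbling) one-minute windows


def window_start(timestamp):
    """Round an event timestamp (in seconds) down to the start of its window."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def aggregate(events):
    """Count events per (window start, event type)."""
    counts = Counter()
    for timestamp, event_type in events:
        counts[(window_start(timestamp), event_type)] += 1
    return counts


# Hypothetical events: (timestamp in seconds, event type).
events = [(0, "play"), (15, "play"), (59, "skip"), (60, "play"), (125, "play")]
counts = aggregate(events)
for key in sorted(counts):
    print(key, counts[key])
```

A streaming engine additionally has to decide when a window is complete and what to do with late events, which this batch-style sketch ignores.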
BigQuery
• Receives aggregates from Dataflow
• Allows for ad-hoc inspection or slicing on different
dimensions
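Once per-minute aggregates are available, the check described under The Problem could be as simple as comparing a new client release against a known-good one. The sketch below is one possible form of that check, not Spotify's actual rule: the version numbers, the events-per-session rates and the 0.5 threshold are all made-up illustration values.

```python
def is_suspicious(reference_rate, candidate_rate, min_ratio=0.5):
    """True if the candidate logs far fewer events than the reference release."""
    if reference_rate <= 0:
        return False  # nothing meaningful to compare against
    return candidate_rate < min_ratio * reference_rate


# Hypothetical events-per-session rates for one important event type,
# keyed by client version, taken from the latest minute window.
rates = {"1.0.0": 12.0, "1.0.1": 11.5, "1.1.0": 3.0}
reference = rates["1.0.0"]  # assumed known-good release

flagged = sorted(v for v, r in rates.items() if is_suspicious(reference, r))
print(flagged)
```

Normalizing by sessions rather than using raw counts keeps a release with few users from being flagged just because it is new.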
Tableau
• Data visualization tool that integrates nicely with
BigQuery
• Pulls data from BigQuery periodically and caches for
quick inspection
Milliseconds
to transfer
Milliseconds
to process
Seconds to
Query
SECONDS TO
INSIGHTS
Faster Insights
to Client
Behavior
Problem
As a developer, I want to be able to instantly explore
data being logged by the clients.
Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
Benefits
• Able to understand what’s being sent by the client
as it happens
• Exploring events, visualizing distribution (e.g., does
this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
Other Uses
Ad Targeting
• Real-time genre targeting
• Session insights — explicit filter
Real-time Recommendations
Live Results for X-Factor
• X-Factor: music competition
• Songs available on Spotify
immediately after show airs
• Listener behavior determines the
order of contestants on the playlist
Review
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
DAYS TO
INSIGHTS
To leverage actionable
insights, we need a
faster feedback loop!
Putting it all together
Milliseconds
to transfer
Milliseconds
to process
Seconds to
Query
SECONDS TO
INSIGHTS
The Value of a Fast Feedback Loop
• Detecting problems early in data avoids long backfills or
long term data loss
• Instant insights on newly developed features allows
teams to iterate quicker and take risks
• Providing a quicker ad-hoc querying engine allows teams
to ask more questions and learn faster
Use Anything and Everything
• Spotify has leveraged Google Cloud tools, such as
Pub/Sub, Dataflow and BigQuery
• Open source and other cloud providers offer many
alternatives to this stack
• Open source tools (Elasticsearch/Kibana) and proprietary
solutions (Tableau) have also been useful additions
Where Are We Going?
• The real-time mission is in the early stages at
Spotify
Stream Processing First
• The sun never sets on Spotify, so why impose
boundaries on our datasets?
• What’s the shortest distance between two points?
Zero!
• Can we reduce the feedback cycle to zero?
We’re Hiring!
Engineers, Managers, Product Owners
needed in NYC and Stockholm
https://www.spotify.com/jobs
Thanks, Big Data Spain!
Shortening the feedback loop

More Related Content

What's hot

Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyNeville Li
 
Productive data engineer
Productive data engineerProductive data engineer
Productive data engineerRafał Wojdyła
 
Real time ads personalization @ Spotify
Real time ads personalization @ SpotifyReal time ads personalization @ Spotify
Real time ads personalization @ SpotifyKinshuk Mishra
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
 
Spotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great SuccessSpotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great SuccessNick Barkas
 
Liferay and Big Data
Liferay and Big DataLiferay and Big Data
Liferay and Big DataMiguel Pastor
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopDataWorks Summit
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about SparkGiivee The
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormJungtaek Lim
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patternshadooparchbook
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsJoshua Shinavier
 
Spark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the UglySpark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the UglySarah Guido
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiSlim Baltagi
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 

What's hot (20)

Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ Spotify
 
Spotify: Data center & Backend buildout
Spotify: Data center & Backend buildoutSpotify: Data center & Backend buildout
Spotify: Data center & Backend buildout
 
Productive data engineer
Productive data engineerProductive data engineer
Productive data engineer
 
Real time ads personalization @ Spotify
Real time ads personalization @ SpotifyReal time ads personalization @ Spotify
Real time ads personalization @ Spotify
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Spotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great SuccessSpotify: Horizontal Scalability for Great Success
Spotify: Horizontal Scalability for Great Success
 
Liferay and Big Data
Liferay and Big DataLiferay and Big Data
Liferay and Big Data
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
Ncku csie talk about Spark
Ncku csie talk about SparkNcku csie talk about Spark
Ncku csie talk about Spark
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And Storm
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBs
 
Spark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the UglySpark: The Good, the Bad, and the Ugly
Spark: The Good, the Bad, and the Ugly
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 

Similar to Shortening the feedback loop

Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Henry Saputra
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General SessionCloudera, Inc.
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesQubole
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017Rittman Analytics
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!Cloudera, Inc.
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
How and Why you can and should Participate in Open Source Projects (AMIS, Sof...
How and Why you can and should Participate in Open Source Projects (AMIS, Sof...How and Why you can and should Participate in Open Source Projects (AMIS, Sof...
How and Why you can and should Participate in Open Source Projects (AMIS, Sof...Lucas Jellema
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Kubeflow at Spotify (For the Kubeflow Summit)
Kubeflow at Spotify (For the Kubeflow Summit)Kubeflow at Spotify (For the Kubeflow Summit)
Kubeflow at Spotify (For the Kubeflow Summit)Josh Baer
 
Apache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachApache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachCalculated Systems
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersEnabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersDataWorks Summit
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018iguazio
 

Similar to Shortening the feedback loop (20)

Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Music streams
Music streamsMusic streams
Music streams
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General Session
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
How and Why you can and should Participate in Open Source Projects (AMIS, Sof...
How and Why you can and should Participate in Open Source Projects (AMIS, Sof...How and Why you can and should Participate in Open Source Projects (AMIS, Sof...
How and Why you can and should Participate in Open Source Projects (AMIS, Sof...
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Kubeflow at Spotify (For the Kubeflow Summit)
Kubeflow at Spotify (For the Kubeflow Summit)Kubeflow at Spotify (For the Kubeflow Summit)
Kubeflow at Spotify (For the Kubeflow Summit)
 
Apache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachApache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop Approach
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop ClustersEnabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Shortening the feedback loop

  • 1. Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem Has Evolved to Leverage Actionable Insights. Josh Baer (jbx@spotify.com). Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
  • 2. Who am I? • Technical Product Owner at Spotify • Working with fast processing infrastructure • Previously, building out Spotify’s 2500 node Hadoop cluster @l_phant
  • 3. In 2008 • Spotify Launches • Access to a gigantic catalog of music • Click to play instantaneous!
  • 6. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries DAYS TO INSIGHTS
  • 7. “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 8. Real-time Processing Batch Processing (Hadoop, Hive, BigQuery) “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 9. To leverage actionable insights, we need a faster feedback loop!
  • 10. What is Spotify? • Music Streaming Service • Launched in 2008 • Premium and Free Tiers • Available in 60 Countries
  • 11. Over 100 Million Active Users
  • 13. Over 1 Billion Plays Per Day
  • 14. And we have Data
  • 15. Hadoop at Spotify • ~2,500 Nodes • >100 PB Capacity • >100 TB Memory accessible by jobs • 20K Jobs/Day
  • 16. Apache Kafka at Spotify • 500 Kafka-related machines • 40TB/day from logs
  • 17. Real-Time at Spotify • Storm Topologies fed via Kafka • Mostly used for hack ideas or proofs of concept
  • 19. In the Beginning… • Spotify was almost completely on-premise/bare metal • Grew to 2,500 node Hadoop cluster and over 10K total machines in production at four globally distributed data centers • “Flirted” with cloud providers at various times
  • 20. In 2014 • Maybe we should try this cloud thing for real
  • 21. Why Move to the Cloud? • Cloud Providers have matured, decreasing in costs and increasing in reliability and variety of services offered • Owning and operating physical machines is not a competitive advantage for Spotify
  • 22. Why Google’s Cloud? • We believe Google’s industry leading background in Big Data technologies will give us a data processing advantage
  • 24. BigQuery • Ad-hoc and interactive querying service for massive datasets • Like Hive, but without needing to manage Hadoop and servers • Leverages Google’s internal tech • Dremel (query execution engine) • Colossus (distributed storage) • Borg (distributed compute) • Jupiter (network) Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
  • 25. BigQuery vs. Hive • Example Queries: • What are the top 10 songs by popularity in Spain during October 2016? • How many hours did users in Spain spend listening to Spotify during October?
  • 26. BigQuery vs. Hive • What are the top 10 songs by popularity in Spain during October 2016? • Hive • 2647s (44min, 7sec) • 15.5TB processed • BigQuery • 108s (1min, 48sec) • 1.50TB processed Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node YARN cluster. This is not considered to be a thorough benchmark
  • 27. Top 10 Tracks in Spain during October 2016
      Rank  Artist(s)                                  Track Name
      1     J Balvin                                   Safari
      2     DJ Snake                                   Let Me Love You
      3     Ricky Martin                               Vente Pa' Ca
      4     Sebastian Yatra                            Traicionera
      5     Zion & Lennox (feat. J Balvin)             Otra Vez
      6     Carlos Vives, Shakira                      La Bicicleta
      7     The Chainsmokers                           Closer
      8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
      9     Sia                                        The Greatest
      10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios
  • 28. BigQuery vs. Hive • How much time did users in Spain spend listening to Spotify during October? • Hive • 969s (16min, 9sec) • 15.5TB processed • BigQuery • 33s • 780GB processed Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2500 node YARN cluster. This is not considered to be a thorough benchmark
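
The two comparisons above can be reduced to ratios. The sketch below only does arithmetic on the figures quoted on the slides; it assumes 780 GB equals 0.780 TB (decimal units), and the dictionary keys and the `ratios` helper are names made up for this illustration. As the slides note, this is not a thorough benchmark.

```python
# Figures copied from the "BigQuery vs. Hive" slides (Hive unoptimized).
runs = {
    "top 10 songs": {"hive_s": 2647, "bq_s": 108, "hive_tb": 15.5, "bq_tb": 1.50},
    # Assumption: 780 GB is treated as 0.780 TB (decimal units).
    "listening time": {"hive_s": 969, "bq_s": 33, "hive_tb": 15.5, "bq_tb": 0.780},
}


def ratios(run):
    """Return (speedup, scan reduction) of BigQuery relative to Hive."""
    return run["hive_s"] / run["bq_s"], run["hive_tb"] / run["bq_tb"]


for name, run in runs.items():
    speedup, scan = ratios(run)
    print(f"{name}: {speedup:.1f}x faster, {scan:.1f}x less data scanned")
```

On these numbers BigQuery finished roughly 25 to 29 times sooner while scanning roughly 10 to 20 times less data.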
  • 30. BigQuery at Spotify • Interactive and ad-hoc querying immediately started to transfer to BQ once the data was available on the cloud • Pace of learning increases as friction to question decreases
  • 31. Cloud Pub/Sub • At-least-once, globally distributed message queue • For high volume, low topic count (<10,000) publish-subscribe behavior • Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
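
At-least-once delivery means a subscriber may see the same message more than once, so handlers are usually made idempotent. The sketch below is a generic, in-memory illustration of that idea and does not use the Cloud Pub/Sub client library; `IdempotentConsumer`, the message IDs and the payloads are all hypothetical.

```python
class IdempotentConsumer:
    """Applies each message once even if the queue redelivers it."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory only; a real consumer would bound or persist this set.
        self._seen = set()

    def receive(self, message_id, payload):
        """Return True if the payload was processed, False for a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        self._seen.add(message_id)
        return True


plays = []
consumer = IdempotentConsumer(plays.append)
# "m1" is delivered twice, as an at-least-once queue is allowed to do.
for msg_id, payload in [("m1", "play:a"), ("m2", "play:b"), ("m1", "play:a")]:
    consumer.receive(msg_id, payload)
print(plays)
```

The ID is recorded only after the handler returns, so a handler that raises leaves the message eligible for a retry on redelivery.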
  • 32. Cloud Pub/Sub at Spotify • 800K events/second? No problem • P99 Latency of ingestions into ES: 500ms • Ingestion from globally distributed non-GCP datacenters is painless
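
For readers unfamiliar with the P99 figure above: it is the latency that 99% of observations fall at or below. The sketch below uses the nearest-rank definition on synthetic data; how the 500ms figure on the slide was actually measured is not stated, so the method and the sample values here are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic latencies of 1..100 ms, not Spotify measurements.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 99))
```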
  • 33. Cloud Dataflow • Managed Service for running batch and streaming jobs • Unified API for batch and streaming mode • Inspired by internal Google tools like FlumeJava and MillWheel • Programming model open-sourced as Apache Beam (currently incubating)
  • 34. Cloud Dataflow (Batch) at Spotify • Usually run via Scio: https://github.com/spotify/scio • Scio provides a Scala API for running Dataflow jobs and provides easy integrations with BigQuery • New batch processing jobs at Spotify are being written in Scio/Dataflow
  • 35. Cloud Dataflow (Streaming) at Spotify • Exactly-once stream processing framework • A replacement for Spark/Flink streaming and Storm workloads at Spotify • Optimizes for consistency, which can complicate real-time workloads
  • 37. Spotify + Google Cloud Timeline 2015 2016 Beginning of Google Cloud evaluation BigQuery begins to replace Hive Cloud Pub/Sub begins to replace Kafka Dataflow (streaming) begins to replace Storm Dataflow (batch) replacing Map/Reduce Note: Dates are approximations
  • 39. The Problem • We want to detect within minutes if we’ve introduced a bug in a client release that affects important event logging behavior
  • 40. Before… Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries DAYS TO INSIGHTS
  • 41. Getting Data from Clients to Pub/Sub • Built Pulsar, a simple service aggregating data from Access Points and feeding it into Cloud Pub/Sub • Replaces the Kafka real-time event feed
  • 43. Dataflow • Subscribes to important event Pub/Sub topics • Aggregates events into minute windows • Always running, no need to schedule or wait for results
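
Aggregating into minute windows can be pictured as bucketing each event by the start of the minute it falls in. The sketch below is plain Python over a finite list, not Dataflow, Beam or Scio code, and it ignores late or out-of-order data, which a streaming engine has to handle; the event tuples and the `minute_window_counts` helper are hypothetical.

```python
from collections import Counter


def minute_window_counts(events):
    """Count events per (window_start_s, event_type) in fixed 60-second windows."""
    counts = Counter()
    for timestamp_s, event_type in events:
        window_start = int(timestamp_s // 60) * 60
        counts[(window_start, event_type)] += 1
    return counts


# Hypothetical (timestamp in seconds, event type) pairs.
events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = minute_window_counts(events)
for (window_start, event_type), n in sorted(counts.items()):
    print(window_start, event_type, n)
```

Events at 0s and 59s share the first window, while the event at 60s opens the next one.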
  • 44. BigQuery • Receives aggregates from Dataflow • Allows for ad-hoc inspection or slicing on different dimensions
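
Slicing on different dimensions amounts to re-grouping the same aggregates by different keys. The sketch below does this in memory rather than with a BigQuery query; the rows, the column names (`client_version`, `platform`, `events`) and the `slice_by` helper are invented for illustration and are not Spotify's schema.

```python
from collections import defaultdict

# Hypothetical per-minute aggregates, as the streaming job might emit them.
rows = [
    {"minute": 0, "client_version": "1.0", "platform": "ios", "events": 120},
    {"minute": 0, "client_version": "1.1", "platform": "ios", "events": 4},
    {"minute": 0, "client_version": "1.0", "platform": "android", "events": 200},
    {"minute": 0, "client_version": "1.1", "platform": "android", "events": 180},
]


def slice_by(rows, *dimensions):
    """Sum the event counts grouped by the given dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["events"]
    return dict(totals)


by_version = slice_by(rows, "client_version")
by_version_platform = slice_by(rows, "client_version", "platform")
print(by_version)
print(by_version_platform)
```

In this made-up data the per-version totals look only mildly different, while adding the platform dimension shows version 1.1 logging almost nothing on iOS, the kind of release bug the pipeline is meant to surface.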
  • 45. Tableau • Data Visualization Tool that integrates nicely with BigQuery • Pulls data from BigQuery periodically and caches for quick inspection
  • 50. Problem As a developer, I want to be able to instantly explore data being logged by the clients.
  • 51. Solution • Produce a topic for all employee client events • Store in Elasticsearch • Visualize in Kibana
  • 54. Benefits • Able to understand what’s being sent by the client as it happens • Exploring events, visualizing distribution (e.g., does this field actually get populated?) • Prototyping analysis based on a sample • Dashboards for Employee Releases
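
The "does this field actually get populated" check can be prototyped on a small sample of events. The sketch below works on plain dictionaries rather than on Elasticsearch or Kibana; the event shape, the field names and the `population_rate` helper are hypothetical.

```python
def population_rate(events, field):
    """Fraction of events where `field` is present and neither None nor empty."""
    if not events:
        return 0.0
    populated = sum(1 for e in events if e.get(field) not in (None, ""))
    return populated / len(events)


# Hypothetical sample of client events.
sample = [
    {"event": "play", "track_id": "t1", "device": "phone"},
    {"event": "play", "track_id": "t2", "device": ""},
    {"event": "play", "track_id": "t3"},
    {"event": "play", "track_id": "t4", "device": "desktop"},
]

for field in ("track_id", "device"):
    print(field, population_rate(sample, field))
```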
  • 56. Ad Targeting • Real-time genre targeting • Session insights — explicit filter
  • 58. Live Results for X-Factor • X-Factor: music competition • Songs available on Spotify immediately after show airs • Listener behavior determines the order of contestants on the playlist
  • 60. Real-time Processing Batch Processing (Hadoop, Hive, BigQuery) “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 61. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries DAYS TO INSIGHTS
  • 62. To leverage actionable insights, we need a faster feedback loop!
  • 63. Putting it all together Milliseconds to transfer Milliseconds to process Seconds to Query SECONDS TO INSIGHTS
  • 64. The Value of a Fast Feedback Loop • Detecting problems early in data avoids long backfills or long term data loss • Instant insights on newly developed features allow teams to iterate quicker and take risks • Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster
  • 65. Use Anything and Everything • Spotify has leveraged Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery • Open source and other cloud providers offer many alternatives to this stack • Open source tools (Elasticsearch/Kibana) and proprietary solutions (Tableau) have also been useful additions
  • 66. Where Are We Going? • The real-time mission is in the early stages at Spotify
  • 67. Stream Processing First • The sun never sets on Spotify, why impose boundaries on our datasets? • What’s the shortest distance between two points? Zero! • Can we reduce the feedback cycle to zero?
  • 68. We’re Hiring! Engineers, Managers, Product Owners needed in NYC and Stockholm https://www.spotify.com/jobs