Shortening the
Feedback Loop
How Spotify’s Big Data Ecosystem Has
Evolved to Produce Real-time Insights
Josh Baer (jbx@spotify.com)
Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously, building out Spotify’s 2500 node
Hadoop cluster
@l_phant
In 2008
• Spotify Launches
• Instant Access to a gigantic
catalog of music
• Click to play, instantaneously!
Behind the Scenes:
Days to Insights
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
Operational
Monitoring
To leverage actionable
insights, we need a
faster feedback loop!
What is Spotify?
• Music Streaming Service
• Launched in 2008
• Premium and Free Tiers
• Available in 59 Countries
100+ Million Active
Users
30+ Million Songs
1+ Billion Plays/Day
And we have Data
Hadoop at Spotify
• 2,500 Nodes
• 130 PB Capacity
• 120 TB Memory accessible by jobs
• 20K Jobs/Day
Apache Kafka at Spotify
• 500 Kafka-related machines
• 40TB/day from logs
“Real-Time” at Spotify
• Storm Topologies fed via Kafka
• Powering
✦ Ad Targeting
✦ Real-time recommendations
✦ Real-time stream counts
Migrating to
the Cloud
In the Beginning…
• Spotify was almost completely on-premise/bare
metal
• 2500 node Hadoop cluster, over 10K machines in
production at four globally distributed data centers
• Grew with users: from 1M in 2009 to over 100M in 2016
Why Move to the Cloud?
• Cloud Providers have matured, decreasing in costs
and increasing in reliability and variety of service
offered
• Owning and operating physical machines is not a
competitive advantage for Spotify
Why Google’s Cloud?
• We believe Google’s industry-leading background
in Big Data technologies will give us a data
processing advantage
Google
Cloud
“Primitives”
BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
✦ Dremel (query execution engine)
✦ Colossus (distributed storage)
✦ Borg (distributed compute)
✦ Jupiter (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
BigQuery vs. Hive
• Example Query: Find the top 10 songs by
popularity in Spain during October
• BigQuery (1.50 TB processed): 108s
• Hive (15.5 TB processed): 2647s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
BigQuery vs. Hive (example #2)
• Example Query: Find the total hours of music
listening in Spain during October
• BigQuery (780 GB processed): 33s
• Hive (15.5 TB processed): 969s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
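The two comparisons above can be reduced to ratios. What follows is a small arithmetic sketch that uses only the figures quoted on these slides; the helper name `ratios` and the decimal TB-to-GB conversion are assumptions made for illustration, and, as the notes say, none of this is a thorough benchmark.

```python
# Figures copied from the two "BigQuery vs. Hive" slides (not a thorough benchmark).
GB_PER_TB = 1000  # assumption: decimal units

queries = {
    "top 10 songs": {
        "bq_s": 108, "hive_s": 2647,
        "bq_gb": 1.50 * GB_PER_TB, "hive_gb": 15.5 * GB_PER_TB,
    },
    "total listening hours": {
        "bq_s": 33, "hive_s": 969,
        "bq_gb": 780, "hive_gb": 15.5 * GB_PER_TB,
    },
}


def ratios(q):
    """Return (runtime ratio, data-processed ratio) of Hive relative to BigQuery."""
    return q["hive_s"] / q["bq_s"], q["hive_gb"] / q["bq_gb"]


for name, q in queries.items():
    speedup, scanned = ratios(q)
    print(f"{name}: BigQuery ~{speedup:.1f}x faster; Hive processed ~{scanned:.1f}x more data")
```

On these two queries BigQuery finished roughly 25 to 30 times sooner while processing 10 to 20 times less data.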
Top 10 Tracks in Spain during October 2016
Rank  Artist(s)                                  Track Name
1     J Balvin                                   Safari
2     DJ Snake                                   Let Me Love You
3     Ricky Martin                               Vente Pa' Ca
4     Sebastian Yatra                            Traicionera
5     Zion & Lennox (feat. J Balvin)             Otra Vez
6     Carlos Vives, Shakira                      La Bicicleta
7     The Chainsmokers                           Closer
8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
9     Sia                                        The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios
Time Spent Listening to
Spotify by users in Spain
during October
Nearly 10,000 Years!
BigQuery at Spotify
• Interactive and ad-hoc querying immediately
started to transfer to BigQuery once the data was
available on the cloud
• Pace of learning increases as friction to question
decreases
Cloud Pub/Sub
• At-least-once, globally distributed message queue
• For high-volume publish-subscribe workloads with a
small number of topics (<10,000)
• Like Kafka, but without needing to operate servers
and supporting services (ZooKeeper)
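At-least-once delivery means a subscriber can see the same message more than once, so consumers usually need to be idempotent. The following is a minimal in-memory sketch of that idea, not the Cloud Pub/Sub client API; the class `DedupingConsumer`, the message IDs, and the payloads are all hypothetical.

```python
class DedupingConsumer:
    """Processes each message ID once, even if the queue redelivers it."""

    def __init__(self, handler):
        self.handler = handler
        # Assumption: an unbounded in-memory set is fine for a sketch;
        # a real consumer would need bounded or durable storage.
        self.seen_ids = set()

    def receive(self, message_id, payload):
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.handler(payload)
        self.seen_ids.add(message_id)
        return True


processed = []
consumer = DedupingConsumer(processed.append)

# Simulated delivery stream in which message "m2" is redelivered.
deliveries = [("m1", "play"), ("m2", "pause"), ("m2", "pause"), ("m3", "skip")]
results = [consumer.receive(mid, payload) for mid, payload in deliveries]

print(processed)  # ['play', 'pause', 'skip']
print(results)    # [True, True, False, True]
```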
Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 latency of ingestion into Elasticsearch (ES): 500 ms
• Ingestion from globally distributed non-GCP
datacenters is painless
Cloud Dataflow
• Managed Service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and
MillWheel
• Programming model open-sourced as Apache Beam
(currently incubating)
Cloud Dataflow (Batch) at Spotify
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs
and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being
written in Scio/Dataflow
Cloud Dataflow (Streaming) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and
Storm workloads at Spotify
• Optimizes for consistency, which can complicate
real-time workloads
Spotify + Google Cloud Timeline
2015 to 2016
• Beginning of Google Cloud evaluation
• BigQuery begins to replace Hive
• Cloud Pub/Sub begins to replace Kafka
• Dataflow (streaming) begins to replace Storm
• Spotify + Google Cloud Announcement
• Dataflow (batch) replacing Map/Reduce
Note: Dates are approximations
Putting It All
Together
The Problem
• We want to detect within minutes if we’ve
introduced a bug in a client release that affects
critical event logging behavior
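One way to picture the check: compare per-minute counts of critical events after a release against a baseline and flag any event type that drops sharply. The sketch below is only an illustration of that comparison; the function `find_drops`, the 50% threshold, and the event names are assumptions, not details from the talk.

```python
def find_drops(baseline, current, threshold=0.5):
    """Return event types whose current per-minute count fell below
    `threshold` times the baseline count for the same event type."""
    drops = []
    for event_type, expected in baseline.items():
        observed = current.get(event_type, 0)
        if expected > 0 and observed < threshold * expected:
            drops.append(event_type)
    return sorted(drops)


# Hypothetical per-minute counts before and after a client release.
baseline = {"song_played": 1000, "ad_shown": 200, "login": 50}
after_release = {"song_played": 980, "ad_shown": 20, "login": 48}

print(find_drops(baseline, after_release))  # ['ad_shown']
```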
Before…
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
HOURS TO
INSIGHTS
Introducing “Pulsar”
• An internal name for the system aggregating data
from Access Points and feeding it into Cloud
Pub/Sub
• Replaces the Kafka real-time event feed
Pulsar
Pub/Sub
• Aggregates global event feed from Pulsar
• Makes data available to multiple zones in
milliseconds
Dataflow
• Subscribes to critical event Pub/Sub topics
• Aggregates events into one-minute windows
• Always running, no need to schedule or wait for
results
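The minute-window aggregation can be sketched in plain Python as fixed 60-second windows keyed by event time. This is only an illustration over a finite list, not Dataflow or Beam code: a real streaming job also has to handle late and out-of-order data, and the function name and sample events here are hypothetical.

```python
from collections import Counter


def minute_window_counts(events):
    """Count events per (window_start_seconds, event_type) in fixed 60 s windows."""
    counts = Counter()
    for timestamp_s, event_type in events:
        window_start = timestamp_s - (timestamp_s % 60)  # floor to the minute
        counts[(window_start, event_type)] += 1
    return dict(counts)


# Hypothetical (event time in seconds, event type) pairs.
events = [
    (0, "song_played"),
    (59, "song_played"),
    (60, "song_played"),
    (61, "ad_shown"),
    (125, "song_played"),
]

for key, count in sorted(minute_window_counts(events).items()):
    print(key, count)
```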
BigQuery
• Receives aggregates from Dataflow
• Allows for ad-hoc inspection or slicing on different
dimensions
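Slicing on a dimension amounts to a group-by over the aggregate rows. The sketch below mimics that in plain Python rather than BigQuery SQL; the row schema (`platform`, `client_version`, `count`) and the values are hypothetical.

```python
from collections import defaultdict


def slice_by(rows, dimension):
    """Sum the `count` column grouped by one dimension of the aggregate rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["count"]
    return dict(totals)


# Hypothetical per-minute aggregates as they might be written by the streaming job.
rows = [
    {"minute": 0, "platform": "ios", "client_version": "1.0", "count": 500},
    {"minute": 0, "platform": "android", "client_version": "1.0", "count": 450},
    {"minute": 0, "platform": "ios", "client_version": "1.1", "count": 20},
    {"minute": 60, "platform": "android", "client_version": "1.1", "count": 430},
]

print(slice_by(rows, "platform"))        # {'ios': 520, 'android': 880}
print(slice_by(rows, "client_version"))  # {'1.0': 950, '1.1': 450}
```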
Tableau
• Data visualization tool that integrates nicely with
BigQuery
• Pulls data from BigQuery periodically and caches for
quick inspection
Putting it all together
Milliseconds
to transfer
Milliseconds
to process
Seconds to
Query
SECONDS TO
INSIGHTS
Faster Insights
to Client
Behavior
Problem
As a developer, I want to be able to instantly explore
data being logged by the clients.
Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
Benefits
• Able to understand what’s being sent by the client
as it happens
• Exploring events, visualizing distribution (e.g. does
this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
Faster Insights
on New
Features
Problem
The previous dashboard is great for prototyping, but
what if you want all the data?
Solution
Allow developers to funnel feature-specific data to
their own Elasticsearch cluster
Dataflow to the Rescue!
• We created a library that allows teams to build
maps/filters with simple Java code
• Code gets translated into a Dataflow job
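To give a feel for what such a library might look like, here is a minimal sketch of a chainable map/filter builder. It is written in Python rather than Java and simply runs the steps locally; the class `EventPipeline`, its methods, and the event fields are hypothetical and do not describe Spotify's internal library or the Dataflow SDK.

```python
class EventPipeline:
    """Collects map/filter steps and applies them in order to each event."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self  # returning self allows chaining

    def filter(self, predicate):
        self.steps.append(("filter", predicate))
        return self

    def run(self, events):
        """Local stand-in for job translation: lazily apply the steps to an iterable."""
        for event in events:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    event = fn(event)
                elif not fn(event):
                    keep = False
                    break
            if keep:
                yield event


# A team keeps only its own feature's events and trims them to the fields it needs.
pipeline = (
    EventPipeline()
    .filter(lambda e: e["feature"] == "playlist_radio")
    .map(lambda e: {"user": e["user"], "action": e["action"]})
)

events = [
    {"feature": "playlist_radio", "user": "u1", "action": "start", "debug": True},
    {"feature": "search", "user": "u2", "action": "query", "debug": False},
]
output = list(pipeline.run(events))
print(output)  # [{'user': 'u1', 'action': 'start'}]
```

Keeping the team-facing surface to map and filter is what lets the streaming machinery stay hidden behind the library.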
Abstract Away the Complexity
No Ops!
• For our users:
• Event-feed managed through Cloud Pub/Sub
• Dataflow managed by Google
• Shared Elasticsearch cluster (managed by an
infra team)
Low Ops :/
• Dataflow is improving, but it’s had some stability
issues with streaming jobs
• Teams may need to set-up their own Elasticsearch
cluster if they require a higher SLA than the default
Other Uses
Ad Targeting
• Real-time genre targeting
• Session insights — explicit filter
Real-time Recommendations
Live Results for X-Factor
• X-Factor: television music
competition
• Contest songs get loaded onto
Spotify immediately after show
airs
• Listener behavior determines the
order of contestants on the playlist
Review
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
To leverage actionable
insights, we need a
faster feedback loop!
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Operational
Monitoring
Cloud to the Rescue!
• Spotify has leveled up our ability to gain actionable
insights by leveraging Google Cloud tools, such as
Pub/Sub, Dataflow and BigQuery
The Value of a Fast Feedback Loop
• Detecting problems early in data avoids long backfills or
long-term data loss
• Instant insights on newly developed features allow
teams to iterate more quickly and take risks
• Providing a quicker ad-hoc querying engine allows teams
to ask more questions and learn faster
Use Anything and Everything
• Open source and other cloud providers offer many
alternatives to the stack we’ve used
• Open source tools, like Elasticsearch/Kibana, and
proprietary solutions, like Tableau, have also been
useful additions
Where Are We Going?
• The real-time mission is in the early stages at
Spotify
Stream Processing First
• The sun never sets on Spotify, so why impose
boundaries on our datasets?
• What’s the shortest distance between two lines?
Zero!
• Can we reduce the feedback cycle to zero?
We’re Hiring!
Engineers, Managers, Product Owners
needed in NYC and Stockholm
https://www.spotify.com/jobs
Questions?

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time insights by Josh Baer

  • 1.
  • 2. Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem Has Evolved to Produce Real-time Insights. Josh Baer (jbx@spotify.com). Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
  • 3. Who am I? • Technical Product Owner at Spotify • Working with fast processing infrastructure • Previously, building out Spotify’s 2500 node Hadoop cluster @l_phant
  • 4. In 2008 • Spotify Launches • Instant Access to a gigantic catalog of music • Click to play instantaneous!
  • 7. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries
  • 8. “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 9. Real-time Processing • Batch Processing (Hadoop, Hive, BigQuery) • Operational Monitoring. Source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 10. To leverage actionable insights, we need a faster feedback loop!
  • 11. What is Spotify? • Music Streaming Service • Launched in 2008 • Premium and Free Tiers • Available in 59 Countries
  • 15. And we have Data
  • 16. Hadoop at Spotify • 2,500 Nodes • 130 PB Capacity • 120 TB Memory accessible by jobs • 20K Jobs/Day
  • 17. Apache Kafka at Spotify • 500 Kafka-related machines • 40 TB/day from logs
  • 18. “Real-Time” at Spotify • Storm Topologies fed via Kafka • Powering ✦ Ad Targeting ✦ Real-time recommendations ✦ Real-time stream counts
  • 20. In the Beginning… • Spotify was almost completely on-premise/bare metal • 2,500-node Hadoop cluster, over 10K machines in production at four globally distributed data centers • Grew with users: from 1M in 2009 to over 100M in 2016
  • 21. Why Move to the Cloud? • Cloud Providers have matured, decreasing in costs and increasing in reliability and variety of services offered • Owning and operating physical machines is not a competitive advantage for Spotify
  • 22. Why Google’s Cloud? • We believe Google’s industry leading background in Big Data technologies will give us a data processing advantage
  • 24. BigQuery • Ad-hoc and interactive querying service for massive datasets • Like Hive, but without needing to manage Hadoop and servers • Leverages Google’s internal tech: Dremel (query execution engine), Colossus (distributed storage), Borg (distributed compute), Jupiter (network) Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
  • 25. BigQuery vs. Hive • Example Query: Find the top 10 songs by popularity in Spain during October • BigQuery (1.50 TB processed): 108s • Hive (15.5 TB processed): 2647s Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark
  • 26. BigQuery vs. Hive (example #2) • Example Query: Find the total hours of music listening in Spain during October • BigQuery (780 GB processed): 33s • Hive (15.5 TB processed): 969s Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark
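The two benchmark slides above can be reduced to ratios with a little arithmetic. The sketch below only restates the slide figures; the dictionary layout and the `ratios` helper are illustrative names, and 780 GB is treated as 0.78 TB on the assumption that the slides use decimal units. As the slides themselves note, Hive was unoptimized, so these ratios are indicative rather than a benchmark result.

```python
# Figures copied from the two benchmark slides (slides 25 and 26).
# Assumption: 780 GB is converted to 0.78 TB using decimal units.
benchmarks = {
    "top 10 songs": {"bq_s": 108, "hive_s": 2647, "bq_tb": 1.50, "hive_tb": 15.5},
    "total listening hours": {"bq_s": 33, "hive_s": 969, "bq_tb": 0.78, "hive_tb": 15.5},
}


def ratios(row):
    """Return (wall-clock speedup, reduction in data scanned) for one query."""
    return row["hive_s"] / row["bq_s"], row["hive_tb"] / row["bq_tb"]


for name, row in benchmarks.items():
    speedup, scan_reduction = ratios(row)
    print(f"{name}: {speedup:.1f}x faster, {scan_reduction:.1f}x less data scanned")
```

On these numbers BigQuery finished roughly 25 to 30 times sooner while scanning roughly 10 to 20 times less data than the Hive run.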
  • 27. Top 10 Tracks in Spain during October 2016
      Rank | Artist(s) | Track Name
      1 | J Balvin | Safari
      2 | DJ Snake | Let Me Love You
      3 | Ricky Martin | Vente Pa' Ca
      4 | Sebastian Yatra | Traicionera
      5 | Zion & Lennox (feat. J Balvin) | Otra Vez
      6 | Carlos Vives, Shakira | La Bicicleta
      7 | The Chainsmokers | Closer
      8 | Major Lazer (feat. Justin Bieber & MØ) | Cold Water
      9 | Sia | The Greatest
      10 | IAmChino (feat. Pitbull, Yandel & Chacal) | Ay Mi Dios
  • 28. Time Spent Listening to Spotify by users in Spain during October Nearly 10,000 Years!
  • 29. BigQuery at Spotify • Interactive and ad-hoc querying immediately started to transfer to BQ once the data was available on the cloud • Pace of learning increases as friction to question decreases
  • 30. Cloud Pub/Sub • At-least-once, globally distributed message queue • For high-volume, low-topic-count (<10,000) publish/subscribe behavior • Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
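At-least-once delivery means a subscriber can occasionally see the same message twice, so consumers are usually written to tolerate redelivery. The following is a minimal sketch of that idea in plain Python; it does not use the Cloud Pub/Sub client library, and `IdempotentConsumer`, the message IDs, and the payloads are all hypothetical names chosen for illustration.

```python
class IdempotentConsumer:
    """Handles each message ID at most once, even if it is redelivered."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory only; a real consumer would bound or persist this set.
        self._seen = set()

    def receive(self, message_id, payload):
        """Return True if the payload was handled, False for a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        self._seen.add(message_id)
        return True


plays = []
consumer = IdempotentConsumer(plays.append)

# "m1" is delivered twice, as an at-least-once queue is allowed to do.
deliveries = [("m1", "play:a"), ("m2", "play:b"), ("m1", "play:a")]
results = [consumer.receive(mid, body) for mid, body in deliveries]
print(results, plays)
```

The ID is recorded only after the handler returns, so a handler that raises leaves the message eligible for a later retry.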
  • 31. Cloud Pub/Sub at Spotify • 800K events/second? No problem • P99 Latency of ingestions into ES: 500ms • Ingestion from globally distributed non-GCP datacenters is painless
  • 32. Cloud Dataflow • Managed Service for running batch and streaming jobs • Unified API for batch and streaming mode • Inspired by internal Google tools like FlumeJava and MillWheel • Programming model open-sourced as Apache Beam (currently incubating)
  • 33. Cloud Dataflow (Batch) at Spotify • Usually run via Scio: https://github.com/spotify/scio • Scio provides a Scala API for running Dataflow jobs and provides easy integrations with BigQuery • New batch processing jobs @Spotify are being written in Scio/Dataflow
  • 34. Cloud Dataflow (Streaming) at Spotify • Exactly-once stream processing framework • A replacement for Spark/Flink streaming and Storm workloads at Spotify • Optimizes for consistency, which can complicate real-time workloads
  • 35. Spotify + Google Cloud Timeline (2015–2016) • Beginning of Google Cloud evaluation • BigQuery begins to replace Hive • Cloud Pub/Sub begins to replace Kafka • Dataflow (streaming) begins to replace Storm • Spotify + Google Cloud Announcement • Dataflow (batch) replacing Map/Reduce Note: Dates are approximations
  • 37. The Problem • We want to detect within minutes if we’ve introduced a bug in a client release that affects critical event logging behavior
  • 38. Before… Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries HOURS TO INSIGHTS
  • 39. Introducing “Pulsar” • An internal name for the system aggregating data from Access Points and feeding it into Cloud Pub/Sub • Replaces the Kafka real-time event feed
  • 41. Pub/Sub • Aggregates global event feed from Pulsar • Makes data available to multiple zones in milliseconds
  • 42. Dataflow • Subscribes to critical event Pub/Sub topics • Aggregates events into minute windows • Always running, no need to schedule or wait for results
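Aggregating events into minute windows amounts to bucketing each event by the start of the minute it falls in and counting per bucket. The sketch below illustrates that with plain Python over an in-memory list; it is not the Dataflow or Scio API, and the event tuples, `window_start`, and `count_per_minute` are hypothetical. A real streaming job would also have to deal with late and out-of-order events, which this ignores.

```python
from collections import Counter

WINDOW_SECONDS = 60  # fixed one-minute windows, as on the slide


def window_start(timestamp):
    """Round an epoch-seconds timestamp down to the start of its window."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def count_per_minute(events):
    """Count events per (window start, event type).

    `events` is an iterable of (epoch_seconds, event_type) pairs.
    """
    counts = Counter()
    for timestamp, event_type in events:
        counts[(window_start(timestamp), event_type)] += 1
    return counts


events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = count_per_minute(events)
for (start, event_type), n in sorted(counts.items()):
    print(start, event_type, n)
```

A sudden drop in the per-minute count for a critical event type after a client release is the kind of signal such aggregates make visible.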
  • 43. BigQuery • Receives aggregates from Dataflow • Allows for ad-hoc inspection or slicing on different dimensions
  • 44. Tableau • Data Visualization Tool that integrates nicely with BigQuery • Pulls data from BigQuery periodically and caches for quick inspection
  • 45. Putting it all together
  • 46. Putting it all together Milliseconds to transfer Milliseconds to process Seconds to Query SECONDS TO INSIGHTS
  • 47. Putting it all together
  • 49. Problem: As a developer, I want to be able to instantly explore data being logged by the clients.
  • 50. Solution • Produce a topic for all employee client events • Store in Elasticsearch • Visualize in Kibana
  • 51.
  • 52.
  • 53. Benefits • Able to understand what’s being sent by the client as it happens • Exploring events, visualizing distribution (e.g. does this field actually get populated?) • Prototyping analysis based on a sample • Dashboards for Employee Releases
  • 55. Problem: The previous dashboard is great for prototyping, but what if you want all the data?
  • 56. Solution: Allow developers to funnel feature-specific data to their own Elasticsearch cluster
  • 57. Dataflow to the Rescue! • We created a library that allows teams to build maps/filters with simple Java code • Code gets translated into a Dataflow job
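The slides do not show the library itself, so the following is only a guess at the general shape of such an abstraction: teams declare a chain of filters and maps, and something else turns that chain into a running job. It is written in Python for brevity, although the slide says the real library uses Java, and `EventPipeline`, the event fields, and the `search` feature are all hypothetical.

```python
class EventPipeline:
    """Collects filter/map steps and applies them to a stream of events."""

    def __init__(self):
        self._steps = []

    def filter(self, predicate):
        """Keep only events for which predicate(event) is true."""
        self._steps.append(("filter", predicate))
        return self

    def map(self, fn):
        """Replace each event with fn(event)."""
        self._steps.append(("map", fn))
        return self

    def run(self, events):
        """Apply the steps in order; here locally, not on Dataflow."""
        for event in events:
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(event):
                        keep = False
                        break
                else:
                    event = fn(event)
            if keep:
                yield event


pipeline = (
    EventPipeline()
    .filter(lambda e: e["feature"] == "search")
    .map(lambda e: {"user": e["user"], "query": e["query"].lower()})
)

events = [
    {"feature": "search", "user": "u1", "query": "Safari"},
    {"feature": "playback", "user": "u2", "query": ""},
]
output = list(pipeline.run(events))
print(output)
```

Keeping the steps as data rather than executing them immediately is what would let a library translate the same declaration into a managed streaming job.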
  • 58. Abstract Away the Complexity
  • 59.
  • 60. No Ops! • For our users: • Event-feed managed through Cloud Pub/Sub • Dataflow managed by Google • Shared Elasticsearch cluster (managed by an infra team)
  • 61. Low Ops :/ • Dataflow is improving, but it’s had some stability issues with streaming jobs • Teams may need to set up their own Elasticsearch cluster if they require a higher SLA than default
  • 63. Ad Targeting • Real-time genre targeting • Session insights — explicit filter
  • 65. Live Results for X-Factor • X-Factor: television music competition • Contest songs get loaded onto Spotify immediately after show airs • Listener behavior determines the order of contestants on the playlist
  • 67. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries
  • 68. To leverage actionable insights, we need a faster feedback loop!
  • 69. Real-time Processing • Batch Processing (Hadoop, Hive, BigQuery) • Operational Monitoring. Source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 70. Cloud to the Rescue! • Spotify has leveled up our ability to gain actionable insights by leveraging Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery
  • 71. The Value of a Fast Feedback Loop • Detecting problems early in data avoids long backfills or long-term data loss • Instant insights on newly developed features allow teams to iterate quicker and take risks • Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster
  • 72. Use Anything and Everything • Open source and other cloud providers offer many alternatives to the stack we’ve used • Open source tools, like Elasticsearch/Kibana, and proprietary solutions, like Tableau, have also been useful additions
  • 73. Where Are We Going? • The real-time mission is in the early stages at Spotify
  • 74. Stream Processing First • The sun never sets on Spotify, why impose boundaries on our datasets? • What’s the shortest distance between two lines? Zero! • Can we reduce the feedback cycle to zero?
  • 75. We’re Hiring! Engineers, Managers, Product Owners needed in NYC and Stockholm https://www.spotify.com/jobs