SlideShare une entreprise Scribd logo
1  sur  18
Télécharger pour lire hors ligne
Data Workflows
at Foursquare
using Luigi
Foursquare
•  35 million users
•  Nearly 4 billion check-ins
•  More than 5 million check-ins per day
•  50 million point-of-interest database
•  100's of GB of log data per day
Tools We Use
•  Hive
o  Ad hoc analytics, data dumping ground
•  Raw MapReduce
o  100's of MapReduce jobs in our codebase
•  Pig
o  Fits between structure Hive and free-form
MapReduce
•  Vertica
o  Low latency analytics
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
Cron - Problems
•  Brittle
•  Hard to reason about / visualize
•  Spend a lot of time waiting
•  Difficult to tell what succeeded or failed
•  No one likes writing Bash scripts
Oozie
XML-based Workflow Engine, with support for
Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g
"Run this Hive query, then run these two
MapReduce jobs in parallel"
Coordinators launch recurring workflows at a
given frequency, when dependent data is
available
Oozie - Example
Oozie - Problems
•  Workflows are all-or-nothing
o  Cannot just run step that failed
o  Very little code reuse
•  Little to no extensibility
•  Limited control flow
•  Extremely verbose
•  Difficult to test
•  No one likes writing XML
Luigi
•  Python framework for batch processing jobs
•  Created by Spotify, open-sourced Sept. 2012
•  Tasks are units of work that produce Targets
•  Tasks can depend on one or more other Tasks
•  A Task is only run if all of its dependent Tasks are done
•  Tasks are idempotent
Luigi - Example Task
Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
Luigi - Scheduler
Central scheduler ensures each Task is only
run by a single worker.
A task is uniquely identified by its class name
and its Parameters, e.g.
WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
Luigi - Visualizer
Luigi - Visualizer
Luigi - Visualizer
Luigi - Advantages over Cron
•  Explicit dependencies
•  No wasted time waiting
•  Easy to tell what has failed
•  Avoid duplicate work / partial failures
Luigi - Advantages over Oozie
•  Explicit dependencies between workflows
•  Easier to write
•  Vastly more extensible
•  Code reuse
•  Can easily re-run individual steps
Thank you!
Check out Luigi:
https://github.com/spotify/luigi
Drop me a line:
Joe Ennever
jennever@foursquare.com

Contenu connexe

Tendances

Tendances (20)

Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 
Virtual training Intro to InfluxDB & Telegraf
Virtual training  Intro to InfluxDB & TelegrafVirtual training  Intro to InfluxDB & Telegraf
Virtual training Intro to InfluxDB & Telegraf
 
How to Use Telegraf and Its Plugin Ecosystem
How to Use Telegraf and Its Plugin EcosystemHow to Use Telegraf and Its Plugin Ecosystem
How to Use Telegraf and Its Plugin Ecosystem
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & Observability
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 

Similaire à Luigi presentation OA Summit

Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
Bill Buchan
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
Tomas Doran
 

Similaire à Luigi presentation OA Summit (20)

Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social games
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
ProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.xProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.x
 
Profiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsProfiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty Details
 
Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)
 
APIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidadAPIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidad
 
SGCE 2015 REST APIs
SGCE 2015 REST APIsSGCE 2015 REST APIs
SGCE 2015 REST APIs
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
 
Julia Computing - an alternative to Hadoop
Julia Computing - an alternative to HadoopJulia Computing - an alternative to Hadoop
Julia Computing - an alternative to Hadoop
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1
 
CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development
 
Serverless for High Performance Computing
Serverless for High Performance ComputingServerless for High Performance Computing
Serverless for High Performance Computing
 
Reactive Development: Commands, Actors and Events. Oh My!!
Reactive Development: Commands, Actors and Events.  Oh My!!Reactive Development: Commands, Actors and Events.  Oh My!!
Reactive Development: Commands, Actors and Events. Oh My!!
 

Plus de Open Analytics

MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
Open Analytics
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Open Analytics
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & Personalization
Open Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital Economy
Open Analytics
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)
Open Analytics
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)
Open Analytics
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
Open Analytics
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Open Analytics
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Open Analytics
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)
Open Analytics
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
Open Analytics
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics Meetup
Open Analytics
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetup
Open Analytics
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final
Open Analytics
 

Plus de Open Analytics (20)

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & Personalization
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital Economy
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics Meetup
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetup
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final
 

Luigi presentation OA Summit

  • 2. Foursquare •  35 million users •  Nearly 4 billion check-ins •  More than 5 million check-ins per day •  50 million point-of-interest database •  100's of GB of log data per day
  • 3. Tools We Use •  Hive o  Ad hoc analytics, data dumping ground •  Raw MapReduce o  100's of MapReduce jobs in our codebase •  Pig o  Fits between structure Hive and free-form MapReduce •  Vertica o  Low latency analytics
  • 4. Cron E.g. 0 0 * * * ./hadoop-script-1.sh # Wait two hours for that job to finish... 0 2 * * * ./hadoop-script-2.sh # And on and on and on
  • 5. Cron - Problems •  Brittle •  Hard to reason about / visualize •  Spend a lot of time waiting •  Difficult to tell what succeeded or failed •  No one likes writing Bash scripts
  • 6. Oozie XML-based Workflow Engine, with support for Hadoop, Hive, and Pig Workflows specify computations in a DAG, e.g "Run this Hive query, then run these two MapReduce jobs in parallel" Coordinators launch recurring workflows at a given frequency, when dependent data is available
  • 8. Oozie - Problems •  Workflows are all-or-nothing o  Cannot just run step that failed o  Very little code reuse •  Little to no extensibility •  Limited control flow •  Extremely verbose •  Difficult to test •  No one likes writing XML
  • 9. Luigi •  Python framework for batch processing jobs •  Created by Spotify, open-sourced Sept. 2012 •  Tasks are units of work that produce Targets •  Tasks can depend on one or more other Tasks •  A Task is only run if all of its dependent Tasks are done •  Tasks are idempotent
  • 11. Luigi - Running the Task $ python word-count.py WordCount --date 2013-06-01
  • 12. Luigi - Scheduler Central scheduler ensures each Task is only run by a single worker. A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01) Will retry failed Tasks after a configured timeout Emails someone when a Task fails
  • 16. Luigi - Advantages over Cron •  Explicit dependencies •  No wasted time waiting •  Easy to tell what has failed •  Avoid duplicate work / partial failures
  • 17. Luigi - Advantages over Oozie •  Explicit dependencies between workflows •  Easier to write •  Vastly more extensible •  Code reuse •  Can easily re-run individual steps
  • 18. Thank you! Check out Luigi: https://github.com/spotify/luigi Drop me a line: Joe Ennever jennever@foursquare.com