SlideShare a Scribd company logo
1 of 34
Download to read offline
Luigi
Batch data processing in
Python

Elias Freider
freider@spotify.com
Background
Why did we build Luigi?


We have a lot of data
 Billions of log messages (several TBs) every day
 Usage and backend stats, debug information
 What we want to do
   AB-testing
   Music recommendations
   Monthly/daily/hourly reporting
   Business metric dashboards
 We experiment a lot – need quick development cycles
How to do it?

Hadoop
  Data storage
  Clean, filter, join and aggregate data
Postgres
  Semi-aggregated data for dashboards
Cassandra
  Time series data
Defining a data flow
A bunch of data processing tasks with inter-dependencies
Example: Artist Toplist
Input is a list of Timestamp, Track, Artist tuples




                                 Artist
      Streams
                              Aggregation




                                  Top 10             Database
Example: Artist Toplist
Naive (non-luigi) approach
Errors will occur
If a failed step leaves behind broken data, we want to clean that up
Avoid duplicate work
The second step fails, you fix it, then you want to resume
Parametrization
To use data flows as command line tools
Repeat!
You want to run the dataflow for a set of similar inputs
Introducing Luigi
A Python framework for data flow definition and execution




Main features

  Dead-simple dependency definitions (think GNU make)
  Emphasis on Hadoop/HDFS integration
  Atomic file operations
  Powerful task templating using OOP
  Data flow visualization
  Command line integration
Luigi - Aggregate Artists
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running   AggregateArtists()
INFO: [pid 74375] Done      AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Task templates and targets
Reduces code-repetition by utilizing Luigi’s object oriented data model




Luigi comes with pre-implemented Tasks for...

  Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
  Running Hive and (soon) Pig queries
  Inserting data sets into Postgres


Writing new ones are as easy as defining an interface and
implementing run()
Hadoop MapReduce
Built-in Hadoop Streaming Python framework




Features

 Very slim interface
 Fetches error logs from Hadoop cluster and displays them to the user
 Class instance variables can be referenced in MapReduce code, which makes it
 easy to supply extra data in dictionaries etc. for map side joins
 Easy to send along Python modules that might not be installed on the cluster
 Basic support for secondary sort
 Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
Hadoop MapReduce
Built-in Hadoop Streaming Python framework
Attach Python modules
So you can use local modules remotely on the cluster
Local state is transferred
Set up local variables for map-side joins etc.
Completing the top list
Top 10 artists - Wrapped arbitrary Python code
Database support
Basic functionality for exporting to Postgres. Cassandra support is in the works
Running it all...
   DEBUG: Checking if ArtistToplistToDatabase() is complete
   INFO: Scheduled ArtistToplistToDatabase()
   DEBUG: Checking if Top10Artists() is complete
   INFO: Scheduled Top10Artists()
   DEBUG: Checking if AggregateArtists() is complete
   INFO: Scheduled AggregateArtists()
   DEBUG: Checking if Streams() is complete
   INFO: Done scheduling tasks
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 3
   INFO: [pid 74811] Running   AggregateArtists()
   INFO: [pid 74811] Done      AggregateArtists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 2
   INFO: [pid 74811] Running   Top10Artists()
   INFO: [pid 74811] Done      Top10Artists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 1
   INFO: [pid 74811] Running   ArtistToplistToDatabase()
   INFO: Done writing, importing at 2013-03-13 15:41:09.407138
   INFO: [pid 74811] Done      ArtistToplistToDatabase()
   DEBUG: Asking scheduler for work...
   INFO: Done
   INFO: There are no more tasks to run at this time
Data flow visualization
Light-weight scheduler daemon with a web interface
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic




Tasks have implicit __init__




   Generates command line interface with typing and documentation
   $ python dataflow.py AggregateArtists --date 2013-03-05
Task Parameters
Combined usage example
Task Parameters
Makes it really easy to create aggregates over time
Multiple workers
 Basic multi-processing



$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows
            (Screenshot from web interface)




                                                                                                                                                                                                                                                                                                                  EndSongCleaned                                                                       EndSongCleaned
                                                                                                                                                                                                                                                                                                                  (date=2013-03-16)                                                                   (date=2013-03-15)




                                                                                                                                                                                                AggregateTracks                                                                                                             AggregateTracks                                                 UserLocationDay                                  UserLocationDay              UserLocationDay        UserLocationDay         UserLocationDay           UserLocationDay               UserLocationDay      UserLocationDay     UserLocationDay     UserLocationDay     UserLocationDay
                                                                                                                                                                                                   (test=False,                MasterMetadata                                                                                    (test=False,                                                    (test=False,                                   (test=False,                (test=False,           (test=False,            (test=False,               (test=False,                    (test=False,      (test=False,        (test=False,        (test=False,        (test=False,
                                                                                                                                                                                                date=2013-03-16,              (date=2013-03-15)                                                                            date=2013-03-15,                                                 date=2013-03-16,                                 date=2013-03-14,            date=2013-03-13,        date=2013-03-08,        date=2013-03-12,          date=2013-03-07,              date=2013-03-11,     date=2013-03-10,    date=2013-03-09,    date=2013-03-15,    date=2013-03-06,
                                                                                                                                                                                                test_users=False)                                                                                                          test_users=False)                                                test_users=False)                                test_users=False)           test_users=False)       test_users=False)       test_users=False)         test_users=False)             test_users=False)    test_users=False)   test_users=False)   test_users=False)   test_users=False)




                                                                                                                                                                                              JoinArtistGids                                 JoinArtistGids                                                                                                                                                                                                                                                                                       UserLocationPeriod             UserLocationPeriod
                                                                                                                                                                                               (test=False,                                     (test=False,                                                                                                                                                                                                                                                                                         (test=False,                     (test=False,
                                                                                                                                                                                            date=2013-03-16,                               date=2013-03-15,                                                                                                                                                                                                                                                                                           period=10,                        period=10,
                                                                                                                                                                                               nshards=12,                                   nshards=12,                                                                                                                                                                                                                                                                                          date=2013-03-17,               date=2013-03-16,
                                                                                                                                                                                            test_users=False)                              test_users=False)                                                                                                                                                                                                                                                                                      test_users=False)              test_users=False)




  AggregateUserMatrices       AggregateUserMatrices       AggregateUserMatrices                    AggregateUserMatrices               AggregateUserMatrices      AggregateUserMatrices
                                                                                                                                                                                            AggregateByArtists      AggregateByArtists                                        AggregateByArtists               AggregateByArtists               AggregateByArtists             AggregateByArtists                AggregateByArtists            AggregateByArtists           AggregateByArtists      AggregateByArtists       AggregateByArtists
       (test=False,                (test=False,                (test=False,                              (test=False,                        (test=False,              (test=False,
                                                                                                                                                                                               (test=False,            (test=False,                ArtistFollows                 (test=False,                     (test=False,                      (test=False,                  (test=False,                        (test=False,                (test=False,                  (test=False,            (test=False,            (test=False,
    date=2013-03-16,            date=2013-03-14,            date=2013-03-13,                          date=2013-03-15,                   date=2013-03-12,           date=2013-03-11,
                                                                                                                                                                                            date=2013-03-16,        date=2013-03-13,            (date=2013-03-14)             date=2013-03-14,                 date=2013-03-09,                 date=2013-03-08,               date=2013-03-15,                  date=2013-03-12,              date=2013-03-11,              date=2013-03-07,        date=2013-03-10,         date=2013-03-06,
index_version=1362433077,   index_version=1362433077,   index_version=1362433077,               index_version=1362433077,           index_version=1362433077,   index_version=1362433077,
                                                                                                                                                                                            test_users=False)       test_users=False)                                         test_users=False)                test_users=False)                test_users=False)              test_users=False)                 test_users=False)             test_users=False)             test_users=False)       test_users=False)       test_users=False)
    test_users=False)           test_users=False)           test_users=False)                         test_users=False)                  test_users=False)          test_users=False)




                                                                              AccumulateUserMatrices                         AccumulateUserMatrices                                                                                                     AccumulateByArtists                                                                                                                                                                                         AccumulateByArtists
                                                                                    (test=False,                                   (test=False,                                                                                                                (test=False,                                                                                                                                                                                             (test=False,
                                                                                                                                                                                                                                                                                            JoinArtistGids                                                                JoinArtistGids                                              JoinArtistGids                                                                                                                 JoinArtistGids
                                                                                 date=2013-03-16,                               date=2013-03-15,                                                                                                         date=2013-03-16,                                                          PlaylistVectors                                                                                                                   date=2013-03-15,
                                                                                                                                                                                                                                                                                                (test=False,                                                               (test=False,                                                (test=False,                                                                                                                   (test=False,
                                                                          index_version=1362433077,                         index_version=1362433077,                                                                                                    test_users=False,                                                           (test=False,                                                    RelatedArtistsTC                                                test_users=False,
                                                                                                                                                                                                                                                                                          date=2013-03-14,                                                              date=2013-03-13,                                             date=2013-03-12,                                                                                                               date=2013-03-11,
                                                                                 test_users=False,                              test_users=False,                                                                                                          max_days=90,                                                           date=2013-03-14,                                                              ()                                                    max_days=90,
                                                                                                                                                                                                                                                                                            nshards=12,                                                                   nshards=12,                                                  nshards=12,                                                                                                                    nshards=12,
                                                                                 decay_factor=0.99,                             decay_factor=0.99,                                                                                                      decay_factor=0.99,                                                index_version=1362433077)                                                                                                                 decay_factor=0.99,
                                                                                                                                                                                                                                                                                         test_users=False)                                                              test_users=False)                                            test_users=False)                                                                                                              test_users=False)
                                                                                      days=5,                                        days=5,                                                                                                                    days=10,                                                                                                                                                                                                 days=10,
                                                                              build_from_scratch=True)                       build_from_scratch=True)                                                                                                build_from_scratch=True)                                                                                                                                                                                    build_from_scratch=True)




                                                                                                                                                                                                                                                                                                                            UserRecs                                                                                 UserRecs
                                                                                                                                                                                                                                                                                                                            (test=False,                                                                         (test=False,
                                                                                                                                                                                                                                                                                                                        date=2013-03-17,                                                                   date=2013-03-16,
                                                                                                                                                                                                                                                                                                                           rec_days=5,                                                                           rec_days=5,
                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                          exp_days=10,                                                                          exp_days=10,
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-15,
                                                                                                                                                                                                                                                                                                                        test_users=False,                                                                  test_users=False,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                       force_updates=False,                                                              force_updates=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
                                                                                                                                                                                                                                                                                                                     build_from_scratch=True,                                                          build_from_scratch=True,
                                                                                                                                                                                                                                                                                                                index_path=/spotify/discover/index,                                                index_path=/spotify/discover/index,
                                                                                                                                                                                                                                                                                                                       index_version=None,                                                                index_version=None,
                                                                                                                                                                                                                                                                                                                     FOLLOWS_SCORE=5.0)                                                                 FOLLOWS_SCORE=5.0)




                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-16,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)




                                                                                                                                                                                                                                                                                                                                                     IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                    (date=2013-03-17,
                                                                                                                                                                                                                                                                                                                                                     test_users=False,
                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
Really large...
Visualization needs some work
Error notifications
Great for automated execution
Process Synchronization
Prevents two identical tasks from running simultaneously


                                luigid
                       Simple task synchronization




             Data flow 1                   Data flow 2




                        Common dependency
                             Task
What we use Luigi for
Not just another Hadoop streaming framework!

 Hadoop Streaming
 Java Hadoop MapReduce
 Hive
 Pig
 Local (non-hadoop) data processing
 Import/Export data to/from Postgres
 Insert data into Cassandra
 scp/rsync/ftp data files and reports
 Dump and load databases
Others using it with Scala MapReduce and MRJob as well
Luigi is open source

http://github.com/spotify/luigi
Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012
Thank you!
For more information feel free to reach out at
http://github.com/spotify/luigi

Oh, and we’re hiring – http://spotify.com/jobs



Elias Freider
freider@spotify.com

More Related Content

What's hot

Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Ken SASAKI
 
忙しい人の5分で分かるMesos入門 - Mesos って何だ?
忙しい人の5分で分かるMesos入門 - Mesos って何だ?忙しい人の5分で分かるMesos入門 - Mesos って何だ?
忙しい人の5分で分かるMesos入門 - Mesos って何だ?Masahito Zembutsu
 
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)NTT DATA Technology & Innovation
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話Kentaro Yoshida
 
Gocon2017:Goのロギング周りの考察
Gocon2017:Goのロギング周りの考察Gocon2017:Goのロギング周りの考察
Gocon2017:Goのロギング周りの考察貴仁 大和屋
 
[External] 2021.12.15 コンテナ移行の前に知っておきたいこと @ gcpug 湘南
[External] 2021.12.15 コンテナ移行の前に知っておきたいこと  @ gcpug 湘南[External] 2021.12.15 コンテナ移行の前に知っておきたいこと  @ gcpug 湘南
[External] 2021.12.15 コンテナ移行の前に知っておきたいこと @ gcpug 湘南Google Cloud Platform - Japan
 
AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)
AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)
AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
 
[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例
[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例
[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例Naoya Kishimoto
 
ネットワーク ゲームにおけるTCPとUDPの使い分け
ネットワーク ゲームにおけるTCPとUDPの使い分けネットワーク ゲームにおけるTCPとUDPの使い分け
ネットワーク ゲームにおけるTCPとUDPの使い分けモノビット エンジン
 
分散システムの限界について知ろう
分散システムの限界について知ろう分散システムの限界について知ろう
分散システムの限界について知ろうShingo Omura
 
メタプログラミングって何だろう
メタプログラミングって何だろうメタプログラミングって何だろう
メタプログラミングって何だろうKota Mizushima
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたRyoji Kurosawa
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームKouhei Sutou
 
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送Google Cloud Platform - Japan
 
入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)
入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)
入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)NTT DATA Technology & Innovation
 
BigQuery Query Optimization クエリ高速化編
BigQuery Query Optimization クエリ高速化編BigQuery Query Optimization クエリ高速化編
BigQuery Query Optimization クエリ高速化編sutepoi
 

What's hot (20)

Hadoopの概念と基本的知識
Hadoopの概念と基本的知識Hadoopの概念と基本的知識
Hadoopの概念と基本的知識
 
忙しい人の5分で分かるMesos入門 - Mesos って何だ?
忙しい人の5分で分かるMesos入門 - Mesos って何だ?忙しい人の5分で分かるMesos入門 - Mesos って何だ?
忙しい人の5分で分かるMesos入門 - Mesos って何だ?
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
PostgreSQLをKubernetes上で活用するためのOperator紹介!(Cloud Native Database Meetup #3 発表資料)
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話爆速クエリエンジン”Presto”を使いたくなる話
爆速クエリエンジン”Presto”を使いたくなる話
 
Gocon2017:Goのロギング周りの考察
Gocon2017:Goのロギング周りの考察Gocon2017:Goのロギング周りの考察
Gocon2017:Goのロギング周りの考察
 
[External] 2021.12.15 コンテナ移行の前に知っておきたいこと @ gcpug 湘南
[External] 2021.12.15 コンテナ移行の前に知っておきたいこと  @ gcpug 湘南[External] 2021.12.15 コンテナ移行の前に知っておきたいこと  @ gcpug 湘南
[External] 2021.12.15 コンテナ移行の前に知っておきたいこと @ gcpug 湘南
 
AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)
AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)
AlloyDBを触ってみた!(第33回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
 
[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例
[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例
[CEDEC 2021] 運用中タイトルでも怖くない! 『メルクストーリア』におけるハイパフォーマンス・ローコストなリアルタイム通信技術の導入事例
 
ネットワーク ゲームにおけるTCPとUDPの使い分け
ネットワーク ゲームにおけるTCPとUDPの使い分けネットワーク ゲームにおけるTCPとUDPの使い分け
ネットワーク ゲームにおけるTCPとUDPの使い分け
 
分散システムの限界について知ろう
分散システムの限界について知ろう分散システムの限界について知ろう
分散システムの限界について知ろう
 
メタプログラミングって何だろう
メタプログラミングって何だろうメタプログラミングって何だろう
メタプログラミングって何だろう
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみた
 
Apache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォームApache Arrow - データ処理ツールの次世代プラットフォーム
Apache Arrow - データ処理ツールの次世代プラットフォーム
 
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
[Cloud OnAir] BigQuery の仕組みからベストプラクティスまでのご紹介 2018年9月6日 放送
 
入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)
入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)
入門 Kubeflow ~Kubernetesで機械学習をはじめるために~ (NTT Tech Conference #4 講演資料)
 
BigQuery Query Optimization クエリ高速化編
BigQuery Query Optimization クエリ高速化編BigQuery Query Optimization クエリ高速化編
BigQuery Query Optimization クエリ高速化編
 
NTT DATA と PostgreSQL が挑んだ総力戦
NTT DATA と PostgreSQL が挑んだ総力戦NTT DATA と PostgreSQL が挑んだ総力戦
NTT DATA と PostgreSQL が挑んだ総力戦
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Recently uploaded (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Luigi PyData presentation

  • 1. Luigi Batch data processing in Python Elias Freider freider@spotify.com
  • 2. Background Why did we build Luigi? We have a lot of data Billions of log messages (several TBs) every day Usage and backend stats, debug information What we want to do AB-testing Music recommendations Monthly/daily/hourly reporting Business metric dashboards We experiment a lot – need quick development cycles
  • 3. How to do it? Hadoop Data storage Clean, filter, join and aggregate data Postgres Semi-aggregated data for dashboards Cassandra Time series data
  • 4. Defining a data flow A bunch of data processing tasks with inter-dependencies
  • 5. Example: Artist Toplist Input is a list of Timestamp, Track, Artist tuples Artist Streams Aggregation Top 10 Database
  • 6. Example: Artist Toplist Naive (non-luigi) approach
  • 7. Errors will occur If a failed step leaves behind broken data, we want to clean that up
  • 8. Avoid duplicate work The second step fails, you fix it, then you want to resume
  • 9. Parametrization To use data flows as command line tools
  • 10. Repeat! You want to run the dataflow for a set of similar inputs
  • 11. Introducing Luigi A Python framework for data flow definition and execution Main features Dead-simple dependency definitions (think GNU make) Emphasis on Hadoop/HDFS integration Atomic file operations Powerful task templating using OOP Data flow visualization Command line integration
  • 12. Luigi - Aggregate Artists
  • 13. Luigi - Aggregate Artists Run on the command line: $ python dataflow.py AggregateArtists DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74375] Running AggregateArtists() INFO: [pid 74375] Done AggregateArtists() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 14. Task templates and targets Reduces code-repetition by utilizing Luigi’s object oriented data model Luigi comes with pre-implemented Tasks for... Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files Running Hive and (soon) Pig queries Inserting data sets into Postgres Writing new ones are as easy as defining an interface and implementing run()
  • 15. Hadoop MapReduce Built-in Hadoop Streaming Python framework Features Very slim interface Fetches error logs from Hadoop cluster and displays them to the user Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins Easy to send along Python modules that might not be installed on the cluster Basic support for secondary sort Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
  • 16. Hadoop MapReduce Built-in Hadoop Streaming Python framework
  • 17. Attach Python modules So you can use local modules remotely on the cluster
  • 18. Local state is transferred Set up local variables for map-side joins etc.
  • 19. Completing the top list Top 10 artists - Wrapped arbitrary Python code
  • 20. Database support Basic functionality for exporting to Postgres. Cassandra support is in the works
  • 21. Running it all... DEBUG: Checking if ArtistToplistToDatabase() is complete INFO: Scheduled ArtistToplistToDatabase() DEBUG: Checking if Top10Artists() is complete INFO: Scheduled Top10Artists() DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 3 INFO: [pid 74811] Running AggregateArtists() INFO: [pid 74811] Done AggregateArtists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 74811] Running Top10Artists() INFO: [pid 74811] Done Top10Artists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74811] Running ArtistToplistToDatabase() INFO: Done writing, importing at 2013-03-13 15:41:09.407138 INFO: [pid 74811] Done ArtistToplistToDatabase() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 22. Data flow visualization Light-weight scheduler daemon with a web interface
  • 23. The results Imagine how cool this would be with real data...
  • 24. Task Parameters Class variables with some magic Tasks have implicit __init__ Generates command line interface with typing and documentation $ python dataflow.py AggregateArtists --date 2013-03-05
  • 26. Task Parameters Makes it really easy to create aggregates over time
  • 27. Multiple workers Basic multi-processing $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  • 28. Large data flows (Screenshot from web interface) EndSongCleaned EndSongCleaned (date=2013-03-16) (date=2013-03-15) AggregateTracks AggregateTracks UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay (test=False, MasterMetadata (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, (date=2013-03-15) date=2013-03-15, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-08, date=2013-03-12, date=2013-03-07, date=2013-03-11, date=2013-03-10, date=2013-03-09, date=2013-03-15, date=2013-03-06, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) JoinArtistGids JoinArtistGids UserLocationPeriod UserLocationPeriod (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-15, period=10, period=10, nshards=12, nshards=12, date=2013-03-17, date=2013-03-16, test_users=False) test_users=False) test_users=False) test_users=False) AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, ArtistFollows (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-16, date=2013-03-13, (date=2013-03-14) date=2013-03-14, date=2013-03-09, date=2013-03-08, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-07, date=2013-03-10, date=2013-03-06, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) AccumulateUserMatrices AccumulateUserMatrices AccumulateByArtists AccumulateByArtists (test=False, (test=False, (test=False, (test=False, JoinArtistGids JoinArtistGids JoinArtistGids JoinArtistGids date=2013-03-16, date=2013-03-15, date=2013-03-16, PlaylistVectors date=2013-03-15, (test=False, (test=False, (test=False, (test=False, index_version=1362433077, index_version=1362433077, test_users=False, (test=False, RelatedArtistsTC test_users=False, date=2013-03-14, date=2013-03-13, date=2013-03-12, date=2013-03-11, test_users=False, test_users=False, max_days=90, date=2013-03-14, () max_days=90, nshards=12, nshards=12, nshards=12, nshards=12, decay_factor=0.99, decay_factor=0.99, decay_factor=0.99, index_version=1362433077) decay_factor=0.99, test_users=False) test_users=False) test_users=False) test_users=False) days=5, days=5, days=10, days=10, build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) UserRecs UserRecs (test=False, (test=False, date=2013-03-17, date=2013-03-16, rec_days=5, rec_days=5, IngestUserRecs exp_days=10, exp_days=10, (date=2013-03-15, test_users=False, test_users=False, test_users=False, force_updates=False, force_updates=False, ttl=604800) build_from_scratch=True, build_from_scratch=True, index_path=/spotify/discover/index, index_path=/spotify/discover/index, index_version=None, index_version=None, FOLLOWS_SCORE=5.0) FOLLOWS_SCORE=5.0) IngestUserRecs (date=2013-03-16, test_users=False, ttl=604800) IngestUserRecs (date=2013-03-17, test_users=False, ttl=604800)
  • 30. Error notifications Great for automated execution
  • 31. Process Synchronization Prevents two identical tasks from running simultaneously luigid Simple task synchronization Data flow 1 Data flow 2 Common dependency Task
  • 32. What we use Luigi for Not just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (non-hadoop) data processing Import/Export data to/from Postgres Insert data into Cassandra scp/rsync/ftp data files and reports Dump and load databases Others using it with Scala MapReduce and MRJob as well
  • 33. Luigi is open source http://github.com/spotify/luigi Originated at Spotify Based on many years of experience with data processing Recent contributions by Foursquare and Bitly Open source since September 2012
  • 34. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring – http://spotify.com/jobs Elias Freider freider@spotify.com