Distributing your pandas
ETL job using Ray and Modin
李泓旻(Andrew Li)
About me
- Data Engineer at DDT, Cathay Financial Holdings
- Former one-stop engineer for data science(Manufacturing)
- Former Chemical Engineer
- Polymer material, Genetic engineering, Bacterial fermentation
- First prize, Genius For Home competition, MediaTek, 2018
- D4SG (Data for Social Good) #4, winter 2018
- : orcahmlee
WHY
Downloads: 1.7B
Downloads per month: 92M
Ecosystem: 30+ packages
Two cases I want to handle in the real world
1. Many small datasets which share the same business logic
2. An out-of-core dataset without rewriting the ETL script
HOW
to handle
Many small datasets which share the same business logic
CASE I
You already know
how to use Ray!
Ray
Programming model
- Tasks
  - A task executes on a stateless worker
  - A future representing the result of the task is returned immediately
  - Futures can be passed to other remote functions
  - Idempotence
- Actors
  - An actor executes on a stateful worker
  - Each actor exposes methods that can be executed
  - An actor’s method execution is similar to a task
  - A handle to an actor can be passed to other actors or tasks
Ray: A Distributed Framework for Emerging AI Applications
[COSCUP 2011] Programming for the Future, Introduction to the Actor Model and Akka Framework
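The task half of the programming model can be sketched as follows. This is a minimal sketch that assumes Ray is installed; `clean_records`, `summarize`, and `run_with_ray` are hypothetical names made up for illustration, not code from the talk. The business logic stays a plain Python function, `ray.remote` wraps it, each `.remote()` call returns a future immediately, and those futures are passed straight into another remote function.

```python
def clean_records(records):
    """Hypothetical business logic shared by every small dataset."""
    return [r.strip().lower() for r in records if r.strip()]


def summarize(*cleaned):
    """Count the rows of each cleaned dataset."""
    return [len(rows) for rows in cleaned]


def run_with_ray(datasets):
    """Run the same logic on every dataset as Ray tasks."""
    import ray  # imported lazily so the plain functions work without Ray

    ray.init(ignore_reinit_error=True)
    remote_clean = ray.remote(clean_records)
    remote_summarize = ray.remote(summarize)
    # Each .remote() call returns a future (ObjectRef) immediately.
    futures = [remote_clean.remote(d) for d in datasets]
    # Futures passed as top-level arguments are resolved by Ray
    # before the downstream task starts.
    summary_future = remote_summarize.remote(*futures)
    return ray.get(summary_future)


# Local run of the same logic, no cluster involved.
datasets = [[" A ", "b", ""], ["C"], []]
local_summary = summarize(*[clean_records(d) for d in datasets])
print(local_summary)  # [2, 1, 0]
```

Calling `run_with_ray(datasets)` is expected to return the same list; only the place where the functions execute changes.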
Function & Class
Task & Actor
ray.init()
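The class-to-actor step can be sketched the same way. This is a minimal sketch that assumes Ray is installed; `RowCounter` and `run_as_actor` are hypothetical names for illustration. The more common spelling is the `@ray.remote` decorator; calling `ray.remote(...)` as a function does the same wrapping and keeps the original class usable without Ray.

```python
class RowCounter:
    """Plain Python class that keeps state between calls."""

    def __init__(self):
        self.total = 0

    def add(self, n_rows):
        self.total += n_rows
        return self.total


def run_as_actor(batches):
    """Turn the class into an actor and feed it batch sizes."""
    import ray  # imported lazily so RowCounter works without Ray

    ray.init(ignore_reinit_error=True)
    CounterActor = ray.remote(RowCounter)  # class -> actor class
    counter = CounterActor.remote()        # starts a stateful worker
    # Method calls return futures, just like tasks.
    futures = [counter.add.remote(len(batch)) for batch in batches]
    # The last future holds the running total after all batches.
    return ray.get(futures)[-1]


# The same class used locally, without Ray.
local_counter = RowCounter()
local_counter.add(2)
print(local_counter.add(3))  # 5
```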
Ray Core API
Something I want to share
- Components
  - Global Control Store (GCS)
  - Bottom-Up Distributed Scheduler
  - In-Memory Distributed Object Store
- Features
  - Handling Dependencies
Architecture
Global Control Store(GCS)
- Maintains the entire control state of the system
- Key-value store with pub-sub functionality
- Redis as storage (< v1.11.0)
  - v1.11.0+: no longer starts Redis by default
- Enables every component in the system to be stateless
- The primary reasons: fault tolerance and low latency
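The idea of a key-value store with pub-sub that lets other components stay stateless can be illustrated with a toy model. This is a sketch under stated assumptions, not Ray's GCS: `ToyControlStore` and `StatelessScheduler` are hypothetical classes that live in one process and only show the pattern.

```python
from collections import defaultdict


class ToyControlStore:
    """In-process key-value store with pub-sub; a toy, not Ray's GCS."""

    def __init__(self):
        self._data = {}
        self._subscribers = defaultdict(list)

    def subscribe(self, key, callback):
        self._subscribers[key].append(callback)

    def put(self, key, value):
        self._data[key] = value
        # Notify everyone who subscribed to this key.
        for callback in self._subscribers[key]:
            callback(key, value)

    def get(self, key, default=None):
        return self._data.get(key, default)


class StatelessScheduler:
    """Keeps no task state of its own; everything lives in the store."""

    def __init__(self, store):
        self.store = store

    def submit(self, task_id):
        self.store.put(("task", task_id), "PENDING")

    def finish(self, task_id):
        self.store.put(("task", task_id), "FINISHED")


store = ToyControlStore()
events = []
store.subscribe(("task", "t1"), lambda key, value: events.append(value))

StatelessScheduler(store).submit("t1")
# A replacement scheduler can continue the work, because the state
# is in the store rather than in the scheduler object.
StatelessScheduler(store).finish("t1")
print(events)  # ['PENDING', 'FINISHED']
```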
Fault tolerance
- Heartbeat table
- Job table
- Task table
- Actor table
- Decouples the durable lineage storage from other system components
Low latency
- Centralized schedulers couple task scheduling and task dispatch (Dask, Spark, CIEL)
- Involving the scheduler in each object transfer is prohibitively expensive
- Ray stores the object metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling
Bottom-Up Distributed Scheduler
In-Memory Distributed Object Store
- Plasma: A High-Performance Shared-Memory Object Store
  - Plasma was initially developed as part of Ray and is being developed as part of Apache Arrow (https://arrow.apache.org/docs/python/plasma.html)
- To minimize task latency, Ray has an in-memory distributed storage system to store the inputs and outputs of every task, or stateless computation.
- On each node, Ray implements the object store via shared memory. This allows zero-copy data sharing between tasks running on the same node.
- Spills objects to external storage once the capacity of the object store is used up (v1.3+)
- Two types of external storage supported by default:
  - Local storage, S3
- Ray recovers any needed objects through lineage re-execution. The lineage stored in the GCS tracks both stateless tasks and stateful actors during initial execution
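Spilling to local storage can be configured when Ray starts. The sketch below follows the pattern shown in Ray's object spilling documentation; `_system_config` is a semi-private option, so the exact keys may differ between Ray versions and should be checked against the installed release. The helper names and the `/tmp/spill` directory are arbitrary choices for illustration.

```python
import json


def spilling_system_config(directory_path):
    """Build the _system_config dict that enables filesystem spilling."""
    return {
        # Ray expects the spilling settings as a JSON string.
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": directory_path}}
        )
    }


def init_with_spilling(directory_path="/tmp/spill"):
    """Start Ray with spilling to a local directory (assumes Ray is installed)."""
    import ray

    ray.init(_system_config=spilling_system_config(directory_path))


config = spilling_system_config("/tmp/spill")
print(config["object_spilling_config"])
```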
Handling Dependencies
When your script runs on a distributed system, it may...
- need some specific environment variables
- import/depend on some Python packages
- read some files outside of the script
- fail with ModuleNotFoundError or FileNotFoundError when these are missing on a worker
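Ray's `runtime_env` option covers the three needs listed above. This is a minimal sketch that assumes Ray is installed; the environment variable name, the package list, and the helper names are hypothetical values chosen for illustration.

```python
def build_runtime_env():
    """Describe what every worker needs before it runs the script."""
    return {
        "env_vars": {"ETL_STAGE": "dev"},  # hypothetical environment variable
        "pip": ["pandas", "pyarrow"],      # packages installed on the workers
        "working_dir": ".",                # local files shipped to the cluster
    }


def init_with_dependencies():
    """Start Ray with the runtime environment (assumes Ray is installed)."""
    import ray

    ray.init(runtime_env=build_runtime_env())


runtime_env = build_runtime_env()
print(sorted(runtime_env))  # ['env_vars', 'pip', 'working_dir']
```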
Ray: Handling dependencies
An out-of-core dataset without rewriting the ETL script
HOW
to handle
CASE II
You already know
how to use Modin!
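In practice the change is usually a single import line: `import modin.pandas as pd` instead of `import pandas as pd`. The sketch below makes that swap explicit by passing the module in. It assumes pandas (and Modin for the distributed path) is installed; `load_dataframe_library`, `etl`, and the column names are hypothetical and only stand in for an existing ETL script.

```python
def load_dataframe_library(use_modin):
    """Return modin.pandas or pandas under the same name."""
    if use_modin:
        import modin.pandas as pd
    else:
        import pandas as pd
    return pd


def etl(pd, rows):
    """Hypothetical ETL step; the body is ordinary pandas code."""
    df = pd.DataFrame(rows)
    df = df[df["amount"] > 0]  # drop refunds
    return df.groupby("customer")["amount"].sum().to_dict()


sample_rows = [
    {"customer": "a", "amount": 2},
    {"customer": "a", "amount": 3},
    {"customer": "b", "amount": 1},
    {"customer": "b", "amount": -4},
]
# etl(load_dataframe_library(use_modin=False), sample_rows)  # plain pandas
# etl(load_dataframe_library(use_modin=True), sample_rows)   # Modin
```

The execution engine can be selected with the `MODIN_ENGINE` environment variable (for example `ray`) before Modin is imported.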
Architecture
Modin: Architecture
pandas API coverage
Modin vs. Dask DataFrame vs. Koalas
- Dask DataFrame and Koalas
  - Lazy execution
  - Support row-oriented partitioning and parallelism
- Modin
  - Eager execution
  - Supports row, column, and cell-oriented partitioning and parallelism
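The three partitioning schemes can be pictured with a toy table. This is a pure-Python sketch, not Modin's implementation; the function names are hypothetical and the table is assumed to be non-empty and rectangular.

```python
def row_partitions(table, n):
    """Split a list-of-rows table into up to n contiguous row blocks."""
    size = max(1, -(-len(table) // n))  # ceiling division
    return [table[i:i + size] for i in range(0, len(table), size)]


def column_partitions(table, n):
    """Split every row the same way, giving up to n column blocks."""
    width = len(table[0])
    size = max(1, -(-width // n))
    return [[row[j:j + size] for row in table] for j in range(0, width, size)]


def cell_partitions(table, n_rows, n_cols):
    """Split along both axes, giving a grid of blocks."""
    return [column_partitions(block, n_cols)
            for block in row_partitions(table, n_rows)]


# 4x4 table whose cell value encodes its position: row * 10 + column.
table = [[r * 10 + c for c in range(4)] for r in range(4)]

print(row_partitions(table, 2)[0])         # [[0, 1, 2, 3], [10, 11, 12, 13]]
print(column_partitions(table, 2)[0])      # [[0, 1], [10, 11], [20, 21], [30, 31]]
print(cell_partitions(table, 2, 2)[1][1])  # [[22, 23], [32, 33]]
```

Row blocks suit row-wise operations such as filters, column blocks suit per-column aggregations, and a grid of blocks lets both kinds of work be split further.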
Decomposition
Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System
- Modin
  - If an API is not supported yet, it is executed in default-to-pandas mode
default to pandas
Defaulting to pandas
Supported APIs
Y: implemented in Modin; D: defaults to pandas
- pd.DataFrame
  - Y: iloc, T, all, any, quantile, apply, applymap, ...
  - D: plot, to_parquet, to_pickle, to_json, ...
- pd.Series
  - Y: iloc, T, all, any, quantile, apply, value_counts, to_frame, ...
  - D: plot, to_parquet, to_pickle, to_json, ...
- pd.read_<file>
  - Y: read_csv, read_parquet, ...
  - D: read_pickle, read_html, ...
- Utilities
  - Y: pd.concat, pd.unique, pd.get_dummies, ...
  - D: pd.cut, pd.to_datetime, pd.to_numeric, ...
RECAP
Ray
- Distributed framework
- Low-level API
- fit-in-memory (w/ pandas)
- Rich native/3rd-party libraries
- Deploy on Cloud/K8s/...
Modin
- Scalable Dataframe
- High-level API
- out-of-memory
- High pandas API coverage
- Running on Ray/Dask/...
Thank you for your time
48

Contenu connexe

Tendances

Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...
Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...
Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...Kai Wähner
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton Araf Karsh Hamid
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Introduction: Intelligence Artificielle, Machine Learning et Deep Learning
Introduction: Intelligence Artificielle, Machine Learning et Deep LearningIntroduction: Intelligence Artificielle, Machine Learning et Deep Learning
Introduction: Intelligence Artificielle, Machine Learning et Deep LearningNcib Lotfi
 
Using Redis Streams To Build Event Driven Microservices And User Interface In...
Using Redis Streams To Build Event Driven Microservices And User Interface In...Using Redis Streams To Build Event Driven Microservices And User Interface In...
Using Redis Streams To Build Event Driven Microservices And User Interface In...Redis Labs
 
Introduction to GitHub Copilot
Introduction to GitHub CopilotIntroduction to GitHub Copilot
Introduction to GitHub CopilotAll Things Open
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesScyllaDB
 
Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Tal Bar-Zvi
 
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...Sonatype
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Azure DevOps for Developers
Azure DevOps for DevelopersAzure DevOps for Developers
Azure DevOps for DevelopersSarah Dutkiewicz
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Vadym Kazulkin
 

Tendances (20)

Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...
Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...
Streaming Machine Learning with Python, Jupyter, TensorFlow, Apache Kafka and...
 
CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton CI-CD Jenkins, GitHub Actions, Tekton
CI-CD Jenkins, GitHub Actions, Tekton
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Introduction: Intelligence Artificielle, Machine Learning et Deep Learning
Introduction: Intelligence Artificielle, Machine Learning et Deep LearningIntroduction: Intelligence Artificielle, Machine Learning et Deep Learning
Introduction: Intelligence Artificielle, Machine Learning et Deep Learning
 
Using Redis Streams To Build Event Driven Microservices And User Interface In...
Using Redis Streams To Build Event Driven Microservices And User Interface In...Using Redis Streams To Build Event Driven Microservices And User Interface In...
Using Redis Streams To Build Event Driven Microservices And User Interface In...
 
Introduction to GitHub Copilot
Introduction to GitHub CopilotIntroduction to GitHub Copilot
Introduction to GitHub Copilot
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
Cassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary DifferencesCassandra vs. ScyllaDB: Evolutionary Differences
Cassandra vs. ScyllaDB: Evolutionary Differences
 
Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019 Kusto (Azure Data Explorer) Training for R&D - January 2019
Kusto (Azure Data Explorer) Training for R&D - January 2019
 
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
DevOps and Continuous Delivery Reference Architectures (including Nexus and o...
 
Machine-learning-FR.pdf
Machine-learning-FR.pdfMachine-learning-FR.pdf
Machine-learning-FR.pdf
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Terraform Basics
Terraform BasicsTerraform Basics
Terraform Basics
 
Azure DevOps for Developers
Azure DevOps for DevelopersAzure DevOps for Developers
Azure DevOps for Developers
 
Introduction to docker
Introduction to dockerIntroduction to docker
Introduction to docker
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Hadoop
HadoopHadoop
Hadoop
 
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
Github Copilot vs Amazon CodeWhisperer for Java developers at JCON 2023
 

Similaire à Distributing your pandas ETL job using Modin and Ray.pdf

Ray The alternative to distributed frameworks.pdf
Ray The alternative to distributed frameworks.pdfRay The alternative to distributed frameworks.pdf
Ray The alternative to distributed frameworks.pdfAndrew Li
 
Maria_Colgan_2.pdf
Maria_Colgan_2.pdfMaria_Colgan_2.pdf
Maria_Colgan_2.pdfLucky Ally
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareESUG
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLDESMOND YUEN
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Hirofumi Iwasaki
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Pi Day 2022 - from IoT to MySQL HeatWave Database Service
Pi Day 2022 -  from IoT to MySQL HeatWave Database ServicePi Day 2022 -  from IoT to MySQL HeatWave Database Service
Pi Day 2022 - from IoT to MySQL HeatWave Database ServiceFrederic Descamps
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5RojaT4
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Performance advantages of Hadoop ETL offload with the Intel processor-powered...
Performance advantages of Hadoop ETL offload with the Intel processor-powered...Performance advantages of Hadoop ETL offload with the Intel processor-powered...
Performance advantages of Hadoop ETL offload with the Intel processor-powered...Principled Technologies
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018AlanCaldera
 
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo
 

Similaire à Distributing your pandas ETL job using Modin and Ray.pdf (20)

Ray The alternative to distributed frameworks.pdf
Ray The alternative to distributed frameworks.pdfRay The alternative to distributed frameworks.pdf
Ray The alternative to distributed frameworks.pdf
 
Maria_Colgan_2.pdf
Maria_Colgan_2.pdfMaria_Colgan_2.pdf
Maria_Colgan_2.pdf
 
Madeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable HardwareMadeo - a CAD Tool for reconfigurable Hardware
Madeo - a CAD Tool for reconfigurable Hardware
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
Java EE 7 with Apache Spark for the World’s Largest Credit Card Core Systems ...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Pi Day 2022 - from IoT to MySQL HeatWave Database Service
Pi Day 2022 -  from IoT to MySQL HeatWave Database ServicePi Day 2022 -  from IoT to MySQL HeatWave Database Service
Pi Day 2022 - from IoT to MySQL HeatWave Database Service
 
Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
Hadoop mapreduce and yarn frame work- unit5
Hadoop mapreduce and yarn frame work-  unit5Hadoop mapreduce and yarn frame work-  unit5
Hadoop mapreduce and yarn frame work- unit5
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Performance advantages of Hadoop ETL offload with the Intel processor-powered...
Performance advantages of Hadoop ETL offload with the Intel processor-powered...Performance advantages of Hadoop ETL offload with the Intel processor-powered...
Performance advantages of Hadoop ETL offload with the Intel processor-powered...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018
 
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
 

Dernier

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Dernier (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024

Distributing your pandas ETL job using Modin and Ray.pdf

  • 1. Distributing your pandas ETL job using Ray and Modin 李泓旻(Andrew Li)
  • 2. About me
      - Data Engineer at DDT, Cathay Financial Holdings
      - Former one-stop engineer for data science (Manufacturing)
      - Former Chemical Engineer
      - Polymer material, Genetic engineering, Bacterial fermentation
      - First prize, Genius For Home competition, MediaTek, 2018
      - D4SG (Data for Social Good) #4, winter 2018
      - : orcahmlee
  • 4. Downloads: 1.7B; downloads per month: 92M; ecosystem: 30+ packages
  • 8. Two cases I want to handle in the real world
      1. Many small datasets which share the same business logic
      2. An out-of-core dataset without rewriting the ETL script
  • 9. How to handle: many small datasets which share the same business logic
  • 15. You already know how to use Ray!
  • 18. Programming model
      - Tasks
          - A task executes on a stateless worker
          - A future representing the result of the task is returned immediately
          - Futures can be passed to other remote functions
          - Idempotence
      - Actors
          - An actor executes on a stateful worker
          - Each actor exposes methods that can be executed
          - An actor's method execution is similar to a task
          - A handle to an actor can be passed to other actors or tasks
      - Sources
          - Ray: A Distributed Framework for Emerging AI Applications
          - [COSCUP 2011] Programming for the Future, Introduction to the Actor Model and Akka Framework
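The task and actor bullets above can be sketched as follows. This is a minimal illustration, not code from the talk: `clean_record`, `RunningTotal`, and `run_on_ray` are hypothetical names, and the record fields are assumptions. The business logic is plain Python; `ray.remote(...)` turns the function into a task and the class into an actor. Ray is imported lazily inside `run_on_ray`, so the plain logic can be exercised without Ray installed.

```python
def clean_record(record: dict) -> dict:
    """Stateless business logic: normalise one record (hypothetical fields)."""
    return {
        "name": record["name"].strip().lower(),
        "amount": float(record["amount"]),
    }


class RunningTotal:
    """Stateful logic: accumulates amounts across calls."""

    def __init__(self) -> None:
        self.total = 0.0

    def add(self, record: dict) -> float:
        self.total += record["amount"]
        return self.total

    def value(self) -> float:
        return self.total


def run_on_ray(records: list[dict]) -> float:
    """Run the same logic as Ray tasks plus one actor (assumes Ray is installed)."""
    import ray  # lazy import: the plain-Python logic above does not need Ray

    ray.init(ignore_reinit_error=True)
    clean_task = ray.remote(clean_record)   # function -> task (stateless worker)
    total_actor_cls = ray.remote(RunningTotal)  # class -> actor (stateful worker)

    actor = total_actor_cls.remote()
    # Each .remote() call returns a future (ObjectRef) immediately.
    cleaned = [clean_task.remote(r) for r in records]
    # Futures are passed directly to another remote call; Ray resolves them.
    pending = [actor.add.remote(ref) for ref in cleaned]
    ray.get(pending)  # wait until every add() has been applied
    return ray.get(actor.value.remote())
```

Calling `run_on_ray([{"name": " Alice ", "amount": "3.5"}])` on a machine with Ray installed would start a local Ray instance and return the accumulated total. Keeping the logic in ordinary functions and classes also makes it easy to unit-test before distributing it.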
  • 22. Something I want to share
      - Components
          - Global Control Store (GCS)
          - Bottom-Up Distributed Scheduler
          - In-Memory Distributed Object Store
      - Features
          - Handling Dependencies
  • 23. Architecture
  • 24. Global Control Store (GCS)
  • 25. Global Control Store (GCS)
      - Maintains the entire control state of the system
      - Key-value store with pub-sub functionality
      - Redis as storage (< v1.11.0)
      - v1.11.0+: no longer starts Redis by default
      - Enables every component in the system to be stateless
      - Primary motivations: fault tolerance and low latency
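To make "key-value store with pub-sub functionality" concrete, here is a toy in-process model. It is only a sketch of the idea and not how Ray implements the GCS; `ToyControlStore` and the `task:42` key are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Callable

Callback = Callable[[str, Any], None]


class ToyControlStore:
    """Toy model of a control store: a dict plus per-key subscribers."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}
        self._subscribers: dict[str, list[Callback]] = defaultdict(list)

    def subscribe(self, key: str, callback: Callback) -> None:
        """Register a callback that fires whenever `key` is written."""
        self._subscribers[key].append(callback)

    def put(self, key: str, value: Any) -> None:
        """Store the value, then notify every subscriber of that key."""
        self._data[key] = value
        for callback in self._subscribers.get(key, ()):
            callback(key, value)

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)


# A component keeps no state of its own: it reacts to updates from the store.
store = ToyControlStore()
events: list[tuple[str, Any]] = []
store.subscribe("task:42", lambda key, value: events.append((key, value)))
store.put("task:42", "RUNNING")
store.put("task:42", "FINISHED")
```

Because all state lives in the store, a component that crashes can be restarted and re-read what it needs, which is the property the slide describes as making components stateless.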
  • 26. Global Control Store (GCS): fault tolerance
      - Heartbeat table
      - Job table
      - Task table
      - Actor table
      - Decouple the durable lineage storage from other system components
  • 27. Global Control Store (GCS): low latency
      - Centralized schedulers couple task scheduling and task dispatch (Dask, Spark, CIEL)
      - Involving the scheduler in each object transfer is prohibitively expensive
      - Ray stores the object metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling
  • 28. Bottom-Up Distributed Scheduler
  • 29. Bottom-Up Distributed Scheduler
  • 30. In-Memory Distributed Object Store
  • 31. In-Memory Distributed Object Store
      - Plasma: A High-Performance Shared-Memory Object Store
      - Plasma was initially developed as part of Ray and was later developed as part of Apache Arrow (https://arrow.apache.org/docs/python/plasma.html)
      - To minimize task latency, Ray has an in-memory distributed storage system to store the inputs and outputs of every task, or stateless computation
      - On each node, Ray implements the object store via shared memory. This allows zero-copy data sharing between tasks running on the same node
  • 32. In-Memory Distributed Object Store
      - Spills objects to external storage once the capacity of the object store is used up (v1.3+)
      - Two types of external storage supported by default: local storage and S3
      - Ray recovers any needed objects through lineage re-execution. The lineage stored in the GCS tracks both stateless tasks and stateful actors during initial execution
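The spilling behaviour described above can be illustrated with a toy store. This is a simplified model under stated assumptions, not Ray's object store: `ToySpillingStore` is a hypothetical class, capacity is counted in objects rather than bytes, the oldest object is spilled first, and spilled objects are read straight from disk instead of being restored into shared memory.

```python
import pickle
from pathlib import Path
from typing import Any


class ToySpillingStore:
    """Keeps at most `capacity` objects in memory; spills the oldest to disk."""

    def __init__(self, capacity: int, spill_dir: str) -> None:
        self.capacity = capacity
        self.spill_dir = Path(spill_dir)
        self._memory: dict[str, Any] = {}    # dicts preserve insertion order
        self._spilled: dict[str, Path] = {}  # object id -> file on local storage

    def put(self, object_id: str, value: Any) -> None:
        self._spilled.pop(object_id, None)   # a fresh value supersedes a spilled one
        self._memory[object_id] = value
        while len(self._memory) > self.capacity:
            oldest_id = next(iter(self._memory))
            path = self.spill_dir / f"{oldest_id}.pkl"
            path.write_bytes(pickle.dumps(self._memory.pop(oldest_id)))
            self._spilled[oldest_id] = path

    def get(self, object_id: str) -> Any:
        if object_id in self._memory:
            return self._memory[object_id]
        return pickle.loads(self._spilled[object_id].read_bytes())

    def spilled_ids(self) -> list[str]:
        return list(self._spilled)
```

With `capacity=2`, storing a third object pushes the first one to a file under `spill_dir`, and a later `get` for that id still returns the original value.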
  • 33. Handling Dependencies
      - When your script runs on the distributed system, it may:
          - need some specific environment variables
          - import/depend on some Python packages
          - read some files outside of the script
      - Typical failures: ModuleNotFoundError, FileNotFoundError
      - Source: Ray: Handling dependencies
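One way to address the three needs above is Ray's runtime environment, passed to `ray.init` as a dictionary. The sketch below is an assumption about how the talk's advice could be applied, not the speaker's code: `build_runtime_env` and `connect` are hypothetical helpers, while `working_dir`, `pip`, and `env_vars` are runtime environment fields in recent Ray versions.

```python
def build_runtime_env(
    working_dir: str, packages: list[str], env_vars: dict[str, str]
) -> dict:
    """Describe what every worker needs before it can run the script."""
    return {
        "working_dir": working_dir,  # local files shipped to the workers
        "pip": list(packages),       # Python packages installed on the workers
        "env_vars": dict(env_vars),  # environment variables set on the workers
    }


def connect(runtime_env: dict) -> None:
    """Start or connect to Ray with the given runtime environment."""
    import ray  # lazy import: building the dict does not require Ray

    ray.init(runtime_env=runtime_env)


# Hypothetical values for illustration only.
env = build_runtime_env(
    working_dir=".",
    packages=["pandas"],
    env_vars={"ETL_STAGE": "dev"},
)
```

Calling `connect(env)` on a machine with Ray installed would ship the working directory and install the listed packages on the workers, which targets the `ModuleNotFoundError` and `FileNotFoundError` cases named on the slide.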
  • 35. How to handle: an out-of-core dataset without rewriting the ETL script
  • 38. You already know how to use Modin!
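The point of this slide is that Modin exposes the pandas API, so an ETL script usually needs only a different import. A minimal sketch follows, assuming a CSV with hypothetical `month` and `amount` columns; the helper names are made up, while `modin.pandas` and the `MODIN_ENGINE` environment variable are Modin's documented entry points.

```python
import importlib
import os


def configure_modin_engine(engine: str = "ray") -> None:
    """Select Modin's execution engine; set this before importing modin.pandas."""
    os.environ["MODIN_ENGINE"] = engine


def dataframe_module_name(use_modin: bool) -> str:
    return "modin.pandas" if use_modin else "pandas"


def load_dataframe_module(use_modin: bool):
    """Return pandas or modin.pandas; both expose the same API surface."""
    return importlib.import_module(dataframe_module_name(use_modin))


def monthly_totals(pd, csv_path: str):
    """ETL step written once against the pandas API (hypothetical columns)."""
    df = pd.read_csv(csv_path)
    df = df.dropna(subset=["amount"])
    return df.groupby("month")["amount"].sum()
```

In a real script the switch is typically just replacing `import pandas as pd` with `import modin.pandas as pd`; passing the module as a parameter here only makes that swap explicit.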
  • 40. pandas API coverage (Source: Modin vs. Dask DataFrame vs. Koalas)
  • 42. Decomposition (Source: Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System)
  • 43. Modin vs. Dask DataFrame vs. Koalas
      - Dask DataFrame and Koalas
          - Lazy execution
          - Support row-oriented partitioning and parallelism
      - Modin
          - Eager execution
          - Supports row, column, and cell-oriented partitioning and parallelism
          - If an API is not supported yet, it is executed in default-to-pandas mode
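The three partitioning schemes named above can be pictured on a small table stored as a list of rows. This is a toy model on nested lists, not Modin's internals; the function names are hypothetical, and "cell-oriented" is modelled here as rectangular blocks.

```python
def _chunks(seq: list, n: int) -> list[list]:
    """Split seq into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(seq) // n))  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def row_partitions(table: list[list], n: int) -> list[list[list]]:
    """Each partition holds whole rows."""
    return _chunks(table, n)


def column_partitions(table: list[list], n: int) -> list[list[list]]:
    """Each partition holds whole columns."""
    n_cols = len(table[0]) if table else 0
    return [
        [[row[c] for c in cols] for row in table]
        for cols in _chunks(list(range(n_cols)), n)
    ]


def block_partitions(
    table: list[list], n_row: int, n_col: int
) -> list[list[list[list]]]:
    """Grid of blocks: split by rows first, then split each row block by columns."""
    return [column_partitions(rows, n_col) for rows in row_partitions(table, n_row)]


table = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
blocks = block_partitions(table, 2, 2)
```

Row partitions suit row-wise operations such as filters, column partitions suit per-column aggregations, and block partitions allow parallelism along both axes at once.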
  • 45. Supported APIs (Y: supported; D: defaults to pandas)
      - pd.DataFrame
          - Y: iloc, T, all, any, quantile, apply, applymap, ...
          - D: plot, to_parquet, to_pickle, to_json, ...
      - pd.Series
          - Y: iloc, T, all, any, quantile, apply, value_counts, to_frame, ...
          - D: plot, to_parquet, to_pickle, to_json, ...
      - pd.read_<file>
          - Y: read_csv, read_parquet, ...
          - D: read_pickle, read_html, ...
      - Utilities
          - Y: pd.concat, pd.unique, pd.get_dummies, ...
          - D: pd.cut, pd.to_datetime, pd.to_numeric, ...
  • 47. Ray and Modin
      - Ray
          - Distributed framework
          - Low-level API
          - Fit-in-memory (w/ pandas)
          - Rich native/3rd-party libraries
          - Deploy on Cloud/K8s/...
      - Modin
          - Scalable DataFrame
          - High-level API
          - Out-of-memory
          - High pandas API coverage
          - Running on Ray/Dask/...
  • 48. Thank you for your time