2. About me
- Data Engineer at DDT, Cathay Financial Holdings
- Former one-stop engineer for data science (manufacturing)
- Former chemical engineer
  - Polymer materials, genetic engineering, bacterial fermentation
- First prize, Genius For Home competition, MediaTek, 2018
- D4SG (Data for Social Good) #4, winter 2018
- : orcahmlee
8. Two cases I want to handle in the real world
1. Many small datasets that share the same business logic
2. An out-of-core dataset, handled without rewriting the ETL script
18. Programming model
- Tasks
  - A task executes on a stateless worker
  - A future representing the result of the task is returned immediately
  - Futures can be passed to other remote functions
  - Idempotent: a task's output depends only on its inputs
- Actors
  - An actor executes on a stateful worker
  - Each actor exposes methods that can be executed remotely
  - Executing an actor method is similar to running a task: the call returns a future
  - A handle to an actor can be passed to other actors or tasks
Ray: A Distributed Framework for Emerging AI Applications
[COSCUP 2011] Programming for the Future, Introduction to the Actor Model and Akka Framework
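The task and actor model on this slide can be illustrated on a single machine. The following is a minimal sketch using only Python's standard library; it is not Ray's API. `submit_task`, `CounterActor`, `square`, and `add` are hypothetical names chosen for the illustration, and threads stand in for remote workers.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A pool of interchangeable workers stands in for stateless task workers.
_task_pool = ThreadPoolExecutor(max_workers=4)


def submit_task(fn, *args):
    """Schedule fn and return a future immediately.

    Arguments that are futures are resolved on the worker, so a future
    can be passed straight into another task.
    """
    def run():
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return fn(*resolved)

    return _task_pool.submit(run)


class CounterActor:
    """State lives on one dedicated worker; method calls run one at a time."""

    def __init__(self):
        self._worker = ThreadPoolExecutor(max_workers=1)  # the stateful worker
        self._count = 0

    def _add(self, n):
        self._count += n
        return self._count

    def add(self, n):
        # Like a task, a method call returns a future immediately.
        return self._worker.submit(self._add, n)


def square(x):
    return x * x


def add(x, y):
    return x + y


a = submit_task(square, 3)   # returns a future right away
b = submit_task(add, a, 1)   # the future is passed to another task
task_result = b.result()

counter = CounterActor()
futures = [counter.add(i) for i in (1, 2, 3)]
actor_results = [f.result() for f in futures]
```

The two tasks keep no state between calls, so they could run on any worker. The actor's running total exists only on its own worker, which is why its calls are serialized there.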
22. Something I want to share
- Components
  - Global Control Store (GCS)
  - Bottom-Up Distributed Scheduler
  - In-Memory Distributed Object Store
- Features
  - Handling Dependencies
25. Global Control Store (GCS)
- Maintains the entire control state of the system
- Key-value store with pub-sub functionality
- Uses Redis as storage (< v1.11.0)
  - v1.11.0+: Redis is no longer started by default
- Enables every component in the system to be stateless
- Primary motivations: fault tolerance and low latency
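To make "key-value store with pub-sub functionality" concrete, here is a minimal in-memory sketch. It is an illustration only, not Ray's GCS implementation: `TinyControlStore`, the `task:42` key, and the status strings are hypothetical.

```python
from collections import defaultdict


class TinyControlStore:
    """In-memory key-value store that notifies subscribers on every write."""

    def __init__(self):
        self._data = {}
        self._subscribers = defaultdict(list)

    def subscribe(self, key, callback):
        # Register a callback that fires whenever `key` is written.
        self._subscribers[key].append(callback)

    def put(self, key, value):
        self._data[key] = value
        for callback in self._subscribers[key]:
            callback(key, value)

    def get(self, key, default=None):
        return self._data.get(key, default)


store = TinyControlStore()
events = []

# A component subscribes to state changes instead of keeping its own copy.
store.subscribe("task:42", lambda key, value: events.append((key, value)))
store.put("task:42", "SCHEDULED")
store.put("task:42", "FINISHED")
```

Because the state lives in the store and changes are pushed to subscribers, the other components can stay stateless and be restarted without losing control state.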
26. Global Control Store (GCS)
Fault tolerance
- Heartbeat table
- Job table
- Task table
- Actor table
- Decouples the durable lineage storage from the other system components
27. Global Control Store (GCS)
Low latency
- Centralized schedulers couple task scheduling and task dispatch (Dask, Spark, CIEL)
- Involving the scheduler in each object transfer is prohibitively expensive
- Ray stores object metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling
31. In-Memory Distributed Object Store
- Plasma: A High-Performance Shared-Memory Object Store
  - Plasma was initially developed as part of Ray and later developed as part of Apache Arrow (https://arrow.apache.org/docs/python/plasma.html)
- To minimize task latency, Ray uses an in-memory distributed storage system to store the inputs and outputs of every task (stateless computation)
- On each node, Ray implements the object store via shared memory. This allows zero-copy data sharing between tasks running on the same node
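The idea of zero-copy sharing through shared memory can be tried with Python's standard library. This is a rough sketch of the mechanism, not Plasma itself: both handles are opened in the same process for brevity, whereas in practice the reader would be a different process on the same node.

```python
from multiprocessing import shared_memory

payload = b"ray object store"

# "Producer": put the object into a named shared-memory segment.
segment = shared_memory.SharedMemory(create=True, size=len(payload))
segment.buf[:len(payload)] = payload

# "Consumer": attach to the same segment by name; no bytes are copied.
reader = shared_memory.SharedMemory(name=segment.name)
view = reader.buf[:len(payload)]   # memoryview over the shared pages
first_word = bytes(view[:3])       # a copy is made only when asked for

# A write through one handle is visible through the other.
segment.buf[0] = ord("R")
seen_by_reader = bytes(view[:3])

# Release views before closing, then free the segment.
view.release()
reader.close()
segment.close()
segment.unlink()
```

The reader never receives a serialized copy of the payload; it only maps the same memory, which is why sharing between tasks on one node is cheap.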
32. In-Memory Distributed Object Store
- Objects are spilled to external storage once the capacity of the object store is used up (v1.3+)
  - Two types of external storage are supported by default: local storage and S3
- Ray recovers any needed objects through lineage re-execution. The lineage stored in the GCS tracks both stateless tasks and stateful actors during initial execution
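Lineage re-execution can be sketched in a few lines. The example below is a simplified illustration, not Ray's mechanism: it covers stateless tasks only (no actors), and `LineageStore` with its `run`, `evict`, and `get` methods is hypothetical.

```python
class LineageStore:
    """Keeps objects in memory plus the recipe (lineage) that produced them."""

    def __init__(self):
        self._objects = {}   # object id -> value (may be lost)
        self._lineage = {}   # object id -> (function, ids of its inputs)

    def run(self, object_id, fn, *input_ids):
        # Record how the object is produced, then produce it.
        self._lineage[object_id] = (fn, input_ids)
        self._objects[object_id] = fn(*(self.get(i) for i in input_ids))
        return object_id

    def evict(self, object_id):
        # Simulate losing an object, for example after a node failure.
        self._objects.pop(object_id, None)

    def get(self, object_id):
        if object_id not in self._objects:
            fn, input_ids = self._lineage[object_id]
            # Re-execute the task, recursively rebuilding lost inputs.
            self._objects[object_id] = fn(*(self.get(i) for i in input_ids))
        return self._objects[object_id]


store = LineageStore()
store.run("raw", lambda: [1, 2, 3])
store.run("doubled", lambda xs: [2 * x for x in xs], "raw")

store.evict("raw")
store.evict("doubled")
recovered = store.get("doubled")   # rebuilt from lineage
```

Recovery works here only because the tasks are deterministic functions of their inputs, which is the property the programming model asks of tasks.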
33. Handling Dependencies
When your script runs on the distributed system, it may:
- need specific environment variables
- import or depend on Python packages
- read files outside of the script
Missing dependencies on the workers surface as ModuleNotFoundError or FileNotFoundError.
Ray: Handling dependencies
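In Ray these three kinds of dependency map onto fields of a runtime environment. The fragment below is a sketch: the `env_vars`, `pip`, and `working_dir` fields come from Ray's runtime environment feature, while the values (`APP_ENV`, `requests`, the current directory) are placeholders. The `ray.init` call is left as a comment so the fragment runs without a cluster.

```python
# Describe what the workers need; all values here are placeholders.
runtime_env = {
    "env_vars": {"APP_ENV": "staging"},   # environment variables
    "pip": ["requests"],                  # Python packages to install
    "working_dir": ".",                   # local files to ship to workers
}

# With Ray installed, the environment would be passed at start-up:
#   import ray
#   ray.init(runtime_env=runtime_env)
```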
43. Modin vs. Dask DataFrame vs. Koalas
- Dask DataFrame and Koalas
  - Lazy execution
  - Support row-oriented partitioning and parallelism
- Modin
  - Eager execution
  - Supports row-, column-, and cell-oriented partitioning and parallelism
  - If an API is not supported yet, it is executed in default-to-pandas mode
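The difference between eager and lazy execution can be shown with a toy example. This sketch uses none of the three libraries: `EagerColumn` and `LazyColumn` are hypothetical classes that only mimic when the work happens.

```python
class EagerColumn:
    """Every operation runs immediately (the eager style)."""

    def __init__(self, values):
        self.values = list(values)

    def map(self, fn):
        return EagerColumn(fn(v) for v in self.values)


class LazyColumn:
    """Operations are only recorded; work happens in compute() (the lazy style)."""

    def __init__(self, values, pending=()):
        self._values = list(values)
        self._pending = tuple(pending)

    def map(self, fn):
        # Nothing is computed here; the function is appended to the plan.
        return LazyColumn(self._values, self._pending + (fn,))

    def compute(self):
        out = self._values
        for fn in self._pending:
            out = [fn(v) for v in out]
        return out


calls = []


def double(v):
    calls.append(v)   # record each invocation so the timing is observable
    return 2 * v


eager = EagerColumn([1, 2, 3]).map(double)
calls_after_eager = len(calls)   # the work has already been done

lazy = LazyColumn([1, 2, 3]).map(double)
calls_after_lazy = len(calls)    # unchanged: nothing has run yet
lazy_result = lazy.compute()     # the recorded plan runs now
```

Eager execution keeps existing pandas-style scripts working line by line, while lazy execution requires an explicit step to materialize results.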