Ray: The alternative to distributed frameworks

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Chargement dans…3
×

Consultez-les par la suite

1 sur 74 Publicité

Plus De Contenu Connexe

Plus récents (20)

Publicité

  1. Ray: The alternative to distributed frameworks (李泓旻, Andrew Li)
  2. About me
     - Data Engineer @ Data Science & Technology, Cathay Financial Holdings
     - Former one-stop engineer for data science (manufacturing)
     - Former chemical engineer: polymer materials, genetic engineering, bacterial fermentation
     - D4SG (Data for Social Good) #4, winter 2018
     - First prize, Genius For Home competition, MediaTek, 2018
     - orcahmlee
  3. Source
  4. What will you get
  5. Why We Need
  6. "Four Reasons Why Leading Companies Are Betting On Ray" (Anyscale); "How Ray's ecosystem powers Spotify's ML scientists and engineers"
  7.
  8. What if We Could
  9.
  10. "Four Reasons Why Leading Companies Are Betting On Ray" (Anyscale); "How Ray's ecosystem powers Spotify's ML scientists and engineers"
  11. What is Ray?
  12. Ray
  13. Ray
  14. Ray Tune: Tuning with your favorite ML framework
  15. Ray Tune: Tuning with your favorite framework and more...
  16. Ray Tune: Tuning with your favorite framework
  17. Ray Tune: Tuning with your favorite framework
  18. Ray Tune: Tuning with your favorite framework
  19. Ray Tune: Tuning with your favorite framework
  20. Ray Tune: Tuning with your favorite framework
  21. Ray Tune: Tuning with your favorite framework
      search_optimization → Algorithm:
      - "random" → Random Search
      - "bayesian" → SkoptSearch
      - "hyperopt" → HyperOptSearch
      - "bohb" → TuneBOHB
      - "optuna" → Optuna
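The mapping above comes from tune-sklearn's `search_optimization` argument. Stripped of Ray and Tune, "random" search is just sampling configurations independently and keeping the best one; a minimal stdlib sketch (the objective, search space, and all names here are made up for illustration):

```python
import random

def objective(config):
    # Toy objective with its minimum at lr=0.1, batch_size=64 (made up for the sketch)
    return (config["lr"] - 0.1) ** 2 + (config["batch_size"] - 64) ** 2 / 1000

def random_search(space, num_samples, seed=0):
    """Sample configurations independently and keep the best-scoring one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        config = {name: rng.choice(values) for name, values in space.items()}
        trials.append((objective(config), config))
    return min(trials, key=lambda t: t[0])

space = {"lr": [0.001, 0.01, 0.1, 1.0], "batch_size": [16, 32, 64, 128]}
best_score, best_config = random_search(space, num_samples=20)
print(best_score, best_config)
```

The Bayesian-style searchers in the table replace the independent sampling step with a model that proposes promising configurations based on earlier trials; the trial loop stays the same.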
  22. Modin: A drop-in replacement for pandas
  23. Modin: A drop-in replacement for pandas
  24. Modin: A drop-in replacement for pandas
  25. Modin: Architecture
  26. pandas API coverage: Modin vs. Dask DataFrame vs. Koalas
  27. Modin vs. Dask DataFrame vs. Koalas
      - Dask DataFrame and Koalas: lazy execution; support row-oriented partitioning and parallelism
      - Modin: eager execution; supports row-, column-, and cell-oriented partitioning and parallelism
  28. Decomposition ("Flexible Rule-Based Decomposition and Metadata Independence in Modin: A Parallel Dataframe System")
  29. Modin vs. Dask DataFrame vs. Koalas
      - Dask DataFrame and Koalas: lazy execution; support row-oriented partitioning and parallelism
      - Modin: eager execution; supports row-, column-, and cell-oriented partitioning and parallelism
      - If an API is not supported yet, it is executed in the default-to-pandas mode
  30. Defaulting to pandas
  31. Supported APIs (Y = supported, D = defaults to pandas)
      - pd.DataFrame
        - Y: iloc, T, all, any, quantile, apply, applymap...
        - D: plot, to_parquet, to_pickle, to_json...
      - pd.Series
        - Y: iloc, T, all, any, quantile, apply, value_counts, to_frame...
        - D: plot, to_parquet, to_pickle, to_json...
      - pd.read_<file>
        - Y: read_csv, read_parquet...
        - D: read_pickle, read_html...
      - Utilities
        - Y: pd.concat, pd.unique, pd.get_dummies...
        - D: pd.cut, pd.to_datetime, pd.to_numeric...
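The "defaulting to pandas" behavior above — run the parallel implementation when one exists, otherwise warn and fall back to the serial one — can be sketched without Modin at all. This toy class is not Modin's internals; every name in it is hypothetical:

```python
import warnings

class ParallelFrame:
    """Toy stand-in for a distributed dataframe (hypothetical, not Modin's API)."""

    _PARALLEL_OPS = {"sum", "mean"}  # operations with a "distributed" implementation

    def __init__(self, data):
        self._data = list(data)

    def apply_op(self, op):
        if op in self._PARALLEL_OPS:
            # pretend this path runs partition-by-partition on a cluster
            total = sum(self._data)
            return total if op == "sum" else total / len(self._data)
        # not parallelized yet: warn and fall back to the serial implementation,
        # mirroring Modin's "default to pandas" mode
        warnings.warn(f"{op} is not parallelized yet; defaulting to the serial path")
        if op == "max":
            return max(self._data)
        raise NotImplementedError(op)

df = ParallelFrame([1, 2, 3, 4])
print(df.apply_op("sum"))  # parallel path: 10
print(df.apply_op("max"))  # serial fallback, with a warning: 4
```

The point of the pattern is API completeness: every pandas call keeps working on day one, and operations move to the fast path as they get parallel implementations.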
  32. Ray Core
  33. Ray
  34. Ray: Programming Model
  35. Programming model: Actor (stateful), Task (stateless); "fire and forget" (AIM-120 AMRAAM)
  36. Actor Model
      - What the Actor Model is and why to use it
      - Languages/frameworks implementing the Actor Model: Erlang, RabbitMQ, Akka
      - Super useful references:
        - https://blog.techbridge.cc/2019/06/21/actor-model-in-web/
        - [COSCUP 2011] Programming for the Future, Introduction to the Actor Model and Akka Framework
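The core of the actor model — private state plus a mailbox processed one message at a time — fits in a few lines of stdlib Python. This is a sketch of the idea only, not Erlang or Akka semantics (no supervision, no distribution); it also shows the "fire and forget" style of messaging mentioned above:

```python
import queue
import threading

class CounterActor:
    """Minimal actor: one thread drains a mailbox, so its state needs no locks."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # private state, touched only by the actor's own thread
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            message, reply = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            if reply is not None:
                reply.put(self._count)

    def tell(self, message):
        """Fire and forget: enqueue a message without waiting for a reply."""
        self._mailbox.put((message, None))

    def ask(self, message):
        """Enqueue a message and block until the actor replies."""
        reply = queue.Queue()
        self._mailbox.put((message, reply))
        return reply.get()

actor = CounterActor()
for _ in range(3):
    actor.tell("increment")      # fire and forget
result = actor.ask("increment")  # processed strictly after the three tells
actor.tell("stop")
print(result)  # 4
```

Because only the actor's own thread touches `_count`, there is no shared-memory race to guard against — that is exactly the property the actor model trades message-passing overhead for.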
  37. Function → Task
  38. Class → Actor
  39. Programming model (Ray: A Distributed Framework for Emerging AI Applications)
  40. Programming model (Ray: A Distributed Framework for Emerging AI Applications)
  41. Specifying Resources
  42. Ray: Architecture
  43. Architecture (Ray: A Distributed Framework for Emerging AI Applications)
  44. Architecture - Application Layer (Ray: A Distributed Framework for Emerging AI Applications)
  45. Architecture - System Layer
      The system layer consists of three major components:
      - Global Control Store (GCS)
      - Bottom-Up Distributed Scheduler
      - In-Memory Distributed Object Store
      (Ray: A Distributed Framework for Emerging AI Applications)
  46. Global Control Store (GCS)
  47. Global Control Store (Ray: A Distributed Framework for Emerging AI Applications)
  48. Global Control Store
      - Maintains fault tolerance and low latency
      - Enables every component in the system to be stateless
      - Key-value store with pub-sub functionality
      - < v1.11.0: uses Redis
      - >= v1.11.0: no longer starts Redis by default
      (Ray: A Distributed Framework for Emerging AI Applications)
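At its core, the GCS described above is a key-value store with pub-sub: components write cluster state, and other components subscribe to the keys they care about instead of polling. An in-process stdlib sketch of that combination (none of the GCS's real durability, sharding, or Redis machinery; the key names are made up):

```python
from collections import defaultdict

class KeyValuePubSub:
    """In-process key-value store that notifies subscribers on every write."""

    def __init__(self):
        self._store = {}
        self._subscribers = defaultdict(list)

    def subscribe(self, key, callback):
        # callback fires on every subsequent write to `key`
        self._subscribers[key].append(callback)

    def put(self, key, value):
        self._store[key] = value
        for callback in self._subscribers[key]:
            callback(key, value)

    def get(self, key):
        return self._store.get(key)

gcs = KeyValuePubSub()
seen = []
gcs.subscribe("actor:1", lambda key, value: seen.append(value))
gcs.put("actor:1", "ALIVE")  # e.g. an entry in the actor table
gcs.put("actor:1", "DEAD")
print(gcs.get("actor:1"), seen)  # DEAD ['ALIVE', 'DEAD']
```

Centralizing state this way is what lets the other components stay stateless: a restarted scheduler or worker can rebuild its view by reading the store and re-subscribing.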
  49. Global Control Store (< v1.11.0) (Redis in Ray: Past and future)
  50. Global Control Store (>= v1.11.0) (Redis in Ray: Past and future)
  51. Global Control Store
      - Maintains fault tolerance and low latency
      - Enables every component in the system to be stateless
      - Key-value store with pub-sub functionality
      - < v1.11.0: uses Redis
      - >= v1.11.0: no longer starts Redis by default
      (Ray: A Distributed Framework for Emerging AI Applications)
  52. Global Control Store: fault tolerance
      - Decouples the durable lineage storage from the other system components
      - Heartbeat table, Job table, Actor table
      (Ray: A Distributed Framework for Emerging AI Applications)
  53. Global Control Store: low latency
      - Centralized schedulers couple task scheduling and task dispatch (Dask, Spark, CIEL)
      - Involving the scheduler in each object transfer is prohibitively expensive
      - Ray stores each object's metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling
      (Ray: A Distributed Framework for Emerging AI Applications)
  54. Bottom-Up Distributed Scheduler
  55. Bottom-Up Distributed Scheduler (Ray: A Distributed Framework for Emerging AI Applications)
  56. Bottom-Up Distributed Scheduler
      Existing cluster computing frameworks:
      - Centralized schedulers provide locality but at latencies in the tens of ms (Spark, CIEL, Dryad)
      - Distributed schedulers can achieve high scale, but they either don't consider data locality (work stealing), assume tasks belong to independent jobs (Sparrow), or assume the computation graph is known (Canary)
      (Ray: A Distributed Framework for Emerging AI Applications)
  57. Bottom-Up Distributed Scheduler (Ray: A Distributed Framework for Emerging AI Applications)
  58. In-Memory Distributed Object Store
  59. In-Memory Distributed Object Store (Ray: A Distributed Framework for Emerging AI Applications)
  60. In-Memory Distributed Object Store
      - Plasma: a high-performance shared-memory object store
      - Plasma was initially developed as part of Ray and is now developed as part of Apache Arrow
      - On each node, Ray implements the object store via shared memory, which allows zero-copy data sharing between tasks running on the same node
      - Plasma holds immutable objects in shared memory
      (Ray: A Distributed Framework for Emerging AI Applications)
  61. In-Memory Distributed Object Store
      - To minimize task latency, Plasma is used to store the inputs and outputs of every task (stateless computation)
      - For low latency, Ray keeps objects entirely in memory and evicts them to disk as needed using an LRU policy
      - Small objects (< 100 KiB): stored in the in-process object store
      - Large objects: stored in the shared-memory object store
      (Ray: A Distributed Framework for Emerging AI Applications)
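The keep-in-memory-then-evict policy above is plain LRU with spilling. A stdlib sketch of the mechanism, not Ray's object store (the byte budget and object IDs are made up, and the restore-from-disk path is omitted):

```python
from collections import OrderedDict

class LRUObjectStore:
    """Holds objects in memory up to a byte budget; spills least recently used to 'disk'."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.memory = OrderedDict()  # object_id -> bytes, kept in LRU order
        self.disk = {}               # stand-in for external storage

    def put(self, object_id, blob):
        self.memory[object_id] = blob
        self.memory.move_to_end(object_id)  # newest is most recently used
        while sum(len(b) for b in self.memory.values()) > self.capacity:
            victim, data = self.memory.popitem(last=False)  # least recently used
            self.disk[victim] = data  # spill instead of dropping

    def get(self, object_id):
        if object_id in self.memory:
            self.memory.move_to_end(object_id)  # mark as recently used
            return self.memory[object_id]
        return self.disk[object_id]  # restore path omitted for brevity

store = LRUObjectStore(capacity_bytes=8)
store.put("a", b"xxxx")
store.put("b", b"xxxx")
store.get("a")           # touch "a" so "b" becomes the LRU entry
store.put("c", b"xxxx")  # over budget: spills "b" to disk
print(sorted(store.memory), sorted(store.disk))  # ['a', 'c'] ['b']
```

Immutability (as with Plasma's objects) is what makes this safe: a spilled copy can never go stale, so eviction never needs coordination with readers.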
  62. In-Memory Distributed Object Store
  63. In-Memory Distributed Object Store
  64. In-Memory Distributed Object Store: object spilling and persistence
      - Spills objects to external storage once the capacity of the object store is used up (v1.3+)
      - Two types of external storage are supported by default
      - For local storage, the OS would run out of inodes very quickly; if objects are smaller than 100 MB, Ray fuses them into a single file to avoid this problem
  65. In-Memory Distributed Object Store: fault tolerance
      - Ray recovers any needed objects through lineage re-execution; the lineage stored in the GCS tracks both stateless tasks and stateful actors during the initial execution
      (Ray: A Distributed Framework for Emerging AI Applications)
  66. Ray: Cluster Launcher
  67. Ray Cluster on GCP/AWS/Azure (nodes run as VMs)
  68. Ray Cluster on K8s (nodes run as pods)
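The cluster launcher drives the VM setup above from a single YAML file. A minimal sketch of such a config (the field values are placeholders, and the exact schema varies by Ray version and provider, so the docs should be checked before use):

```yaml
# cluster.yaml -- minimal cluster-launcher sketch (GCP shown; AWS/Azure are analogous)
cluster_name: demo
max_workers: 2            # autoscaler may add up to 2 worker VMs
provider:
  type: gcp
  region: us-west1
  project_id: my-project  # placeholder
auth:
  ssh_user: ubuntu
```

`ray up cluster.yaml` provisions the head node and lets the autoscaler add workers as load grows; `ray down cluster.yaml` tears the cluster down.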
  69. Ray: Handling Dependencies
  70. Handling Dependencies (Source)
  71. Handling Dependencies
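In recent Ray versions, dependency handling centers on runtime environments: each job, task, or actor declares what its workers need. A sketch of the `runtime_env` dictionary (the keys follow Ray's documented schema; the package list, paths, and variable names are placeholders), defined here without starting Ray:

```python
# A runtime_env declares what each worker process needs before running our code.
runtime_env = {
    "working_dir": "./project",              # local directory uploaded to the cluster
    "pip": ["requests", "pendulum==2.1.2"],  # installed into the worker environment
    "env_vars": {"MY_ENV_VAR": "my-value"},  # exported to worker processes
}

# It is passed at job, task, or actor granularity, e.g.:
#   import ray
#   ray.init(runtime_env=runtime_env)
print(sorted(runtime_env))
```

Scoping dependencies this way means two jobs on the same cluster can pin conflicting package versions without touching the nodes' base images.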
  72. RECAP
  73. Ray
  74. Thank you for your time
