
NECSTMondayTalk - 01/02/2021 - How easy can we make GPU scheduling?

GPUs are readily available in cloud computing and personal devices, but their use for data-processing acceleration has been held back by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming.
In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space-sharing, and transfer-computation overlap, without requiring any prior information about the program's dependency structure.
We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task parallelism, showing an average 44% speedup over synchronous execution, with no slowdown compared to hand-optimized host code written with the C++ CUDA Graphs API.



  1. Alberto Parravicini (alberto.parravicini@polimi.it)
  2. Some random Python code, nothing unusual, right?

         for i in range(10):
             flag = f1(x)
             f2(x, y)
             if flag:
                 f3(z)
             else:
                 f4(z)
             f5(x)

  3. The same code again, annotated:
     ● f2 and f3 (or f4) run concurrently
     ● f5 waits only for f2, not for f3 or f4
     It also works with R, JS, Scala, and more
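     To make that schedule explicit, the dependency DAG inferred for one loop iteration can be written as a plain Python adjacency map. This is an illustrative sketch only, not an API: it assumes f1 writes x, f2 reads x, the branch needs the flag produced by f1, and f5 reads what f2 wrote, as in the snippet above; the real runtime builds this structure internally.

         # Who each computation must wait for (illustrative names, not an API)
         dag = {
             "f1": [],            # first computation touching x
             "f2": ["f1"],        # reads x after f1 writes it
             "f3_or_f4": ["f1"],  # the branch needs the flag produced by f1
             "f5": ["f2"],        # waits only for f2, not for f3/f4
         }
         for comp, deps in dag.items():
             print(comp, "waits for:", deps or "nothing")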
  4. A minimal three-kernel example: f1(x) and f2(y) are independent; f3(x, y) depends on both.
  5. The same three-kernel computation, hand-written with the C++ CUDA Graphs API:

         cudaGraphCreate(&graph, 0);
         void *kernel_1_args[3] = {(void *)&x, (void *)&x1, &N};
         void *kernel_2_args[3] = {(void *)&y, (void *)&y1, &N};
         void *kernel_3_args[4] = {(void *)&x1, (void *)&y1, (void *)&res, &N};
         dim3 tb(block_size_1d);  // threads per block
         dim3 bs(num_blocks);     // blocks per grid

         kernel_1_params.func = (void *)f1;
         kernel_1_params.gridDim = bs;
         kernel_1_params.blockDim = tb;
         kernel_1_params.kernelParams = kernel_1_args;
         kernel_1_params.sharedMemBytes = 0;
         kernel_1_params.extra = NULL;
         cudaGraphAddKernelNode(&kernel_1, graph, nodeDependencies.data(),
                                nodeDependencies.size(), &kernel_1_params);

         kernel_2_params.func = (void *)f2;
         kernel_2_params.gridDim = bs;
         kernel_2_params.blockDim = tb;
         kernel_2_params.kernelParams = kernel_2_args;
         kernel_2_params.sharedMemBytes = 0;
         kernel_2_params.extra = NULL;
         cudaGraphAddKernelNode(&kernel_2, graph, nodeDependencies.data(),
                                nodeDependencies.size(), &kernel_2_params);

         // f3 must wait for f1 and f2
         nodeDependencies.push_back(kernel_1);
         nodeDependencies.push_back(kernel_2);

         kernel_3_params.func = (void *)f3;
         kernel_3_params.gridDim = bs;
         kernel_3_params.blockDim = tb;
         kernel_3_params.kernelParams = kernel_3_args;
         kernel_3_params.sharedMemBytes = 0;
         kernel_3_params.extra = NULL;
         cudaGraphAddKernelNode(&kernel_3, graph, nodeDependencies.data(),
                                nodeDependencies.size(), &kernel_3_params);

         cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
         cudaGraphLaunch(graphExec, s1);
         err = cudaStreamSynchronize(s1);
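     For contrast, under GrCUDA the same graph needs no explicit dependency or stream management. A minimal Python sketch, assuming KERNELS_SRC holds the CUDA source of f1, f2, and f3, and that the signature syntax follows the inc_kernel example on slide 18; all names and sizes here are illustrative:

         import polyglot

         cu = polyglot.eval(language='grcuda', string='CU')
         # Build the three kernels from their (assumed) CUDA source
         f1 = cu.buildkernel(KERNELS_SRC, 'f1(x: inout pointer float, N: sint32)')
         f2 = cu.buildkernel(KERNELS_SRC, 'f2(y: inout pointer float, N: sint32)')
         f3 = cu.buildkernel(KERNELS_SRC,
             'f3(x: inout pointer float, y: inout pointer float, res: inout pointer float, N: sint32)')

         N, num_blocks, block_size = 1000, 32, 256
         x = cu.DeviceArray('float', N)
         y = cu.DeviceArray('float', N)
         res = cu.DeviceArray('float', N)

         f1(num_blocks, block_size)(x, N)          # asynchronous launch
         f2(num_blocks, block_size)(y, N)          # shares no data with f1: may run concurrently
         f3(num_blocks, block_size)(x, y, res, N)  # reads x and y: ordered after f1 and f2
         print(res[0])                             # host access: blocks only here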
  6. Three takeaways from today:
     1. How we perform heterogeneous & asynchronous GPU scheduling
     2. We support many high-level languages, like Python, R, JS, etc.
     3. Same performance (if not better!) as low-level, cumbersome, hand-optimized CUDA code
  7. The work presented today is the result of the ongoing collaboration between NECSTLab @ Polimi & Oracle Labs
     ● This work will appear as “DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime” in IPDPS 2021. Preprint: https://arxiv.org/pdf/2012.09646.pdf
     ● It's also open-source! Go play with it: github.com/AlbertoParravicini/grcuda
     Big thanks to my co-authors Arnaud Delamare, Marco Arnaboldi, and Prof. Marco Santambrogio! Also thanks to the original authors and developers of GrCUDA, Rene Mueller and Lukas Stadler!
  8-11. We have a puzzle with 4 pieces! (one piece revealed per slide)
     ● Why do we care about GPUs?
     ● GraalVM & Polyglot Magic
     ● Enter GrCUDA!
     ● Our marvelous GPU scheduler
  12-14. GPUs were originally built for 3D graphics rendering
     ● But now they are everywhere: deep learning, engineering, image processing, finance, etc.
     They are still hard to use though, with 3 main issues (one revealed per slide):
     1. Programming GPUs is hard: it requires knowledge of the architecture and of the thread-based programming model. (Orthogonal to our work)
     2. They are difficult to integrate: lots of boilerplate host code, and the robust APIs are C++ only. (GrCUDA to the rescue!)
     3. The runtime is difficult to exploit: asynchronous execution, CPU+GPU cooperation, etc. (Today we focus on this!)
  15. GraalVM is a giant mega-project at Oracle Labs
     Main idea: a high-performance polyglot JVM runtime
     ● It supports many languages (Python, R, Scala, JavaScript, etc.), compiled to Java bytecode and run in a JVM
     ● All languages are interoperable, e.g. call JS code directly from your Java application (example from www.graalvm.org/docs/getting-started/#polyglot-capabilities-of-native-images):

         // PrettyPrintJSON.java
         import java.io.*;
         import java.util.stream.*;
         import org.graalvm.polyglot.*;

         public class PrettyPrintJSON {
             public static void main(String[] args) throws java.io.IOException {
                 BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
                 String input = reader.lines()
                     .collect(Collectors.joining(System.lineSeparator()));
                 try (Context context = Context.create("js")) {
                     Value parse = context.eval("js", "JSON.parse");
                     Value stringify = context.eval("js", "JSON.stringify");
                     Value result = stringify.execute(parse.execute(input), null, 2);
                     System.out.println(result.asString());
                 }
             }
         }
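     The same interoperability works in the other direction too; for instance, on GraalVM's Python you can evaluate JavaScript directly. A minimal sketch, assuming a GraalVM installation with both languages available:

         import polyglot

         # Evaluate a JavaScript snippet from Python; the result crosses the
         # language boundary as a regular polyglot value
         js_array = polyglot.eval(language='js', string='[1, 2, 3].map(x => x * 2)')
         print(js_array[0])  # 2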
  16. GraalVM provides 3 big advantages:
     1. All languages are built on top of the same backend, and benefit from the same optimizations
     2. It is polyglot: languages can easily cooperate
     3. It's easy to add new languages!
     All languages are mapped to the same intermediate representation, via the Truffle framework
     ● This (tree-like) IR is optimized, possibly partially or speculatively, and translated to Java bytecode and then to machine code
     ● It's really complex! But it lets you create new dynamic languages without worrying about optimizations!
  17. GrCUDA is a GraalVM-based DSL that exposes the CUDA API to Java, R, Python, JavaScript, etc.
     ● GPU acceleration for high-level languages through a unified backend
     GrCUDA provides many benefits:
     ● Simplified data transfer with Unified Memory
     ● Just-in-time CUDA kernel compilation
     ● Support for any CUDA kernel and library
  18. GPU KERNEL:

         __global__ void inc_kernel(int* x, int N) {
             for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < N; i += gridDim.x * blockDim.x) {
                 x[i] += 1;
             }
         }

     PYTHON:

         import polyglot
         cu = polyglot.eval(language='grcuda', string='CU')
         inc_kernel = cu.buildkernel(INC_KERNEL_STR, 'inc_kernel(x: inout pointer sint32, N: uint64)')
         device_array = cu.DeviceArray('int', 100)
         for i in range(len(device_array)):
             device_array[i] = i
         inc_kernel(32, 256)(device_array, len(device_array))

     R:

         cu <- eval.polyglot('grcuda', 'CU')
         inc_kernel <- cu$buildkernel(KERNEL_STR, '...')
         num_elements <- 100
         device_array <- cu$DeviceArray('int', num_elements)
         # (array init omitted)
         inc_kernel(32, 256)(device_array, num_elements)

     JAVASCRIPT:

         const cu = Polyglot.eval('grcuda', 'CU')
         const inc_kernel = cu.buildkernel(INC_KERNEL_STR, `inc_kernel(x: inout pointer sint32, N: sint32)`)
         const n = 100
         let deviceArray = cu.DeviceArray('int', n)
         for (let i = 0; i < n; i++) deviceArray[i] = i
         inc_kernel(32, 256)(deviceArray, n)
  19. Biggest limitation: no support for asynchronous execution
     ● Huge performance gains left on the table
     GPUs are great for parallel computing, but they also excel at multi-kernel asynchronous computations:
     1. Run concurrent GPU computations (space-sharing)
     2. Run GPU computations concurrently with the CPU
     3. Overlap data transfers with computations
     Extracting full performance from multi-kernel computations is hard:
     ● Synchronization events and data movement must be hand-optimized
     ● The full CUDA API is available only to C/C++
     Asynchronous execution provides an average 60% speedup on a Tesla P100
  20. Our goals:
     ● Extract every ounce of asynchronicity from GrCUDA
     ● Do it automatically, transparently to the user
     We represent GPU computations as vertices of a DAG, connected through data dependencies
     ● We schedule parallel computations and limit synchronizations
  21. Some frameworks deal with GPU scheduling, such as TensorFlow and Nvidia's CUDA Graphs. What's new here?
     1. It's fully transparent to the user: the GrCUDA API is not modified
     2. Dependencies are computed at runtime, not at compile time or eagerly
        ● GraalVM partial evaluation minimizes the runtime overheads (e.g. of repeated array accesses)
     3. Updates to the GrCUDA runtime are immediately available to every GraalVM language
        ● Instead of having a different library per language: PyCUDA, JCuda, GPU.js, etc.
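     As an illustration of point 2, runtime dependency computation can be sketched in a few lines of plain Python (a toy model, not the actual GrCUDA implementation): a new computation depends on every still-active computation with which it has a true data conflict, while read-only arguments never conflict with other reads:

         class Computation:
             def __init__(self, name, reads, writes):
                 self.name = name
                 self.reads = set(reads)    # arrays this kernel only reads
                 self.writes = set(writes)  # arrays this kernel writes
                 self.deps = []             # inferred DAG edges

         class ToyScheduler:
             def __init__(self):
                 self.active = []  # computations whose effects are still pending

             def submit(self, comp):
                 for prev in self.active:
                     # Conflict = read-after-write, write-after-read, or write-after-write
                     if (comp.reads | comp.writes) & prev.writes or comp.writes & prev.reads:
                         comp.deps.append(prev)
                 self.active.append(comp)
                 return comp

         sched = ToyScheduler()
         f1 = sched.submit(Computation("f1", reads=[], writes=["x"]))
         f2 = sched.submit(Computation("f2", reads=[], writes=["y"]))
         f3 = sched.submit(Computation("f3", reads=["x", "y"], writes=["res"]))
         print([d.name for d in f3.deps])  # ['f1', 'f2']; f1 and f2 stay independent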
  22. ● Computations (GPU kernels and CPU array accesses) are abstracted as DAG vertices
     ● Kernel invocation is asynchronous; CPU execution is blocked only when it needs results
     ● Computations executed from the host language are captured and added to the DAG
     ● Data-dependency computation is aware of read-only arguments and of finished computations
     ● Data can be prefetched for maximum performance, and kernels use multiple streams
     No user-defined dependencies in the scheduling!
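     Concretely, host-side array accesses are tracked just like kernels, and a host read of data produced by a pending kernel is the only blocking point. A hypothetical sketch, reusing cu and inc_kernel from the Python example on slide 18:

         x = cu.DeviceArray('int', 100)
         for i in range(100):
             x[i] = i                 # host writes: captured as computations on x
         inc_kernel(32, 256)(x, 100)  # async launch, ordered after the host writes
         first = x[0]                 # host read: blocks until inc_kernel is done
         print(first)                 # 1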
  23. ● Kernel invocations are wrapped into computational elements (1)
     ● The GrCUDA execution context computes data dependencies and updates the DAG (2, 3)
     ● The computation is assigned a CUDA stream, based on dependencies and stream availability (4)
     ● The execution context schedules the computation on the GPU (5, 6)
     ● Data prefetching and event synchronizations are non-blocking and asynchronous
     (The numbers refer to the architecture diagram on the slide; new components are highlighted in red there)
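     Step (4) can be approximated with a simple policy. The toy sketch below is our simplified reading, not the exact GrCUDA logic: reuse the stream of a parent computation when one exists (FIFO stream order then provides that synchronization for free), otherwise reuse an idle stream or create a new one:

         class ToyStreamAssigner:
             def __init__(self):
                 self.idle_streams = []  # streams with no pending work
                 self.num_streams = 0

             def assign(self, dep_streams):
                 if dep_streams:
                     return dep_streams[0]  # run on a parent's stream, after it
                 if self.idle_streams:
                     return self.idle_streams.pop()
                 self.num_streams += 1      # independent work: brand-new stream,
                 return self.num_streams    # enabling space-sharing

         assigner = ToyStreamAssigner()
         s1 = assigner.assign([])        # f1: no deps -> stream 1
         s2 = assigner.assign([])        # f2: no deps -> stream 2 (space-sharing)
         s3 = assigner.assign([s1, s2])  # f3: reuse stream 1; waiting for f2 on
                                         # stream 2 needs a cross-stream event
         print(s1, s2, s3)               # 1 2 1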
  24. 6 custom benchmarks to evaluate multi-task computations
     ● Tested on an Nvidia Tesla P100 (high-end data-center GPU) and on Nvidia GTX 1660 Super and GTX 960 (consumer-grade GPUs)
     ● Note: dependency DAGs are shown for clarity, but we never require the full DAGs!
  25-26. ● We are always faster than the original GrCUDA implementation, especially when using automatic prefetching
     ● We are not slower (and often faster) than the highly optimized CUDA Graphs, which requires manual dependencies
  27. Our scheduler exploits untapped GPU resources, with higher values for:
     ● Device memory throughput
     ● L2 cache utilization
     ● Instructions completed per clock (IPC)
     ● GFLOPS (single and double precision)
  28. ● Development of multi-GPU support has started (big thanks to Qi Zhou!)
        ● Scheduling is more complex: some benchmarks are faster (B&S, 1.8x), some are slower (VEC, 0.35x)
     ● Other possible directions:
        ● Applications on top of GrCUDA: e.g. sparse linear algebra, where GrCUDA transparently maintains multiple data layouts (CSC, CSR, etc.)
        ● Integration with DSLs: taking full advantage of asynchronous execution, and simplifying GPU code writing
  29. ● A new scheduler for GrCUDA, for transparent asynchronous execution
     ● 44% faster than synchronous execution
     ● Fully integrated with GraalVM, available for Python, R, Java, JavaScript, etc.
     ● Open source: github.com/AlbertoParravicini/grcuda
     ● Paper: arxiv.org/pdf/2012.09646.pdf
     Alberto Parravicini, alberto.parravicini@polimi.it, 2021-02-01
  30. Our goals, recapped:
     ● Extract every ounce of asynchronicity from GrCUDA
     ● Do it automatically, transparently to the user
     We represent GPU computations as vertices of a DAG, connected through data dependencies
     ● We schedule parallel computations and limit synchronizations
     Plenty of use cases:
     ● GPU graph/database querying (union of subqueries)
     ● Image processing pipelines (combining multiple filters)
     ● Ensembles of ML models: combining predictions from different models on the same data
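     For instance, the image-pipeline use case maps directly onto the scheduler: two filters that only read the same input can space-share the GPU, and the blending step is automatically ordered after both. A hypothetical sketch (kernel names such as blur, sharpen, and combine are illustrative, assumed to be built beforehand with cu.buildkernel):

         import polyglot

         cu = polyglot.eval(language='grcuda', string='CU')
         W, H = 1024, 1024
         blocks, threads = 256, 128
         img = cu.DeviceArray('float', W * H)
         out1 = cu.DeviceArray('float', W * H)
         out2 = cu.DeviceArray('float', W * H)
         result = cu.DeviceArray('float', W * H)

         blur(blocks, threads)(img, out1, W, H)     # both filters only read img...
         sharpen(blocks, threads)(img, out2, W, H)  # ...so they can run concurrently
         combine(blocks, threads)(out1, out2, result, W, H)  # waits for both filters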
