GPUDirect RDMA and Green
Multi-GPU Architectures

GE Intelligent Platforms
Mil/Aero Embedded Computing



Dustin Franklin, GPGPU Applications Engineer
 dustin.franklin@ge.com
 443.310.9812 (Washington, DC)
What this talk is about
 • GPU Autonomy
 • GFLOPS/watt and SWaP
 • Project Denver
 • Exascale
Without GPUDirect
•   In a standard plug & play OS, the I/O driver and the GPU driver have separate DMA buffers in system memory
•   Three transfers to move data between I/O endpoint and GPU

[Diagram: I/O endpoint, PCIe switch, CPU, and system memory (with separate I/O driver space and GPU driver space); the GPU and its memory hang off the PCIe switch]

              1     I/O endpoint DMAs into system memory (the I/O driver's buffer)

              2     CPU copies data from the I/O driver DMA buffer into the GPU driver DMA buffer

              3     GPU DMAs from system memory into GPU memory
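
For concreteness, a minimal host-side sketch of this staged path, assuming a hypothetical character-device I/O driver whose read() drains the endpoint's DMA buffer; the pass through system memory is explicit:

#include <cuda_runtime.h>
#include <unistd.h>

/* Staged transfer without GPUDirect RDMA (sketch).  'io_fd' stands in
 * for a hypothetical I/O driver handle whose read() drains the
 * endpoint's DMA buffer out of system memory. */
void staged_transfer(int io_fd, void *d_gpu_buf, size_t len)
{
    void *h_staging;
    cudaMallocHost(&h_staging, len);          /* pinned system memory */

    /* Step 2: the endpoint has already DMA'd into the I/O driver's
     * buffer (step 1); the CPU copies it into the GPU driver's buffer. */
    read(io_fd, h_staging, len);

    /* Step 3: the GPU's DMA engine pulls the data from system memory. */
    cudaMemcpy(d_gpu_buf, h_staging, len, cudaMemcpyHostToDevice);

    cudaFreeHost(h_staging);
}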
GPUDirect RDMA
•   I/O endpoint and GPU communicate directly; only one transfer is required
•   Traffic is confined to the PCIe switch, with no CPU involvement in the DMA
•   An x86 CPU is still required in the system to run the NVIDIA driver




[Diagram: I/O endpoint and GPU both sit on the PCIe switch; data moves directly between them, bypassing the CPU and system memory]

                        1     I/O endpoint DMAs directly into GPU memory
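
From the application's side the change is small: the CUDA allocation itself becomes the DMA target. A sketch, assuming a hypothetical IO_SET_DMA_TARGET ioctl by which the endpoint's driver accepts a GPU buffer address:

#include <cuda_runtime.h>
#include <sys/ioctl.h>

#define IO_SET_DMA_TARGET 0x1000    /* hypothetical ioctl of the endpoint driver */

void direct_transfer(int io_fd, void **d_gpu_buf, size_t len)
{
    cudaMalloc(d_gpu_buf, len);     /* ordinary CUDA device allocation */

    /* Step 1: hand the device pointer to the I/O driver.  With GPUDirect
     * RDMA the driver resolves it to PCIe bus addresses and points the
     * endpoint's DMA engine straight at GPU memory. */
    ioctl(io_fd, IO_SET_DMA_TARGET, *d_gpu_buf);

    /* No cudaMemcpy: data lands in GPU memory without touching system
     * memory or the CPU. */
}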
Endpoints
•   GPUDirect RDMA is flexible and works with a wide range of existing devices
     –   Built on open PCIe standards
     –   Any I/O device that has a PCIe endpoint and DMA engine can utilize GPUDirect RDMA
     –   GPUDirect permeates both the frontend ingest and backend interconnects

                 •   FPGAs
                 •   Ethernet / InfiniBand adapters
                 •   Storage devices
                 •   Video capture cards
                 •   PCIe non-transparent (NT) ports


•   It’s free.
     –   Supported in CUDA 5.0 and Kepler
     –   Users can leverage APIs to implement RDMA with 3rd-party endpoints in their system


•   Practically no integration required
     –   No changes to device HW
     –   No changes to CUDA algorithms
     –   I/O device drivers need to use the GPU's DMA (bus) addresses instead of system-memory pages (see the driver-side sketch below)
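
On the kernel side, NVIDIA's GPUDirect RDMA interface (nv-p2p.h, shipped with the driver) pins a CUDA allocation and returns its PCIe bus addresses, which the I/O driver programs into its DMA engine just as it would system-memory pages. A simplified sketch with error handling elided; program_dma_descriptor stands in for the endpoint's own descriptor setup:

#include <nv-p2p.h>    /* GPUDirect RDMA kernel-mode interface (NVIDIA driver) */

extern void program_dma_descriptor(uint64_t bus_addr);  /* endpoint-specific */

static void free_callback(void *data)
{
    /* Called if the GPU mapping is torn down: invalidate any DMA
     * descriptors that still reference the GPU pages. */
}

/* 'gpu_va' is the CUDA device address (from cudaMalloc) passed down from
 * user space; address and length should be 64 KB aligned. */
int map_gpu_buffer(uint64_t gpu_va, uint64_t len)
{
    struct nvidia_p2p_page_table *pt = NULL;
    int i, ret;

    /* Pin the GPU pages and retrieve their PCIe bus addresses. */
    ret = nvidia_p2p_get_pages(0, 0, gpu_va, len, &pt, free_callback, NULL);
    if (ret)
        return ret;

    /* Program GPU bus addresses into the endpoint's DMA engine, exactly
     * where system-memory page addresses would otherwise go. */
    for (i = 0; i < pt->entries; i++)
        program_dma_descriptor(pt->pages[i]->physical_address);

    return 0;
}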
Frontend Ingest
[Diagram: IPN251 board. An ICS1572 XMC (Xilinx Virtex-6 FPGA with ADCs, ingesting RF signals) connects over PCIe gen1 x8 to a gen3 PCIe switch. On the switch: an Intel Ivy Bridge quad-core with 8 GB DDR3, a 384-core NVIDIA Kepler GPU with 2 GB GDDR5 (gen3 x16), and a Mellanox ConnectX-3 adapter (gen3 x8) providing InfiniBand and 10GigE]
     DMA Size    Latency (no RDMA)   Latency (with RDMA)      Δ      Throughput (no RDMA)   Throughput (with RDMA)

        16 KB         65.06 µs              4.09 µs        ↓15.9X         125 MB/s               2000 MB/s
        32 KB         77.38 µs              8.19 µs         ↓9.5X         211 MB/s               2000 MB/s
        64 KB        124.03 µs             16.38 µs         ↓7.6X         264 MB/s               2000 MB/s
       128 KB        208.26 µs             32.76 µs         ↓6.4X         314 MB/s               2000 MB/s
       256 KB        373.57 µs             65.53 µs         ↓5.7X         350 MB/s               2000 MB/s
       512 KB        650.52 µs            131.07 µs         ↓5.0X         402 MB/s               2000 MB/s
      1024 KB       1307.90 µs            262.14 µs         ↓4.9X         400 MB/s               2000 MB/s
      2048 KB       2574.33 µs            524.28 µs         ↓4.9X         407 MB/s               2000 MB/s
Pipeline with GPUDirect RDMA
[Timeline: successive blocks move through four overlapped stages (FPGA ADC, FPGA DMA, GPU CUDA, GPU DMA); while the GPU processes block N, the FPGA is already capturing and transferring block N+1]

        FPGA DMA     Transfer block directly to GPU via PCIe switch

        GPU CUDA     CUDA DSP kernels (FIR, FFT, etc.)

        GPU DMA      Transfer results to next processor
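
The overlap in this timeline is what CUDA streams express in software. A minimal sketch of the double-buffered loop, where fpga_dma_start()/fpga_dma_wait() are hypothetical wrappers around the endpoint driver and process_block stands in for the DSP kernels:

#include <cuda_runtime.h>

#define BLOCK_BYTES (1 << 20)       /* example block size */

__global__ void process_block(const float *in, float *out);  /* DSP kernel */

/* Hypothetical wrappers around the FPGA endpoint's driver. */
extern void fpga_dma_start(void *d_dst, size_t bytes);
extern void fpga_dma_wait(void);

void run_pipeline(int num_blocks)
{
    /* Double buffering: while the FPGA DMAs block N+1 into one GPU
     * buffer, CUDA processes block N from the other. */
    cudaStream_t stream[2];
    float *d_in[2], *d_out[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_in[i],  BLOCK_BYTES);
        cudaMalloc(&d_out[i], BLOCK_BYTES);
    }

    for (int blk = 0; blk < num_blocks; blk++) {
        int b = blk & 1;                          /* ping-pong index */
        fpga_dma_start(d_in[b], BLOCK_BYTES);     /* FPGA -> GPU via switch */
        fpga_dma_wait();                          /* block is in GPU memory */
        process_block<<<256, 256, 0, stream[b]>>>(d_in[b], d_out[b]);
        /* d_out[b] can then be DMA'd onward, e.g. by the ConnectX-3. */
    }
}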
Pipeline without GPUDirect RDMA
[Timeline: the same workload with an extra staging hop (FPGA ADC, FPGA DMA, GPU DMA, GPU CUDA, GPU DMA); each block is delayed by the pass through system memory and the pipeline overlaps far less]

        FPGA DMA     Transfer block to system memory

        GPU DMA      Transfer from system memory to GPU

        GPU CUDA     CUDA DSP kernels (FIR, FFT, etc.)

        GPU DMA      Transfer results to next processor
Backend Interconnects
•   Utilize GPUDirect RDMA across the network for low-latency IPC and system scalability




•   Mellanox OFED integration with GPUDirect RDMA – Q2 2013
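
Once OFED understands GPU memory, a CUDA-aware MPI can hand device pointers straight to the HCA, so a GPU-to-GPU exchange over InfiniBand needs no host staging. A sketch, assuming an MPI build with CUDA support enabled:

#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange a device buffer between ranks 0 and 1.  With GPUDirect RDMA,
 * the ConnectX HCA reads and writes GPU memory directly over the PCIe
 * switch; d_buf is a cudaMalloc'd pointer, not a host pointer. */
void exchange(float *d_buf, int count, int rank)
{
    if (rank == 0)
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}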
Topologies
 •   GPUDirect RDMA works in many system topologies

      –   Single I/O endpoint and single GPU
      –   Single I/O endpoint and multiple GPUs
      –   Multiple I/O endpoints and single GPU
      –   Multiple I/O endpoints and multiple GPUs

      Each topology works with or without a PCIe switch downstream of the CPU. A sketch mapping each pipeline onto its own CUDA stream follows the diagram.

[Diagram: CPU and system memory sit above a PCIe switch; below the switch, two I/O endpoints stream into two GPUs, forming independent DMA pipelines #1 and #2]
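
Each endpoint-to-GPU pair in the diagram maps naturally onto its own device and stream. A sketch of per-pipeline setup for the two-endpoint, two-GPU case (the io_fd endpoint handle is hypothetical):

#include <cuda_runtime.h>

/* One DMA pipeline per (endpoint, GPU) pair; each pair runs independently
 * through the PCIe switch with no CPU on the data path. */
typedef struct {
    int          gpu;       /* CUDA device index */
    int          io_fd;     /* handle to this pipeline's I/O endpoint */
    cudaStream_t stream;    /* independent work queue on that GPU */
    void        *d_buf;     /* RDMA target buffer in that GPU's memory */
} pipeline_t;

void init_pipeline(pipeline_t *p, int gpu, int io_fd, size_t bytes)
{
    p->gpu   = gpu;
    p->io_fd = io_fd;
    cudaSetDevice(gpu);             /* bind the allocations below to this GPU */
    cudaStreamCreate(&p->stream);
    cudaMalloc(&p->d_buf, bytes);
}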
Impacts of GPUDirect
 •   Decreased latency
      –   Eliminate redundant copies over PCIe + added latency from CPU
      –   ~5x reduction, depending on the I/O endpoint
      –   Perform round-trip operations on GPU in microseconds, not milliseconds




                                enables new CUDA applications

 •   Increased PCIe efficiency + bandwidth
      –   bypass system memory, MMU, root complex: limiting factors for GPU DMA transfers

 •   Decreased CPU utilization
      –   CPU is no longer burning cycles shuffling data around for the GPU
      –   System memory is no longer being thrashed by DMA engines on endpoint + GPU
      –   Go from 100% core utilization per GPU to < 10% utilization per GPU




                                     promotes multi-GPU
Before GPUDirect…
•   Many multi-GPU systems were limited to two GPUs per node
•   Additional GPUs quickly choked system memory and CPU resources

•   Dual-socket CPU design very common
•   Dual root-complex prevents P2P across CPUs (QPI/HT untraversable)


[Diagram: dual-socket system: two CPUs, each with its own system memory and one GPU; the I/O endpoint attaches to only one of the CPUs]

                                           GFLOPS/watt
                                      Xeon E5      K20X     system
                        SGEMM           2.32       12.34      8.45
                        DGEMM           1.14        5.19      3.61
Rise of Multi-GPU
 •   Higher GPU-to-CPU ratios permitted by increased GPU autonomy from GPUDirect
 •   PCIe switches integral to design, for true CPU bypass and fully-connected P2P



[Diagram: one CPU with system memory; a PCIe switch below it connects one I/O endpoint and four GPUs]

                                           GFLOPS/watt
                        GPU:CPU ratio        1 to 1          4 to 1
                          SGEMM                8.45           10.97
                          DGEMM                3.61            4.64
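
Behind a shared PCIe switch, the GPUs themselves can also exchange data peer-to-peer without touching system memory, and the CUDA runtime exposes this directly. A minimal sketch:

#include <cuda_runtime.h>

/* Copy a buffer from GPU 0 to GPU 1.  With peer access enabled the copy
 * travels GPU-to-GPU over the PCIe switch; otherwise the runtime stages
 * it through system memory. */
void p2p_copy(void *d_dst_gpu1, const void *d_src_gpu0, size_t bytes)
{
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   /* does the topology allow it? */
    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    /* flags must be 0 */
    }
    cudaMemcpyPeer(d_dst_gpu1, 1, d_src_gpu0, 0, bytes);
}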
Nested PCIe switches
 •   Nested hierarchies avert the 96-lane limit of current PCIe switches


[Diagram: CPU and system memory above a root PCIe switch; two downstream PCIe switches each connect an I/O endpoint and four GPUs]

                                           GFLOPS/watt
                        GPU:CPU ratio        1 to 1          4 to 1          8 to 1
                           SGEMM               8.45           10.97           11.61
                           DGEMM               3.61            4.64            4.91
PCIe over Fiber
     •     SWaP is our new limiting factor
     •     PCIe over fiber-optic links can interconnect expansion blades and sustain GPU:CPU growth
     •     Supports PCIe gen3 over at least 100 meters

[Diagram: CPU and system memory above a PCIe switch, fanned out over PCIe-over-fiber links to four expansion blades (1 through 4); each blade carries its own PCIe switch, an I/O endpoint, and four GPUs]

                                           GFLOPS/watt
                        GPU:CPU ratio        1 to 1       4 to 1       8 to 1      16 to 1
                          SGEMM                8.45        10.97        11.61        11.96
                          DGEMM                3.61         4.64         4.91         5.04
Scalability – 10 petaflops
                Processor Power Consumption for 10 petaflops
    GPU:CPU ratio      1 to 1        4 to 1        8 to 1       16 to 1
     SGEMM            1184 kW        911 kW        862 kW        832 kW
     DGEMM            2770 kW       2158 kW       2032 kW       1982 kW

                               Yearly Energy Bill
    GPU:CPU ratio      1 to 1        4 to 1        8 to 1       16 to 1
     SGEMM          $1,050,326     $808,148      $764,680      $738,067
     DGEMM          $2,457,266   $1,914,361    $1,802,586    $1,758,232

                               Efficiency Savings
    GPU:CPU ratio      1 to 1        4 to 1        8 to 1       16 to 1
     SGEMM               --          23.05%        27.19%        29.73%
     DGEMM               --          22.09%        26.64%        28.45%
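
As a quick sanity check, the bills are consistent with continuous year-round operation at roughly $0.10/kWh (a rate inferred from the table, not stated in the deck): 1,184 kW × 8,760 h × ~$0.101/kWh ≈ $1.05M, matching the 1-to-1 SGEMM entry.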
Road to Exascale

       GPUDirect RDMA: integration with peripherals & interconnects for system scalability

       Project Denver: hybrid CPU/GPU for heterogeneous compute

       Project Osprey: parallel microarchitecture & fab-process optimizations for power efficiency

       Software: CUDA, OpenCL, OpenACC; drivers & management layer; hybrid O/S
Project Denver
 •   64-bit ARMv8 architecture
 •   ISA by ARM, chip by NVIDIA
 •   ARM’s RISC-based approach aligns with NVIDIA’s perf/Watt initiative




 •   Unlike licensed Cortex-A53 and -A57 cores, NVIDIA's cores are highly customized
 •   Design flexibility required for tight CPU/GPU integration
ceepie-geepie

[Diagram: Denver-class SoC ("ceepie-geepie"): four ARM cores with an L2 cache alongside two clusters of three SMs sharing an L2 cache; grid management + scheduler; PCIe root complex / endpoint; three memory controllers; and on-chip h.264, crypto, SATA, USB, NPN240, and DVI blocks]

ceepie-geepie




[Diagram: two ceepie-geepie SoCs joined through a PCIe switch with an I/O endpoint attached; one SoC exposes a PCIe root complex and the other a PCIe endpoint, while each keeps its own ARM cores, SMs, grid management + scheduler, memory controllers, and peripherals]
Impacts of Project Denver
 •   Heterogeneous compute
      –   share mixed work efficiently between CPU/GPU
      –   unified memory subsystem


 •   Decreased latency
      –   no PCIe required for synchronization + command queuing


 •   Power efficiency
      –   ARMv8 ISA
      –   on-chip buses & interconnects
      –   “4+1” power scaling


 •   Natively boot & run operating systems

 •   Direct connectivity with peripherals

 •   Maximize multi-GPU
Project Osprey
 •   DARPA PERFECT – 75 GFLOPS/watt
Summary
• System-wide integration of GPUDirect RDMA
• Increase GPU:CPU ratio for better GFLOPS/watt
• Utilize PCIe switches instead of dual-socket CPU/IOH


• Exascale compute – what’s good for HPC is good for embedded/mobile
• Denver & Osprey – what’s good for embedded/mobile is good for HPC



                           questions?

                            booth 201

Contenu connexe

Tendances

Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerYongseok Oh
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideKaran Singh
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPThomas Graf
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsJignesh Shah
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
DevConf 2014 Kernel Networking Walkthrough
DevConf 2014   Kernel Networking WalkthroughDevConf 2014   Kernel Networking Walkthrough
DevConf 2014 Kernel Networking WalkthroughThomas Graf
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Ray Jenkins
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughThomas Graf
 
BKK16-208 EAS
BKK16-208 EASBKK16-208 EAS
BKK16-208 EASLinaro
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking ExplainedThomas Graf
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InSage Weil
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningMilind Koyande
 
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)Satoshi Shimazaki
 
Ovs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offloadOvs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offloadKevin Traynor
 
eBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KerneleBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KernelThomas Graf
 

Tendances (20)

Revisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS SchedulerRevisiting CephFS MDS and mClock QoS Scheduler
Revisiting CephFS MDS and mClock QoS Scheduler
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
Dpdk pmd
Dpdk pmdDpdk pmd
Dpdk pmd
 
Ceph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing GuideCeph Object Storage Reference Architecture Performance and Sizing Guide
Ceph Object Storage Reference Architecture Performance and Sizing Guide
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
DevConf 2014 Kernel Networking Walkthrough
DevConf 2014   Kernel Networking WalkthroughDevConf 2014   Kernel Networking Walkthrough
DevConf 2014 Kernel Networking Walkthrough
 
Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!Understanding eBPF in a Hurry!
Understanding eBPF in a Hurry!
 
LinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking WalkthroughLinuxCon 2015 Linux Kernel Networking Walkthrough
LinuxCon 2015 Linux Kernel Networking Walkthrough
 
BKK16-208 EAS
BKK16-208 EASBKK16-208 EAS
BKK16-208 EAS
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking Explained
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and Tuning
 
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
OSC2011 Tokyo/Spring 自宅SAN友の会(前半)
 
Ovs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offloadOvs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offload
 
eBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux KerneleBPF - Rethinking the Linux Kernel
eBPF - Rethinking the Linux Kernel
 

En vedette

Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer BuildPetteriTeikariPhD
 
Communication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big DataCommunication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big Datainside-BigData.com
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - ConfooSirKetchup
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jancstalks
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...AMD Developer Central
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Storti Mario
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of ComputingIntel Nervana
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux ClubOfer Rosenberg
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteNVIDIA
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Introduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntroduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntel Nervana
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUNur Ahmadi
 

En vedette (20)

Deep Learning Computer Build
Deep Learning Computer BuildDeep Learning Computer Build
Deep Learning Computer Build
 
Communication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big DataCommunication Frameworks for HPC and Big Data
Communication Frameworks for HPC and Big Data
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - Confoo
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jan
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
 
Gpgpu intro
Gpgpu introGpgpu intro
Gpgpu intro
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
 
Gpgpu
GpgpuGpgpu
Gpgpu
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 Keynote
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Introduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntroduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will Constable
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPU
 

Similaire à GPUDirect RDMA and Green Multi-GPU Architectures

Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processorsaccount inactive
 
Power Mac G5 ( Late 2005) Technical Specifications
Power  Mac  G5 ( Late 2005)    Technical  SpecificationsPower  Mac  G5 ( Late 2005)    Technical  Specifications
Power Mac G5 ( Late 2005) Technical SpecificationsSocial Media Marketing
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
ArcGIS Server a Brief Synopsis
ArcGIS Server a Brief SynopsisArcGIS Server a Brief Synopsis
ArcGIS Server a Brief Synopsisewug
 
Introducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin ProcessorsIntroducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin ProcessorsAnalog Devices, Inc.
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhSaurabh Kumar
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))andrew maybir
 
Advertising System Upgrade
Advertising System UpgradeAdvertising System Upgrade
Advertising System Upgradeandrew maybir
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationxKinAnx
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 HardwareJacob Wu
 
Stream Processing
Stream ProcessingStream Processing
Stream Processingarnamoy10
 
VMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol ShootoutVMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol ShootoutVMworld
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrjRoberto Brandao
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)Naoto MATSUMOTO
 

Similaire à GPUDirect RDMA and Green Multi-GPU Architectures (20)

PG-Strom
PG-StromPG-Strom
PG-Strom
 
Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processors
 
Power Mac G5 ( Late 2005) Technical Specifications
Power  Mac  G5 ( Late 2005)    Technical  SpecificationsPower  Mac  G5 ( Late 2005)    Technical  Specifications
Power Mac G5 ( Late 2005) Technical Specifications
 
Power Mac G5 ( Late 2005) Technical Specifications
Power  Mac  G5 ( Late 2005)    Technical  SpecificationsPower  Mac  G5 ( Late 2005)    Technical  Specifications
Power Mac G5 ( Late 2005) Technical Specifications
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
ArcGIS Server a Brief Synopsis
ArcGIS Server a Brief SynopsisArcGIS Server a Brief Synopsis
ArcGIS Server a Brief Synopsis
 
Introducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin ProcessorsIntroducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin Processors
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
 
Beagle board
Beagle boardBeagle board
Beagle board
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))
 
Advertising System Upgrade
Advertising System UpgradeAdvertising System Upgrade
Advertising System Upgrade
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
Ibm power7
Ibm power7Ibm power7
Ibm power7
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentation
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 Hardware
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
VMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol ShootoutVMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)
 

Plus de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Plus de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Dernier

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
GPUDirect RDMA and Green Multi-GPU Architectures

  • 1. GPUDirect RDMA and Green Multi-GPU Architectures. GE Intelligent Platforms, Mil/Aero Embedded Computing. Dustin Franklin, GPGPU Applications Engineer. dustin.franklin@ge.com, 443.310.9812 (Washington, DC)
  • 2. What this talk is about • GPU Autonomy • GFLOPS/watt and SWAP • Project Denver • Exascale
  • 3. Without GPUDirect
    • In a standard plug & play OS, the two drivers have separate DMA buffers in system memory
    • Three transfers to move data between I/O endpoint and GPU
    [diagram: I/O endpoint, PCIe switch, CPU, system memory, GPU, GPU memory]
      1. I/O endpoint DMA's into system memory
      2. CPU copies data from the I/O driver DMA buffer into the GPU DMA buffer
      3. GPU DMA's from system memory into GPU memory
  • 4. GPUDirect RDMA
    • I/O endpoint and GPU communicate directly; only one transfer required
    • Traffic limited to the PCIe switch, no CPU involvement in DMA
    • x86 CPU is still necessary in the system, to run the NVIDIA driver
    [diagram: I/O endpoint, PCIe switch, CPU, system memory, GPU, GPU memory]
      1. I/O endpoint DMA's directly into GPU memory
  • 5. Endpoints
    • GPUDirect RDMA is flexible and works with a wide range of existing devices
      – Built on open PCIe standards
      – Any I/O device that has a PCIe endpoint and DMA engine can utilize GPUDirect RDMA
      – GPUDirect permeates both the frontend ingest and backend interconnects
      • FPGAs • Ethernet / InfiniBand adapters • Storage devices • Video capture cards • PCIe non-transparent (NT) ports
    • It's free
      – Supported in CUDA 5.0 and Kepler
      – Users can leverage APIs to implement RDMA with 3rd-party endpoints in their system (see the sketch below)
    • Practically no integration required
      – No changes to device HW
      – No changes to CUDA algorithms
      – I/O device drivers need to use DMA addresses of the GPU instead of SYSRAM pages
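To make the last bullet concrete, here is a minimal host-side sketch of handing a CUDA buffer to an endpoint driver. The device node /dev/my_endpoint, the ioctl command MY_IOCTL_PIN_GPU, and the pin_request struct are hypothetical stand-ins: the real interface belongs to the 3rd-party driver, which would pin the address range via the GPUDirect RDMA kernel API before programming its DMA engine.

    // Hypothetical user-space side of a GPUDirect RDMA integration.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <stdio.h>

    struct pin_request { uint64_t gpu_addr; uint64_t length; }; // hypothetical ABI
    #define MY_IOCTL_PIN_GPU 0x1001                             // hypothetical

    int main(void)
    {
        void  *gpu_buf;
        size_t len = 2 << 20;                      // 2 MB ingest buffer
        cudaMalloc(&gpu_buf, len);                 // ordinary CUDA allocation

        struct pin_request req = { (uint64_t)(uintptr_t)gpu_buf, len };
        int fd = open("/dev/my_endpoint", O_RDWR); // hypothetical device node
        if (fd < 0 || ioctl(fd, MY_IOCTL_PIN_GPU, &req) != 0) {
            fprintf(stderr, "endpoint driver could not pin GPU buffer\n");
            return 1;
        }
        // From here the endpoint DMA's straight into gpu_buf; CUDA kernels
        // consume the data with no copy through system memory.
        return 0;
    }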
  • 6. Frontend Ingest
    [diagram: IPN251 board. Intel Ivybridge quad-core with 2x 8GB DDR3; ICS1572 XMC (Xilinx Virtex-6 FPGA + ADCs, RF signals in) at PCIe gen1 x8; 384-core NVIDIA Kepler GPU with 2GB GDDR5; Mellanox ConnectX-3 (InfiniBand / 10GigE) at gen3 x8; all joined through an onboard PCIe switch with a gen3 x16 link]

    DMA size | latency, no RDMA | latency, with RDMA | Δ      | throughput, no RDMA | throughput, with RDMA
    16 KB    | 65.06 µs         | 4.09 µs            | ↓15.9X | 125 MB/s            | 2000 MB/s
    32 KB    | 77.38 µs         | 8.19 µs            | ↓9.5X  | 211 MB/s            | 2000 MB/s
    64 KB    | 124.03 µs        | 16.38 µs           | ↓7.6X  | 264 MB/s            | 2000 MB/s
    128 KB   | 208.26 µs        | 32.76 µs           | ↓6.4X  | 314 MB/s            | 2000 MB/s
    256 KB   | 373.57 µs        | 65.53 µs           | ↓5.7X  | 350 MB/s            | 2000 MB/s
    512 KB   | 650.52 µs        | 131.07 µs          | ↓5.0X  | 402 MB/s            | 2000 MB/s
    1024 KB  | 1307.90 µs       | 262.14 µs          | ↓4.9X  | 400 MB/s            | 2000 MB/s
    2048 KB  | 2574.33 µs       | 524.28 µs          | ↓4.9X  | 407 MB/s            | 2000 MB/s
  • 7. Frontend Ingest (same IPN251 diagram and DMA benchmark table as slide 6)
  • 8. Pipeline with GPUDirect RDMA
    [timing diagram: ADC blocks stream through FPGA DMA, CUDA, and outbound GPU DMA stages, overlapped block-by-block]
      • DMA FPGA: transfer block directly to GPU via PCIe switch
      • CUDA GPU: DSP kernels (FIR, FFT, etc.)
      • DMA GPU: transfer results to next processor
    (a CUDA streams sketch of this overlap follows)
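A minimal sketch of the overlapped pipeline using CUDA streams, assuming the endpoint RDMAs block i into one of two device buffers and signals arrival through an endpoint-specific mechanism (elided here). Processing of one block overlaps the output copy of the previous one.

    #include <cuda_runtime.h>

    // Toy 2-tap FIR standing in for the real DSP kernels.
    __global__ void fir_filter(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * in[i] + 0.5f * (i ? in[i - 1] : 0.0f);
    }

    // host_out is assumed pinned (cudaHostAlloc) so the copies are truly async.
    void process_stream(float *gpu_in[2], float *gpu_out[2],
                        float *host_out, int n, int num_blocks)
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int b = 0; b < num_blocks; ++b) {
            int buf = b & 1;                 // ping-pong between two buffers
            // wait_for_rdma_block(b);       // endpoint-specific signal, elided
            fir_filter<<<(n + 255) / 256, 256, 0, s[buf]>>>(gpu_in[buf],
                                                            gpu_out[buf], n);
            cudaMemcpyAsync(host_out + (size_t)b * n, gpu_out[buf],
                            n * sizeof(float), cudaMemcpyDeviceToHost, s[buf]);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }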
  • 9. Pipeline without GPUDirect RDMA
    [timing diagram: the same block stream, now with an extra staging hop through system memory before CUDA processing]
      • DMA FPGA: transfer block to system memory
      • DMA GPU: transfer from system memory to GPU
      • CUDA GPU: DSP kernels (FIR, FFT, etc.)
      • DMA GPU: transfer results to next processor
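For contrast, a sketch of the non-RDMA ingest path: the endpoint lands each block in a pinned system-memory buffer, and an extra host-to-device copy is needed before the kernels above can run. The read_block_from_endpoint call is a hypothetical placeholder for the endpoint driver's receive path.

    #include <cuda_runtime.h>

    void staged_block(float *gpu_in, int n, cudaStream_t s)
    {
        static float *staging = NULL;
        if (!staging)   // pinned system-memory bounce buffer
            cudaHostAlloc((void **)&staging, n * sizeof(float),
                          cudaHostAllocDefault);

        // read_block_from_endpoint(staging, n); // endpoint DMA into SYSRAM, elided
        cudaMemcpyAsync(gpu_in, staging, n * sizeof(float),
                        cudaMemcpyHostToDevice, s); // the hop RDMA eliminates
    }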
  • 10. Backend Interconnects
    • Utilize GPUDirect RDMA across the network for low-latency IPC and system scalability
    • Mellanox OFED integration with GPUDirect RDMA – Q2 2013 (verbs sketch below)
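With the OFED integration, GPU memory is registered through the same InfiniBand verbs call used for host memory. A hedged sketch, assuming a GPUDirect-RDMA-enabled Mellanox OFED stack that recognizes device pointers passed to ibv_reg_mr:

    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_gpu_region(struct ibv_pd *pd, size_t len)
    {
        void *gpu_buf;
        cudaMalloc(&gpu_buf, len);   // device memory, not SYSRAM

        // Stock verbs call; the GPUDirect-enabled stack pins the GPU pages
        // so the HCA DMA's directly to/from GPU memory over the network.
        return ibv_reg_mr(pd, gpu_buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }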
  • 11. Topologies
    • GPUDirect RDMA works in many system topologies (peer-access sketch below)
      – Single I/O endpoint and single GPU
      – Single I/O endpoint and multiple GPUs, with or without a PCIe switch downstream of the CPU
      – Multiple I/O endpoints and single GPU
      – Multiple I/O endpoints and multiple GPUs
    [diagram: CPU and system memory above a PCIe switch; two I/O endpoints and two GPUs below it, each endpoint/GPU pair forming an independent DMA pipeline (#1 and #2)]
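In the multi-GPU topologies, peer-to-peer access between GPUs under the same switch lets one pipeline hand data to the other without bouncing through SYSRAM. A minimal sketch of checking and enabling it:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int enable_p2p(int dev_a, int dev_b)
    {
        int can_ab = 0, can_ba = 0;
        cudaDeviceCanAccessPeer(&can_ab, dev_a, dev_b);
        cudaDeviceCanAccessPeer(&can_ba, dev_b, dev_a);
        if (!can_ab || !can_ba) {
            // Typical on dual root-complex boards: QPI/HT is untraversable for P2P
            fprintf(stderr, "no P2P path between GPU %d and GPU %d\n",
                    dev_a, dev_b);
            return -1;
        }
        cudaSetDevice(dev_a);
        cudaDeviceEnablePeerAccess(dev_b, 0);  // flags must be 0
        cudaSetDevice(dev_b);
        cudaDeviceEnablePeerAccess(dev_a, 0);
        return 0;
    }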
  • 12. Impacts of GPUDirect
    • Decreased latency
      – Eliminates redundant copies over PCIe and the added latency from the CPU
      – ~5x reduction, depending on the I/O endpoint
      – Perform round-trip operations on the GPU in microseconds, not milliseconds: enables new CUDA applications
    • Increased PCIe efficiency + bandwidth
      – Bypass system memory, MMU, root complex: the limiting factors for GPU DMA transfers
    • Decreased CPU utilization
      – CPU is no longer burning cycles shuffling data around for the GPU
      – System memory is no longer being thrashed by DMA engines on the endpoint + GPU
      – Go from 100% core utilization per GPU to < 10% utilization per GPU: promotes multi-GPU
  • 13. Before GPUDirect…
    • Many multi-GPU systems had 2 GPU nodes
    • Additional GPUs quickly choked system memory and CPU resources
    • Dual-socket CPU design very common
    • Dual root complex prevents P2P across CPUs (QPI/HT untraversable)
    [diagram: two CPUs, each with its own system memory; I/O endpoint and one GPU per socket]

    GFLOPS/watt | Xeon E5 | K20X  | system
    SGEMM       | 2.32    | 12.34 | 8.45
    DGEMM       | 1.14    | 5.19  | 3.61
  • 14. Rise of Multi-GPU
    • Higher GPU-to-CPU ratios permitted by increased GPU autonomy from GPUDirect
    • PCIe switches integral to the design, for true CPU bypass and fully-connected P2P
    [diagram: one CPU with system memory; I/O endpoint and four GPUs on a shared PCIe switch]

    GFLOPS/watt (GPU:CPU ratio) | 1 to 1 | 4 to 1
    SGEMM                       | 8.45   | 10.97
    DGEMM                       | 3.61   | 4.64
  • 15. Nested PCIe switches
    • Nested hierarchies avert the 96-lane limit on current PCIe switches
    [diagram: top-level PCIe switch under the CPU fanning out to two second-level switches, each with an I/O endpoint and four GPUs]

    GFLOPS/watt (GPU:CPU ratio) | 1 to 1 | 4 to 1 | 8 to 1
    SGEMM                       | 8.45   | 10.97  | 11.61
    DGEMM                       | 3.61   | 4.64   | 4.91
  • 16. PCIe over Fiber
    • SWaP is our new limiting factor
    • PCIe over fiber-optic can interconnect expansion blades and assure GPU:CPU growth
    • Supports PCIe gen3 over at least 100 meters
    [diagram: host CPU and PCIe switch fanned out over fiber to four expansion blades, each with its own PCIe switch, I/O endpoint, and four GPUs]

    GFLOPS/watt (GPU:CPU ratio) | 1 to 1 | 4 to 1 | 8 to 1 | 16 to 1
    SGEMM                       | 8.45   | 10.97  | 11.61  | 11.96
    DGEMM                       | 3.61   | 4.64   | 4.91   | 5.04
  • 17. Scalability – 10 petaflops

    Processor power consumption for 10 petaflops (GPU:CPU ratio)
          | 1 to 1  | 4 to 1  | 8 to 1  | 16 to 1
    SGEMM | 1184 kW | 911 kW  | 862 kW  | 832 kW
    DGEMM | 2770 kW | 2158 kW | 2032 kW | 1982 kW

    Yearly energy bill
          | 1 to 1     | 4 to 1     | 8 to 1     | 16 to 1
    SGEMM | $1,050,326 | $808,148   | $764,680   | $738,067
    DGEMM | $2,457,266 | $1,914,361 | $1,802,586 | $1,758,232

    Efficiency savings
          | 1 to 1 | 4 to 1 | 8 to 1 | 16 to 1
    SGEMM | --     | 23.05% | 27.19% | 29.73%
    DGEMM | --     | 22.09% | 26.64% | 28.45%
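These tables follow directly from the GFLOPS/watt figures on slide 16: power = 10 PFLOPS / efficiency, and the yearly bill implies an electricity rate of roughly $0.101/kWh (the rate is inferred from the table, an assumption of this sketch). A worked check that reproduces the SGEMM row to within the slides' rounding:

    #include <stdio.h>

    int main(void)
    {
        const double target_gflops = 1.0e7;    // 10 petaflops, in GFLOPS
        const double rate = 0.10127;           // $/kWh, inferred from the table
        const double sgemm_eff[] = { 8.45, 10.97, 11.61, 11.96 };  // GFLOPS/W
        const char  *ratio[]     = { "1:1", "4:1", "8:1", "16:1" };

        for (int i = 0; i < 4; ++i) {
            double kw   = target_gflops / sgemm_eff[i] / 1000.0;  // kW drawn
            double bill = kw * 8760.0 * rate;                     // $ per year
            printf("SGEMM %-5s %7.0f kW  $%9.0f per year\n",
                   ratio[i], kw, bill);
        }
        return 0;
    }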
  • 18. Road to Exascale
    • GPUDirect RDMA: integration with peripherals & interconnects for system scalability
    • Project Denver: hybrid CPU/GPU for heterogeneous compute
    • Project Osprey: parallel microarchitecture & fab process optimizations for power efficiency
    • Software: CUDA, OpenCL, OpenACC; drivers & management layer; hybrid O/S
  • 19. Project Denver
    • 64-bit ARMv8 architecture
    • ISA by ARM, chip by NVIDIA
    • ARM's RISC-based approach aligns with NVIDIA's perf/Watt initiative
    • Unlike licensed Cortex-A53 and -A57 cores, NVIDIA's cores are highly customized
    • Design flexibility required for tight CPU/GPU integration
  • 20. ceepie-geepie
    [block diagram of a Denver-class SoC: four ARM cores with shared L2 cache alongside CUDA SMs with their own L2 cache and a grid management + scheduler; three memory controllers; on-chip peripherals (h.264, crypto, SATA, USB) plus an NPN240 with DVI outputs; fronted by a PCIe root complex / endpoint]
  • 21. ceepie-geepie
    [block diagram: two such SoCs, one exposing a PCIe root complex and the other a PCIe endpoint, joined through a PCIe switch that also hosts an I/O endpoint]
  • 22. Impacts of Project Denver
    • Heterogeneous compute
      – Share mixed work efficiently between CPU and GPU
      – Unified memory subsystem
    • Decreased latency
      – No PCIe required for synchronization + command queuing
    • Power efficiency
      – ARMv8 ISA
      – On-chip buses & interconnects
      – “4+1” power scaling
    • Natively boot & run operating systems
    • Direct connectivity with peripherals
    • Maximize multi-GPU
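Denver's unified memory subsystem removes the host/device split entirely. The nearest CUDA 5-era analogue on discrete GPUs, sketched here for comparison, is zero-copy mapped host memory: one allocation visible to both CPU and GPU, though GPU accesses still traverse PCIe rather than an on-chip bus.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= k;
    }

    int main(void)
    {
        int    n = 1 << 20;
        float *host_ptr, *dev_ptr;

        cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped host memory
        cudaHostAlloc((void **)&host_ptr, n * sizeof(float),
                      cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);

        for (int i = 0; i < n; ++i) host_ptr[i] = 1.0f; // CPU writes...
        scale<<<(n + 255) / 256, 256>>>(dev_ptr, 2.0f, n); // ...GPU updates in place
        cudaDeviceSynchronize();                 // host_ptr[i] is now 2.0f
        return 0;
    }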
  • 23. Project Osprey
    • DARPA PERFECT – 75 GFLOPS/watt
  • 24. Summary
    • System-wide integration of GPUDirect RDMA
    • Increase GPU:CPU ratio for better GFLOPS/watt
    • Utilize PCIe switches instead of dual-socket CPU/IOH
    • Exascale compute: what's good for HPC is good for embedded/mobile
    • Denver & Osprey: what's good for embedded/mobile is good for HPC
    Questions? Booth 201