GPU Architecture Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes We describe the motivation for discussing underlying device architecture, since device architecture is often avoided in conventional programming courses. We contrast conventional multicore CPU architecture with a high-level view of AMD and Nvidia GPU architecture. This lecture starts with a high-level architectural view of all GPUs, discusses each vendor’s architecture, and then converges back to the OpenCL spec. We stress the difference between the AMD VLIW architecture and the Nvidia scalar architecture, and also discuss the different memory architectures. A brief discussion of the ICD and the OpenCL compilation flow provides a lead-in to Lecture 5, where the first complete OpenCL program is written. 2 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Topics Mapping the OpenCL spec to many-core hardware  AMD GPU Architecture Nvidia GPU Architecture Cell Broadband Engine OpenCL Specific Topics OpenCL Compilation System Installable Client Driver (ICD) 3 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Motivation Why are we discussing vendor-specific hardware if OpenCL is platform independent? To gain intuition for how a program’s loops and data need to map to OpenCL kernels in order to obtain performance, and to observe similarities and differences between Nvidia and AMD hardware. Understanding the hardware will allow for platform-specific tuning of code in later lectures. 4 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Conventional CPU Architecture Space is devoted to control logic instead of ALUs. CPUs are optimized to minimize the latency of a single thread and can efficiently handle control-flow-intensive workloads. Multi-level caches are used to hide latency. The number of registers is limited because the number of active threads is small. Control logic reorders execution, provides ILP, and minimizes pipeline stalls. [Figure: conventional CPU block diagram, with control logic, an ALU, L1/L2/L3 caches, and a ~25 GBps bus to system memory.] A present-day multicore CPU could have more than one ALU (typically < 32), and some of the cache hierarchy is usually shared across cores. 5 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Modern GPGPU Architecture A generic many-core GPU: less space is devoted to control logic and caches; large register files support multiple thread contexts; thread switching is low-latency and hardware-managed; there is a large number of ALUs per “core”, with a small user-managed cache per core; and the memory bus is optimized for bandwidth, where ~150 GBps allows a large number of ALUs to be serviced simultaneously. [Figure: generic GPU, with a high-bandwidth bus connecting on-board system memory to many simple ALUs and small caches.] 6 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
AMD GPU Hardware Architecture AMD 5870 – Cypress: 20 SIMD engines, 16 stream cores per SIMD engine, 5 multiply-adds per stream core (VLIW processing); 2.72 teraflops single precision, 544 gigaflops double precision. Source: Introductory OpenCL SAAHPC2010, Benedict R. Gaster 7 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
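As a sanity check on those peak numbers (assuming the 5870's 850 MHz engine clock and counting a multiply-add as 2 FLOPs): 20 SIMD engines x 16 stream cores x 5 ALUs = 1600 ALUs, and 1600 ALUs x 2 FLOPs x 0.85 GHz ≈ 2.72 teraflops single precision.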
SIMD Engine A SIMD engine consists of a set of “stream cores”. Stream cores are arranged as a five-way Very Long Instruction Word (VLIW) processor: up to five scalar operations can be issued in a VLIW instruction, with scalar operations executed on each processing element. Stream cores within a compute unit execute the same VLIW instruction. The block of work-items that executes together is called a wavefront: 64 work-items on the 5870. [Figure: one SIMD engine and one stream core, showing instruction and control flow, processing elements plus a T-processing element, a branch execution unit, and general-purpose registers.] Source: AMD Accelerated Parallel Processing OpenCL Programming Guide 8 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
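Because the hardware schedules work in 64-work-item wavefronts, work-group sizes are normally chosen as a multiple of 64 so that no wavefront runs partially populated. A minimal host-side sketch (the queue, kernel, and problem size are assumed to already exist):

    // Launch a 1D kernel with work-groups matched to the 5870's
    // 64-work-item wavefront, so no wavefront is left partially filled.
    size_t global_size = 1024 * 1024;  // total work-items (illustrative)
    size_t local_size  = 64;           // one wavefront per work-group
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, &local_size,
                                        0, NULL, NULL);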
AMD Platform as seen in OpenCL Individual work-items execute on a single processing element Processing element refers to a single VLIW core Multiple work-groups execute on a compute unit A compute unit refers to a SIMD Engine 9 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
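This mapping is visible directly in OpenCL C through the work-item built-ins; a small illustrative kernel (the buffer name is hypothetical):

    // Each work-item runs on one processing element; each work-group
    // is scheduled onto one compute unit (a SIMD engine here).
    __kernel void where_am_i(__global int *group_ids)
    {
        size_t gid = get_global_id(0);   // index across the whole NDRange
        size_t lid = get_local_id(0);    // index within this work-group
        group_ids[gid] = (int)get_group_id(0); // record which group ran us
    }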
AMD GPU Memory Architecture Memory per compute unit: a local data store (on-chip), registers, and an L1 cache (8KB on the 5870). An L2 cache is shared between compute units (512KB on the 5870). A fast path exists for 32-bit operations only; a complete path handles atomics and sub-32-bit operations. [Figure: a SIMD engine with LDS and registers and an L1 cache, connected through a compute-unit-to-memory crossbar to the L2 cache, write cache, and atomic path.] 10 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
AMD Memory Model in OpenCL A subset of the hardware memory is exposed in OpenCL. The Local Data Share (LDS) is exposed as local memory, used to share data between work-items of a work-group; it is designed to increase performance, with high-bandwidth access per SIMD engine. Private memory utilizes registers per work-item. Constant memory: __constant tags utilize the L1 cache. [Figure: the OpenCL memory model, with private memory per work-item, local memory per compute unit, a global/constant memory data cache, and compute-device global memory.] 11 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
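A typical use of LDS-backed local memory is staging data that the whole work-group then reuses at low latency. A sketch (the names are illustrative; the __local buffer is sized from the host by calling clSetKernelArg with a size and a NULL pointer):

    // Cooperatively load one tile into the LDS, then reuse it group-wide.
    __kernel void stage_tile(__global const float *in,
                             __global float *out,
                             __local  float *tile)
    {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        tile[lid] = in[gid];              // one global read per work-item
        barrier(CLK_LOCAL_MEM_FENCE);     // tile now visible to the group
        size_t n = get_local_size(0);
        out[gid] = 0.5f * (tile[lid] + tile[(lid + 1) % n]); // LDS reuse
    }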
AMD Constant Memory Usage Constant memory declarations on AMD GPUs are only beneficial for the following access patterns. Direct-addressing patterns: for non-array constant values where the address is known up front. Same-index patterns: when all work-items reference the same constant address. Globally scoped constant arrays: arrays that are initialized and globally scoped can use the cache if smaller than 16KB. Cases where each work-item accesses different indices are not cached and deliver the same performance as a global memory read. Source: AMD Accelerated Parallel Processing OpenCL Programming Guide 12 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
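A sketch of the cached same-index pattern: every work-item reads the same __constant element at the same time, which is the broadcast the constant cache accelerates (the filter itself is illustrative):

    __constant float coeff[4] = { 0.1f, 0.4f, 0.4f, 0.1f };

    __kernel void fir4(__global const float *in, __global float *out)
    {
        int gid = (int)get_global_id(0);
        float acc = 0.0f;
        for (int k = 0; k < 4; k++)        // k is uniform across work-items
            acc += coeff[k] * in[gid + k]; // same constant index everywhere
        out[gid] = acc;                    // 'in' assumed padded by 3 floats
    }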
Nvidia GPUs - Fermi Architecture GTX 480 - Compute 2.0 capability: 15 cores or Streaming Multiprocessors (SMs), each SM featuring 32 CUDA processors, for 480 CUDA processors in total; global memory with ECC. [Figure: Fermi SM block diagram, showing the instruction cache, two warp schedulers with dispatch units, a 32768 x 32-bit register file, 32 cores, load/store (LDST) units, special function units (SFUs), the interconnect, and 64KB of L1 cache / shared memory backed by L2 cache; each CUDA core has a dispatch port, operand collector, FP and integer units, and a result queue.] Source: NVIDIA’s Next Generation CUDA Architecture Whitepaper 13 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Nvidia GPUs – Fermi Architecture An SM executes threads in groups of 32 called warps, with two warp issue units per SM. Concurrent kernel execution: multiple kernels execute simultaneously to improve efficiency. A CUDA core consists of a single integer ALU and a floating-point unit (FPU). [Figure: Fermi SM block diagram, as on the previous slide.] Source: NVIDIA’s Next Generation CUDA Compute Architecture Whitepaper 14 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
SIMT and SIMD SIMT denotes scalar instructions with multiple threads sharing an instruction stream. The hardware determines instruction stream sharing across ALUs, e.g. NVIDIA GeForce (“SIMT” warps) and AMD Radeon architectures (“wavefronts”), where all the threads in a warp/wavefront proceed in lockstep. Divergence between threads is handled using predication. SIMT instructions specify the execution and branching behavior of a single thread. SIMD instructions expose the vector width explicitly, e.g. explicit vector instructions like x86 SSE. 15 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
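The contrast in code: with SSE the 4-wide vector operation is explicit in the source, while the OpenCL kernel is written as scalar per-thread code and the hardware groups the threads into warps/wavefronts. Two illustrative fragments, one per model:

    /* SIMD: the vector width (4 floats) is explicit in the instruction. */
    #include <xmmintrin.h>
    void add4(const float *a, const float *b, float *c)
    {
        __m128 va = _mm_loadu_ps(a);           // load 4 lanes
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_add_ps(va, vb));  // one add, 4 lanes
    }

    /* SIMT: scalar code per thread; lockstep grouping is the hardware's job. */
    __kernel void vadd(__global const float *a, __global const float *b,
                       __global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];
    }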
SIMT Execution Model ALUs all execute the same instruction. Pipelining is used to break an instruction into phases. When the first instruction completes (4 cycles here), the next instruction is ready to execute. [Figure: a wavefront issuing Add and Mul instructions across the SIMD width over successive cycles.] 16 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Nvidia Memory Hierarchy The L1 cache per SM is configurable to support shared memory and caching of global memory: either 48 KB shared / 16 KB L1 cache or 16 KB shared / 48 KB L1 cache. Data is shared between work-items of a group using shared memory. Each SM has a bank of 32768 registers. The L2 cache (768KB) services all operations (load, store, and texture) and provides a unified path to global memory for loads and stores. [Figure: per-thread-block registers and L1 cache / shared memory, backed by the L2 cache and global memory.] 17 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
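Host code can ask how much of this hierarchy a device exposes; a sketch using the standard device-info queries (the device handle is assumed to exist). On the CUDA side, the 16/48 KB split is selected per kernel with cudaFuncSetCacheConfig.

    #include <stdio.h>
    #include <CL/cl.h>

    void print_mem_sizes(cl_device_id device)
    {
        cl_ulong local_bytes, cache_bytes;
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_bytes), &local_bytes, NULL);
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                        sizeof(cache_bytes), &cache_bytes, NULL);
        printf("local (shared) memory: %llu bytes, global cache: %llu bytes\n",
               (unsigned long long)local_bytes,
               (unsigned long long)cache_bytes);
    }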
Nvidia Memory Model in OpenCL As with AMD, a subset of the hardware memory is exposed in OpenCL. The configurable shared memory is usable as local memory; local memory is used to share data between work-items of a work-group at lower latency than global memory. Private memory utilizes registers per work-item. [Figure: the OpenCL memory model, with private memory per work-item, local memory per compute unit, a global/constant memory data cache, and compute-device global memory.] 18 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Cell Broadband Engine Developed by Sony, Toshiba, and IBM; transitioned from embedded platforms into HPC via the Playstation 3. OpenCL drivers are available for Cell Bladecenter servers. Consists of a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs); uses the IBM XL C for OpenCL compiler. [Figure: Cell BE block diagram: a PowerPC PPE with L1 and L2 caches, four SPEs each containing an SPU and a 256KB local store (LS), a memory and interrupt controller, and the element interconnect (~200 GBps, with ~25 GBps links).] Source: http://www.alphaworks.ibm.com/tech/opencl 19 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Cell BE and OpenCL The Cell Power/VMX CPU is used as a CL_DEVICE_TYPE_CPU; the Cell SPUs appear as CL_DEVICE_TYPE_ACCELERATOR. The number of compute units on an SPU accelerator device is <= 16, with a local memory size <= 256KB: the 256K of local storage is divided among the OpenCL kernel, an 8KB global data cache, and local, constant, and private variables. The OpenCL accelerator devices and the OpenCL CPU device share a common memory bus. Provides extensions like “Device Fission” and “Migrate Objects” to specify where an object resides (discussed in Lecture 10). No support for OpenCL images, sampler objects, atomics, or byte-addressable memory. Source: http://www.alphaworks.ibm.com/tech/opencl 20 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
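The SPUs are reached through the standard device query; a sketch (the platform handle is assumed to exist):

    // The SPUs enumerate as accelerators; the PPE appears separately
    // under CL_DEVICE_TYPE_CPU.
    cl_device_id spu_dev;
    cl_uint n_devs = 0;
    cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR,
                                1, &spu_dev, &n_devs);
    if (err == CL_SUCCESS && n_devs > 0) {
        cl_ulong ls_bytes;  // expect <= 256KB of local store here
        clGetDeviceInfo(spu_dev, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(ls_bytes), &ls_bytes, NULL);
    }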
An Optimal GPGPU Kernel From the discussion on hardware we see that an ideal kernel for a GPU: has thousands of independent pieces of work, uses all available compute units, and allows interleaving for latency hiding; is amenable to instruction stream sharing, mapping to SIMD execution by preventing divergence between work-items; and has high arithmetic intensity, i.e. the ratio of math operations to memory accesses is high, so it is not limited by memory bandwidth. Note that these guidelines apply to all GPUs. 21 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
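A toy kernel with these properties: every work-item is independent, the trip count is uniform so there is no divergence, and there are many math operations per memory access (the iteration count is illustrative):

    #define ITERS 64
    // ~2*ITERS flops per one load and one store, with identical control
    // flow for every work-item.
    __kernel void mad_heavy(__global const float *in, __global float *out)
    {
        size_t gid = get_global_id(0);
        float x = in[gid];                 // single global read
        float acc = 0.0f;
        for (int i = 0; i < ITERS; i++)    // uniform bound: no divergence
            acc = acc * 1.0001f + x;       // multiply-add chain
        out[gid] = acc;                    // single global write
    }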

OpenCL Compilation System LLVM (Low Level Virtual Machine): kernels are compiled to LLVM IR. LLVM is an open-source compiler platform, OS independent, with multiple back ends (http://llvm.org). [Figure: compilation flow from an OpenCL compute program through the LLVM front-end to LLVM IR, then to the AMD CAL IL, x86, or Nvidia PTX back ends.] 22 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
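Runtime compilation is what the programmer actually touches; a minimal host-side sketch (the context, device, and source string src are assumed to exist, and the kernel name is hypothetical):

    #include <CL/cl.h>

    // Hand OpenCL C source to the vendor compiler at runtime; it is
    // lowered (through LLVM IR) to the device back end: IL, PTX, or x86.
    cl_int err;
    cl_program prog = clCreateProgramWithSource(context, 1, &src, NULL, &err);
    err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);  // compiler output
    }
    cl_kernel k = clCreateKernel(prog, "my_kernel", &err);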
Installable Client Driver The ICD allows multiple implementations to co-exist: code links only against libOpenCL.so, and the application selects an implementation at runtime. The current GPU driver model does not easily allow multiple devices across manufacturers. clGetPlatformIDs() and clGetPlatformInfo() examine the list of available implementations and select a suitable one. [Figure: the application links libOpenCL.so, which dispatches to vendor libraries such as atiocl.so or Nvidia's OpenCL driver.] 23 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
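A sketch of that selection step: enumerate the installed implementations through the ICD loader and pick one by name (preferring AMD here is just an example):

    #include <string.h>
    #include <CL/cl.h>

    cl_uint n = 0;
    clGetPlatformIDs(0, NULL, &n);        // count installed implementations
    cl_platform_id plats[8];
    clGetPlatformIDs(n < 8 ? n : 8, plats, NULL);
    for (cl_uint i = 0; i < n && i < 8; i++) {
        char name[256];
        clGetPlatformInfo(plats[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        if (strstr(name, "AMD")) {
            /* use plats[i] to build a context */
        }
    }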
Summary We have examined different many-core platforms and how they map onto the OpenCL spec. An important take-away is that even though vendors have implemented the spec differently, the underlying ideas a programmer uses to obtain performance remain consistent. We have looked at the runtime compilation model for OpenCL to understand how programs and kernels for compute devices are created at runtime, and at the ICD to understand how an OpenCL application can choose an implementation at runtime. Next lecture: moving data to a compute device, and some simple but complete OpenCL examples. 24 Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011

Editor's Notes

  1. This point is important because this would be a common question while teaching an open, platform-agnostic programming model
  2. Basic computer architecture points for multicore CPUs
  3. High-level, 10,000-foot view of what a GPU looks like, irrespective of whether it's AMD's or Nvidia's
  4. Very AMD specific discussion of low-level GPU architecture
  5. Very AMD-specific discussion of low-level GPU architecture. Discusses a single SIMD engine and the stream cores
  6. We converge back to the OpenCL terminology to understand how the AMD GPU maps onto the OpenCL processing elements
  7. A brief overview of the AMD GPU memory architecture (As per Evergreen Series)
  8. The mapping of the AMD GPU memory components to the OpenCL terminology. Architecturally this is similar for AMD and Nvidia, except that each has its own vendor-specific names. Similar types of memory are mapped to local memory for both AMD and Nvidia
  9. Summary of the usage of constant memory. Important because there is a restricted set of cases where there is a hardware-provided performance boost when using constant memory. This will have greater context with a complete example, which comes later. This slide is added in case someone is reading this while optimizing an application and needs device-specific details
  10. Nvidia “Fermi” Architecture, High level overview.
  11. Architectural highlights of an SM in a Fermi GPU. Mention the scalar nature of a CUDA core, unlike AMD's VLIW architecture
  12. The SIMT execution model of GPU threads. SIMD specifies vector width, as in SSE. However, the SIMT execution model doesn't necessarily need to know the number of threads in a warp for an OpenCL program. The concept of a warp/wavefront is not within OpenCL
  13. The SIMT Execution mode which shows how different threads execute the same instruction
  14. Nvidia-specific GPU memory architecture. The main highlight is the configurable L1 : shared size ratio. L2 is not exposed in the OpenCL specification
  15. Similar to AMD in the sense that the low-latency memory, which is the shared memory, becomes OpenCL local memory
  16. Brief introduction on the Cell
  17. Brief overview of how the Cell's memory architecture maps to OpenCL. For usage of the Cell in specific applications a high-level view is given, and Lec 10 discusses its special extensions. Optimizations in Lec 6-8 do not apply to the Cell because of its very different architecture
  18. Discusses an optimal kernel to show how, irrespective of the different underlying architectures, an optimal program for both AMD and Nvidia would have similar characteristics
  19. Explains how platform-agnostic OpenCL code is mapped to a device-specific Instruction Set Architecture.
  20. The ICD is added in order to explain how we can interface different OpenCL implementations with a similar compilation tool-chain