GPU Algorithms and Trends
Presentation, Mid 2018
Contents
• Why GPU ?
• Evolution of the GPU and its Programming models
• Typical Algorithms
• Image processing, Image Analysis, DB, VR, Graphics + Compute, Crypto
• Deep learning
• Bandwidth/ performance analysis tools
• Trends in GPU algorithms
• A journey, not a deal
Why GPU ?
Graphics Hardware Landscape
What does the GPU do ?
• Efficient Graphics processing
• High quality – Advanced shaders (Programmable)
• High efficiency – Discard unwanted pixels (Hardware)
• Co-processing with CPU
• Goals for the 2018+ Graphics Processor and beyond
• How can we keep the CPU at 0%, and the GPU 100% ?
• In other words, keep the GPU data-saturated, not data-starved
A historical perspective
• Embedded Graphics GPUs (and APIs pre-Vulkan)
• Non-existent communication between different blocks (Blackbox)
• Non-existent heterogeneity between CPU and GPU
• GPU output optimised only for display scanout (exceptions – video streaming ..)
• Desktop GPUs (and APIs)
• Focused on Higher quality via Programmability
• Driven largely by Microsoft DirectX APIs, followed by OpenGL
From Pixels to FLOPs
• The need to control individual blocks and manage individual contexts better on Desktop APIs led to more and more programmable cores
• Less load on (dynamic) drivers, more (one-time) load on application
• Target 0% CPU API overhead, 100% GPU loading
HW architecture advances
• Graphics
• HDR, HEVC, Data compression
• Low-latency pre-emption
• Application specific HW units
• HDR
• Multi-GPU architectures (SLI, ..)
• Compute
• CUDA core micro-arch
• Memory hierarchies
• Thread-level pre-emption
• Common
• GDDR5 advances
• Memory controllers, interconnects
• Clocks, micro arch
• Board designs
• Fan/ Thermal designs/ Noise considerations
Compute Hardware Advances (continued)
Deep learning Hardware Landscape
https://www.forbes.com/sites/moorinsights/2017/03/03/a-machine-learning-landscape-where-amd-intel-nvidia-qualcomm-and-xilinx-ai-engines-live/#7c1a12fc742f
• “.. Understanding C Major took me 27 years …” - Illayaraja, composer
• AIVA Technologies' AI composer is available today
Programming Models
• Languages
• CUDA
• Native language acceleration
• Numba
• C++ AMP
• New on the GPU
• Branching !
• Exceptions !
High level comparison
Power performance
• Nvidia power/performance
Power and Area
• Integrated chipsets
• AMD Llano, 32 nm
• Discrete Nvidia GPU Area
Quick introduction to GPU Programming
Organisation of the code
• Main.cpp
• main()
• Timing measurement code (cudaEvent*..)
• CUDA Acceleration code - Kernel.cu
• Kernel wrappers
• CudaMem allocations
• Grid/block calculations
• Kernel calls
• Actual kernel
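A minimal sketch of this layout, assuming a trivial scaling kernel. The names scaleKernel and scaleOnGpu, the block size of 256 and the -1 error convention are illustrative choices, not taken from the talk. The wrapper shows the allocation, the grid/block calculation, the kernel call and the cudaEvent timing in one place:

```cuda
#include <vector>
#include <cuda_runtime.h>   // declares cudaEvent_t, cudaMalloc, cudaMemcpy, ...

// "Actual kernel": each thread scales one element.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // the last block may be only partly used
        data[i] *= factor;
    }
}

// "Kernel wrapper" (would live in Kernel.cu): allocation, grid/block
// calculation, kernel call and timing. Returns the kernel time in
// milliseconds, or -1 if allocation or the copy back failed.
float scaleOnGpu(std::vector<float>& host, float factor)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;     // nothing to launch
    const size_t bytes = n * sizeof(float);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int blockSize = 256;                              // assumed value
    const int gridSize = (n + blockSize - 1) / blockSize;   // round up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scaleKernel<<<gridSize, blockSize>>>(dev, factor, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

Main.cpp would then only declare scaleOnGpu and call it from main(), keeping all CUDA syntax inside the .cu file.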
Moving an algorithm to GPU – Tips
• C++ file or .cu file ?
• 'cudaEvent_t': undeclared identifier
• Include cuda_runtime.h, not just cuda.h
• Dreaded - “0x4 unspecified launch failure”
• cudaOccupancyMaxPotentialBlockSize
• GridSize
• BlockSize
• Tool for memory bug-checks
• cuda-memcheck.exe
• %PROGFILES%\NVIDIA GPU Computing Toolkit\CUDA\vx.y\bin
• ========= Invalid __global__ write of size 4
• ========= at 0x000006b0 in ….
• CUDA errors can be resident (sticky), so be aware of the API behaviour
• Errors reported by some APIs, e.g. cudaThreadSynchronize() (deprecated in favour of cudaDeviceSynchronize()), can be errors from previous asynchronous calls!
• 1D large arrays (e.g. 1M entries) have issues
• Move to 2D
• Each kernel composed of “a Grid of Blocks of Threads”
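One way to tell a failed launch from an error left over from earlier asynchronous work is to check twice after each launch: cudaGetLastError() for the launch itself, then a synchronising call for execution errors. The sketch below assumes this pattern. cudaOk, fillKernel and fillChecked are hypothetical names, and cudaDeviceSynchronize() is used in place of the deprecated cudaThreadSynchronize():

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the call succeeded, otherwise prints
// the CUDA error string together with a label for the failing step.
inline bool cudaOk(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void fillKernel(int* out, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches fillKernel and checks the two error sources separately.
// Assumes n > 0 and blockSize > 0.
bool fillChecked(int* dev, int value, int n, int blockSize)
{
    const int gridSize = (n + blockSize - 1) / blockSize;
    fillKernel<<<gridSize, blockSize>>>(dev, value, n);

    // 1) Launch errors (e.g. an invalid block size) are visible right away.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;

    // 2) Execution errors only surface once the device is synchronised.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

A block size above the device limit (for example 4096) should fail at step 1 without poisoning later launches, whereas an out-of-bounds write inside the kernel would be reported at step 2 and by later API calls.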
Specifying shared memory
• Static
• Declared in kernel
• Dynamic shared memory allocation
• The shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter in the kernel call
• myKernel <<<grids,blocks, memsizeBytes>>>();
• How to synchronise shared memory accesses across threads ?
• __syncthreads() in the kernel
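As an illustration of dynamic shared memory and __syncthreads(), here is a sketch of a per-block sum. The kernel, the wrapper sumOnGpu and the block size of 256 (assumed to be a power of two) are illustrative assumptions, and error checking is left out for brevity:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each block sums its slice of `in` into out[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void blockSumKernel(const float* in, float* out, int n)
{
    // Size is set by the third <<<...>>> parameter at launch time.
    extern __shared__ float tile[];

    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();             // all loads done before anyone reads

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();         // finish this step before the next one
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Host wrapper: launches the kernel and adds the per-block results on the CPU.
float sumOnGpu(const std::vector<float>& host)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;

    const int blockSize = 256;                              // power of two
    const int gridSize = (n + blockSize - 1) / blockSize;
    const size_t sharedBytes = blockSize * sizeof(float);   // in bytes

    float* devIn = nullptr;
    float* devOut = nullptr;
    cudaMalloc(&devIn, n * sizeof(float));
    cudaMalloc(&devOut, gridSize * sizeof(float));
    cudaMemcpy(devIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<gridSize, blockSize, sharedBytes>>>(devIn, devOut, n);

    std::vector<float> partial(gridSize);
    cudaMemcpy(partial.data(), devOut, gridSize * sizeof(float),
               cudaMemcpyDeviceToHost);   // also waits for the kernel
    cudaFree(devIn);
    cudaFree(devOut);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The __syncthreads() calls sit outside the `if (tid < stride)` branch so that every thread in the block reaches them.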
GPU Profiling
• cudaStreamCreate – startup can take several seconds
• General DNNs – more Host to Device transfers than Device to Host
• Yolov3 analysis:
• 23% in add-bias
• 16% in shortcut
• 15% in normalize
• 12% in fill
• C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\libnvvp
• Using CUDNN for batch-norm reduces this time by about 20%
After profiling:
• Moving BatchNorm to CUDNN results in a 50% reduction
• 27% shortcut kernel
• 19% fill kernel
• 13% activate array
• 13% copy
Improving performance
• 3 fundamental steps
• Profile
• Profile
• Profile
• Bottlenecks ?
• Read data
• Write data
• Compute
• Avoid stalls - utilize internal memory judiciously
• Memory transfer and computation should be done in parallel
• Increase utilization – Occupancy
• Utilise helper APIs “cudaOccupancyMaxPotentialBlockSize”
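A short sketch of how the occupancy helper might be used to pick a launch configuration. addBiasKernel and suggestLaunch are hypothetical names, and the kernel is only a stand-in, not the Yolo add-bias kernel from the profile:

```cuda
#include <cuda_runtime.h>

// Stand-in kernel whose resource usage the occupancy helper inspects.
__global__ void addBiasKernel(float* data, float bias, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += bias;
}

// Asks the runtime for a block size that maximises occupancy for
// addBiasKernel, then derives a grid size that covers n elements.
// Returns false if the query fails.
bool suggestLaunch(int n, int* gridSize, int* blockSize)
{
    int minGridSize = 0;   // smallest grid that can reach full occupancy
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, blockSize, addBiasKernel,
        0 /* dynamic shared memory bytes */, 0 /* no block size limit */);
    if (err != cudaSuccess || *blockSize <= 0) return false;

    *gridSize = (n + *blockSize - 1) / *blockSize;   // round up
    return true;
}
```

The suggested block size maximises theoretical occupancy only, so it is a starting point to confirm with the profiler.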
CUDA and CUDNN
• CUDNN is a library of functions, built using the CUDA API
• Focused on Neural networks
• Downloaded separately from the CUDA Toolkit
• What performance improvement does it bring ?
• Yolo with different options
Yolo – with different options (Tegra TK1)
YOLOv2 Inference Time (Seconds) - Tegra TK1
• CPU: 39
• CUDA: 0.53
• CUDNN: 0.01
Algorithms on GPU
Data exploration
• 4 free parameters – Can model an elephant
• http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
Medicine – Drug discovery
• AtomNet - structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. (Atomwise company)
• Applies the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions
Segmentation – Ex: Tumors in Pancreas images
• Small organ segmentation
• Recurrent Saliency Transformation Network. The key innovation is a saliency transformation module, which repeatedly converts the segmentation probability map from the previous iteration into spatial weights and applies these weights to the current iteration
Challenges – Availability of training data
• Significant challenge in object detection
• Why ?
• Solution - Synthetic data
• Image augmentation
• Lighting, transformations, transparency
• euclidaug
• Ray tracing
• Completely under our control
Challenges - Latency of Algorithms on GPU
• How to profile ? What tools ?
• Typical Graphics latencies
• VR example, framebuffer, display relation
• Compute - Average inference latency of Inception v2 with TF 1.5
• 33ms with batch size of 1
• 540ms with batch size of 32
• “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed”
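The two Inception v2 figures above illustrate the latency versus throughput trade-off of batching. The host-side sketch below (plain C++ that also compiles in a .cu file; the function names are illustrative) works through the arithmetic:

```cuda
// Host-side arithmetic only; no GPU is needed to run this.

// Average time spent per image when a whole batch takes batchMs.
double perImageMs(double batchMs, int batchSize)
{
    return batchMs / batchSize;
}

// Images processed per second at a given batch latency.
double imagesPerSecond(double batchMs, int batchSize)
{
    return 1000.0 * batchSize / batchMs;
}
```

With the quoted numbers, batch size 32 brings the average cost per image down from 33 ms to about 16.9 ms (roughly 59 images/s instead of 30), but every request in the batch waits the full 540 ms.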
Emerging – Compute-In-Flash
• Syntiant, Mythic Analog NN Implementation on Flash
http://www.calit2.uci.edu/uploads/Media/Text/HOLLEMAN.pdf
Emerging – DL and Operating Systems
• Windows
• Linaro
• Intelligence is not a single thing
• A group of intelligences working together
• Attention, reasoning, processing speed, movement
• Information and Intelligence not always visual !!
Conclusion
• Religion and Spirituality
• Future trends
• “near-chip-memory”
• Better atomics
• Process technologies
• Truly heterogeneous multi-core architectures
From IBM
Netscope
• http://ethereon.github.io/netscope/quickstart.html
• Tool for visualizing neural network architectures (or technically, any directed acyclic graph). It currently supports Caffe's prototxt format.
Arrow - GDF
Visualisation H20 – from VW talk on analytics
• https://www.youtube.com/watch?v=-mBg-lFz5fQ
• VW – Use GPU for both – analysis+queries
What are we creating AI for ?
• Intelligence on earth
• Intelligence outside earth
• Space travel under 0-gravity
• Cardiovascular deterioration
• Decalcification
• Demineralisation of bones
• Muscular fitness
• Demineralisation recovery time high, perhaps not recoverable
• Reconnaissance missions

Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

GPU Algorithms and trends 2018

  • 1. GPU Algorithms and Trends Presentation, Mid 2018
  • 2. Contents • Why GPU ? • Evolution of the GPU and its Programming models • Typical Algorithms • Image processing, Image Analysis, DB, VR, Graphics + Compute, Crypto • Deep learning • Bandwidth/ performance analysis tools • Trends in GPU algorithms • A journey, not a deal
  • 5. What does the GPU do ? • Efficient Graphics processing • High quality – Advanced shaders (Programmable) • High efficiency – Discard unwanted pixels (Hardware) • Co-processing with CPU • Goals for the 2018+ Graphics Processor and beyond • How can we keep the CPU at 0%, and the GPU 100% ? • In other words, keep data-saturated, not data-starved
  • 6. A historical perspective • Embedded Graphics GPUs (and APIs pre-Vulkan) • Non-existent communication between different blocks (Blackbox) • Non-existent heterogeneity between CPU and GPU • GPU output optimised only for display scanout (exceptions – video streaming ..) • Desktop GPUs (and APIs) • Focused on Higher quality via Programmability • Driven largely by Microsoft DirectX APIs, followed by OpenGL
  • 7. From Pixels to FLOPs • The need to control individual blocks and manage individual contexts better on Desktop APIs led to more and more programmable cores • Less load on (dynamic) drivers, more (one-time) load on application • Target 0% CPU API overhead, 100% GPU loading
  • 8. HW architecture advances • Graphics • HDR, HEVC, Data compression • Low-latency pre-emption • Application specific HW units • HDR • Multi-GPU architectures (SLI, ..) • Compute • CUDA core micro-arch • Memory hierarchies • Thread-level pre-emption • Common • GDDR5 advances • Memory controllers, interconnects • Clocks, micro arch • Board designs • Fan/ Thermal designs/ Noise considerations
  • 10.
  • 11. Deep learning Hardware Landscape https://www.forbes.com/sites/moorinsights/2017/03/03/a-machine-learning-landscape-where-amd-intel-nvidia-qualcomm-and-xilinx-ai-engines-live/#7c1a12fc742f
  • 12. • “.. Understanding C Major took me 27 years …” - Illayaraja, composer • AIVA Technologies' AI composer is available today
  • 13. Programming Models • Languages • CUDA • Native language acceleration • Numba • C++ AMP • New on the GPU • Branching ! • Exceptions !
  • 15. Power performance • Nvidia power/performance
  • 16. Power and Area • Integrated chipsets • AMD Llano, 32 nm • Discrete Nvidia GPU Area
  • 17. Quick introduction to GPU Programming
  • 18. Organisation of the code • Main.cpp • main() • Timing measurement code (cudaEvent*..) • CUDA Acceleration code - Kernel.cu • Kernel wrappers • CudaMem allocations • Grid/block calculations • Kernel calls • Actual kernel
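The layout on this slide can be sketched as a single kernel plus a host-side wrapper. This is a minimal illustration, not the presenter's code: the names `addKernel`, `addOnGpu` and `CUDA_TRY` are hypothetical, the block size of 256 is an arbitrary choice, and `n` is assumed to be greater than zero. The wrapper is what `main()` in Main.cpp would call.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: leave the wrapper on the first failing CUDA call.
// Sketch only: the early return leaks anything already allocated.
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Actual kernel: one thread per output element.
__global__ void addKernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Kernel wrapper: allocations, grid/block calculation, kernel call.
// Returns the kernel time in milliseconds, or -1 on any CUDA error.
float addOnGpu(const float* hA, const float* hB, float* hOut, int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;

    // Device allocations and host-to-device copies.
    CUDA_TRY(cudaMalloc(&dA, bytes));
    CUDA_TRY(cudaMalloc(&dB, bytes));
    CUDA_TRY(cudaMalloc(&dOut, bytes));
    CUDA_TRY(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CUDA_TRY(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    // Grid/block calculation: enough blocks to cover all n elements.
    const int blockSize = 256;
    const int gridSize = (n + blockSize - 1) / blockSize;

    // Timing measurement with cudaEvent*.
    cudaEvent_t start, stop;
    CUDA_TRY(cudaEventCreate(&start));
    CUDA_TRY(cudaEventCreate(&stop));
    CUDA_TRY(cudaEventRecord(start));
    addKernel<<<gridSize, blockSize>>>(dA, dB, dOut, n);
    CUDA_TRY(cudaGetLastError());          // catches a bad launch configuration
    CUDA_TRY(cudaEventRecord(stop));
    CUDA_TRY(cudaEventSynchronize(stop));  // wait until the kernel has finished
    float ms = 0.0f;
    CUDA_TRY(cudaEventElapsedTime(&ms, start, stop));

    CUDA_TRY(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return ms;
}
```

Keeping every CUDA call inside the wrapper means Main.cpp can stay a plain C++ file that only sees the `addOnGpu` declaration.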
  • 19. Moving an algorithm to GPU – Tips • C++ file or .cu file ? • 'cudaEvent_t': undeclared identifier • Include cuda_runtime.h, not just cuda.h • Dreaded - “0x4 unspecified launch failure” • cudaOccupancyMaxPotentialBlockSize • GridSize • BlockSize • Tool for memory bug-checks • cuda-memcheck.exe • %PROGFILES%\NVIDIA GPU Computing Toolkit\CUDA\vx.y\bin • ========= Invalid __global__ write of size 4 • ========= at 0x000006b0 in …. • CUDA errors can persist across calls, so be aware of the API behaviour • Errors reported by some APIs, e.g. cudaThreadSynchronize(), are errors from earlier calls ! • Large 1D arrays (e.g. 1M entries) have issues • Move to 2D • Each kernel is composed of “a Grid of Blocks of Threads”
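Two of the tips above, choosing the block size with `cudaOccupancyMaxPotentialBlockSize` and watching for errors that surface on a later call, can be sketched together. The kernel `scaleKernel` and wrapper `scaleOnGpu` are hypothetical names used only for illustration. The sketch uses `cudaDeviceSynchronize()`, the current replacement for the deprecated `cudaThreadSynchronize()` named on the slide.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiply each element by a factor.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches scaleKernel on device memory dData and returns the first error seen.
cudaError_t scaleOnGpu(float* dData, float factor, int n) {
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize = 0;    // block size suggested by the runtime
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, scaleKernel, 0, 0);
    if (err != cudaSuccess) return err;

    // GridSize: round up so that every element gets a thread.
    const int gridSize = (n + blockSize - 1) / blockSize;
    scaleKernel<<<gridSize, blockSize>>>(dData, factor, n);

    // Launch-time errors (for example a bad configuration) show up here.
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Errors raised while the kernel runs are only reported once the host
    // waits for it; without this wait, a later unrelated API call would
    // return them instead.
    return cudaDeviceSynchronize();
}
```

Checking immediately after the launch and again after the synchronisation makes it easier to attribute a failure to the kernel that caused it.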
  • 20. Specifying shared memory • Static • Declared in kernel • Dynamic shared memory allocation • The shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter in the kernel call • myKernel <<<grids,blocks, memsizeBytes>>>(); • How to synchronise shared memory accesses across threads ? • __syncthreads() in the kernel
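A block-level sum is one way to see dynamic shared memory and `__syncthreads()` working together. The sketch below is an assumption-laden illustration rather than code from the talk: `blockSumKernel` and `sumOnGpu` are hypothetical names, the block size is fixed at 256 (the halving loop needs a power of two), `n` is assumed to be greater than zero, and error checks are omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums its slice of the input into one value.
__global__ void blockSumKernel(const float* in, float* blockSums, int n) {
    // Dynamic shared memory: the size comes from the third launch parameter.
    extern __shared__ float tile[];

    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must land before any thread reads tile[]

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();  // finish this step before starting the next
    }
    if (tid == 0) blockSums[blockIdx.x] = tile[0];
}

// Host wrapper: returns the sum of hIn[0..n). Error checks omitted.
float sumOnGpu(const float* hIn, int n) {
    const int threads = 256;  // power of two, required by the loop above
    const int blocks = (n + threads - 1) / threads;
    const size_t sharedBytes = threads * sizeof(float);  // per thread block

    float *dIn = nullptr, *dSums = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dSums, blocks * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, threads, sharedBytes>>>(dIn, dSums, n);

    // The blocking copy also waits for the kernel to finish.
    std::vector<float> sums(blocks);
    cudaMemcpy(sums.data(), dSums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSums);

    float total = 0.0f;
    for (float s : sums) total += s;  // final pass over per-block results
    return total;
}
```

The `__syncthreads()` inside the loop sits outside the `if`, so every thread in the block reaches it; placing it inside a divergent branch would be undefined behaviour.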
  • 21. GPU Profiling • cudaStreamCreate – startup takes several seconds • General DNNs – more Host to Device transfers than Device to Host • Yolov3 analysis: • 23% in add-bias • 16% in shortcut • 15% in normalize • 12% in fill • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\libnvvp • Using CUDNN for batch-norm reduces this time by about 20% • After profiling: moving BatchNorm to CUDNN results in a 50% reduction • 27% shortcut kernel • 19% fill kernel • 13% activate array • 13% copy
  • 22. Improving performance • 3 fundamental steps • Profile • Profile • Profile • Bottlenecks ? • Read data • Write data • Compute • Avoid stalls - utilize internal memory judiciously • Memory transfer and computation should be done in parallel • Increase utilization – Occupancy • Utilise helper APIs “cudaOccupancyMaxPotentialBlockSize”
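The advice to run memory transfers and computation in parallel is usually realised with CUDA streams and pinned host memory. The following is a minimal sketch under stated assumptions: `squareKernel` and `squareWithStreams` are hypothetical names, two streams are an arbitrary choice, and the host buffer must have been allocated with `cudaMallocHost` for the asynchronous copies to overlap with kernels. Most per-call error checks are omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: square each element in place.
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Squares hPinned[0..n) in place, splitting the work across two streams so
// that one chunk's copy can overlap with the other chunk's kernel.
cudaError_t squareWithStreams(float* hPinned, int n) {
    const int kStreams = 2;
    cudaStream_t streams[kStreams];
    float* dData = nullptr;

    cudaError_t err = cudaMalloc(&dData, static_cast<size_t>(n) * sizeof(float));
    if (err != cudaSuccess) return err;
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int chunk = (n + kStreams - 1) / kStreams;
    for (int s = 0; s < kStreams; ++s) {
        const int offset = s * chunk;
        const int remaining = n - offset;
        if (remaining <= 0) break;
        const int count = remaining < chunk ? remaining : chunk;
        const size_t bytes = static_cast<size_t>(count) * sizeof(float);

        // Copy in, compute, copy out: ordered within one stream,
        // free to overlap with the work queued on the other stream.
        cudaMemcpyAsync(dData + offset, hPinned + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        squareKernel<<<(count + 255) / 256, 256, 0, streams[s]>>>(
            dData + offset, count);
        cudaMemcpyAsync(hPinned + offset, dData + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    err = cudaDeviceSynchronize();  // wait for both streams, collect errors
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dData);
    return err;
}
```

Whether the overlap actually happens depends on the device's copy engines, so the timeline view in a profiler is the way to confirm it.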
  • 23. CUDA and CUDNN • CUDNN is a library of functions, built using the CUDA API • Focused on Neural networks • Downloaded separately from the CUDA toolkit • What performance improvement does it bring ? • Yolo with different options
  • 24. Yolo – with different options (Tegra TK1) • Chart: YOLOv2 Inference Time (Seconds) - Tegra TK1 • CPU 39 • CUDA 0.53 • CUDNN 0.01
  • 26. Data exploration • 4 free parameters – Can model an elephant • http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization
  • 27. Medicine – Drug discovery • AtomNet - structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. (Atomwise company) • apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions
  • 28. Segmentation – e.g. tumors in pancreas images • Small organ segmentation • Recurrent Saliency Transformation Network. The key innovation is a saliency transformation module, which repeatedly converts the segmentation probability map from the previous iteration into spatial weights and applies these weights to the current iteration
  • 29. Challenges – Availability of training data • Significant challenge in object detection • Why ? • Solution - Synthetic data • Image augmentation • Lighting, transformations, transparency • euclidaug • Ray tracing • Completely under our control
  • 30. Challenges - Latency of Algorithms on GPU • How to profile ? What tools ? • Typical Graphics latencies • VR example, framebuffer, display relation • Compute - Average inference latency of Inception v2 with TF 1.5 • 33ms with batch size of 1 • 540ms with batch size of 32 • “GPU Scheduling on the NVIDIA TX2: Hidden Details Revealed”
  • 31. Emerging – Compute-In-Flash • Syntiant, Mythic Analog NN Implementation on Flash http://www.calit2.uci.edu/uploads/Media/Text/HOLLEMAN.pdf
  • 32. Emerging – DL and Operating Systems • Windows • Linaro
  • 33. • Intelligence is not a single thing • A group of intelligences working together • Attention, reasoning, processing speed, movement • Information and Intelligence not always visual !!
  • 34. Conclusion • Religion and Spirituality • Future trends • “near-chip-memory” • Better atomics • Process technologies • Truly heterogeneous multi-core architectures
  • 36. Netscope • http://ethereon.github.io/netscope/quickstart.html • Tool for visualizing neural network architectures (or technically, any directed acyclic graph). It currently supports Caffe's prototxt format.
  • 38. Visualisation H2O – from VW talk on analytics • https://www.youtube.com/watch?v=-mBg-lFz5fQ • VW – Use GPU for both – analysis+queries
  • 39. What are we creating AI for ? • Intelligence on earth • Intelligence outside earth • Space travel under zero gravity • Cardiovascular deterioration • Decalcification • Demineralisation of bones • Muscular fitness • Demineralisation recovery time high, perhaps not recoverable • Reconnaissance missions