SlideShare une entreprise Scribd logo
1  sur  17
Télécharger pour lire hors ligne
High-Performance GPU
Programming for Deep Learning
7 April 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute.ner va na
High-Performance GPU kernels for deep learning
2
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
Proprietary and confidential. Do not distribute.ner va na
GEMM: Basics
3
C = AB
Proprietary and confidential. Do not distribute.ner va na
GEMM: Memory Load
4
Outer product contiguous Outer product strided
threads
memory load
single tile
batched GEMM
Proprietary and confidential. Do not distribute.ner va na
Batched GEMM tiles 32 x 32
GEMM tile 32 x 64GEMM tile 32 x 32
GEMM: Tile sizes
5
threads
shared memory load
Proprietary and confidential. Do not distribute.ner va na
hGEMM Results - NN
6
Nx3072x3072 NN op
0
1500
3000
4500
6000
32 64 96 128
Nervana 32x32 cuBLAS 128x64
Batch Size (N)
GFLOPS
Proprietary and confidential. Do not distribute.ner va na
hGEMM Results - TN
7
GFLOPS
Nx3072x3072 TN op
0
1500
3000
4500
6000
32 64 96 128
Nervana 32x32 cuBLAS 128x64
Batch Size (N)
Proprietary and confidential. Do not distribute.ner va na
Direct convolution is still relevant
8
• Striding
• Odd-size filters
• Placeholder until faster algo can be implemented
• Often faster for single image or first small C layer
Proprietary and confidential. Do not distribute.ner va na
Direct convolution: implementation details
9
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as upside down fprop
• Update specific optimizations
Proprietary and confidential. Do not distribute.ner va na
Winograd: input transform
10
Input Feature Map
4x4 stride 2
• Input transform
• 2D Winograd is a nested
product of 1D transforms
• Transforms can be
simplified to remove zeros
Proprietary and confidential. Do not distribute.ner va na
Winograd: filter transform
11
• Filter transform
• Same as input but with
different coefficients
• Transform each feature map
independently
Proprietary and confidential. Do not distribute.ner va na
Winograd: batched GEMM
12
Proprietary and confidential. Do not distribute.ner va na
Winograd: output transform
13
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel
space to obtain 2x2 output
tile
Proprietary and confidential. Do not distribute.ner va na 14
Performance: VGG
VGG fp32 - Totals by operation
0
0.5
1
1.5
2
64 32 16 8 4 2 1
Winograd fp32 fprop
Winograd fp32 bprop
Winograd fp32 update
cuDNN fp32 fprop
cuDNN fp32 bprop
cuDNN fp32 update
AlgorithmicSpeedup
Batch Size
Proprietary and confidential. Do not distribute.ner va na
Performance: Alexnet convolutional layers
15
Alexnet Totals
0
0.5
1
1.5
2
128 64 32 16 8 4
Nervana fp16
Nervana fp32
CuBLAS fp16
CuBLAS fp32
Batch Size
AlgorithmicSpeedup
Proprietary and confidential. Do not distribute.ner va na
Compounding
16
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
Compounding inside of GEMM and conv for free:
Proprietary and confidential. Do not distribute.ner va na
Summary
17
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor

Contenu connexe

Tendances

A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
Takahiro Harada
 
Ece512 h1 20139_621386735458ece512_test2_solutions
Ece512 h1 20139_621386735458ece512_test2_solutionsEce512 h1 20139_621386735458ece512_test2_solutions
Ece512 h1 20139_621386735458ece512_test2_solutions
nadia abd
 
xilinx fpga problems
xilinx fpga problemsxilinx fpga problems
xilinx fpga problems
Anish Gupta
 

Tendances (20)

Unit 5 vsp
Unit 5 vspUnit 5 vsp
Unit 5 vsp
 
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
A 2.5D Culling for Forward+ (SIGGRAPH ASIA 2012)
 
Multi core k means
Multi core k meansMulti core k means
Multi core k means
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Ece512 h1 20139_621386735458ece512_test2_solutions
Ece512 h1 20139_621386735458ece512_test2_solutionsEce512 h1 20139_621386735458ece512_test2_solutions
Ece512 h1 20139_621386735458ece512_test2_solutions
 
Gsm attacks
Gsm attacksGsm attacks
Gsm attacks
 
Network simulator 2
Network simulator 2Network simulator 2
Network simulator 2
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
Scaling the #2ndhalf
Scaling the #2ndhalfScaling the #2ndhalf
Scaling the #2ndhalf
 
MIRU2016 invited talk - Recovering Transparent Shape from Time-of-Flight Dist...
MIRU2016 invited talk - Recovering Transparent Shape from Time-of-Flight Dist...MIRU2016 invited talk - Recovering Transparent Shape from Time-of-Flight Dist...
MIRU2016 invited talk - Recovering Transparent Shape from Time-of-Flight Dist...
 
Unite2019 HLOD를 활용한 대규모 씬 제작 방법
Unite2019 HLOD를 활용한 대규모 씬 제작 방법Unite2019 HLOD를 활용한 대규모 씬 제작 방법
Unite2019 HLOD를 활용한 대규모 씬 제작 방법
 
The Internet
The InternetThe Internet
The Internet
 
Multi-Jet Generation -status report-
Multi-Jet Generation -status report-Multi-Jet Generation -status report-
Multi-Jet Generation -status report-
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
 
Experiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRCExperiences with Power 9 at A*STAR CRC
Experiences with Power 9 at A*STAR CRC
 
xilinx fpga problems
xilinx fpga problemsxilinx fpga problems
xilinx fpga problems
 
QR Factorizations and SVDs for Tall-and-skinny Matrices in MapReduce Architec...
QR Factorizations and SVDs for Tall-and-skinny Matrices in MapReduce Architec...QR Factorizations and SVDs for Tall-and-skinny Matrices in MapReduce Architec...
QR Factorizations and SVDs for Tall-and-skinny Matrices in MapReduce Architec...
 
Rules of block diagram
Rules of block diagramRules of block diagram
Rules of block diagram
 
Grincon U.S. 2019 How to Mine Grin
Grincon U.S. 2019 How to Mine GrinGrincon U.S. 2019 How to Mine Grin
Grincon U.S. 2019 How to Mine Grin
 
Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
 

En vedette

GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2
NVIDIA
 
ECCV2010: feature learning for image classification, part 4
ECCV2010: feature learning for image classification, part 4ECCV2010: feature learning for image classification, part 4
ECCV2010: feature learning for image classification, part 4
zukun
 
20160913 gpu deep-learningcomminity-morpho_20160912-公開用rev2
20160913 gpu deep-learningcomminity-morpho_20160912-公開用rev220160913 gpu deep-learningcomminity-morpho_20160912-公開用rev2
20160913 gpu deep-learningcomminity-morpho_20160912-公開用rev2
Tomokazu Kanazawa
 

En vedette (20)

Intel Nervana Artificial Intelligence Meetup 11/30/16
Intel Nervana Artificial Intelligence Meetup 11/30/16Intel Nervana Artificial Intelligence Meetup 11/30/16
Intel Nervana Artificial Intelligence Meetup 11/30/16
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike WangIntroduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
 
Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana Artificial Intelligence Meetup 1/31/17Intel Nervana Artificial Intelligence Meetup 1/31/17
Intel Nervana Artificial Intelligence Meetup 1/31/17
 
Deep Learning at Scale
Deep Learning at ScaleDeep Learning at Scale
Deep Learning at Scale
 
Rethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceRethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligence
 
Introduction to deep learning @ Startup.ML by Andres Rodriguez
Introduction to deep learning @ Startup.ML by Andres RodriguezIntroduction to deep learning @ Startup.ML by Andres Rodriguez
Introduction to deep learning @ Startup.ML by Andres Rodriguez
 
GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2
 
The AI Era Ignited by GPU Deep Learning
The AI Era Ignited by GPU Deep Learning The AI Era Ignited by GPU Deep Learning
The AI Era Ignited by GPU Deep Learning
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
20161122 gpu deep_learningcommunity#02
20161122 gpu deep_learningcommunity#0220161122 gpu deep_learningcommunity#02
20161122 gpu deep_learningcommunity#02
 
ECCV2010: feature learning for image classification, part 4
ECCV2010: feature learning for image classification, part 4ECCV2010: feature learning for image classification, part 4
ECCV2010: feature learning for image classification, part 4
 
Artificial general intelligence research project at Keen Software House (3/2015)
Artificial general intelligence research project at Keen Software House (3/2015)Artificial general intelligence research project at Keen Software House (3/2015)
Artificial general intelligence research project at Keen Software House (3/2015)
 
Deep learning tutorial (i)
Deep learning tutorial (i)Deep learning tutorial (i)
Deep learning tutorial (i)
 
20160913 gpu deep-learningcomminity-morpho_20160912-公開用rev2
20160913 gpu deep-learningcomminity-morpho_20160912-公開用rev220160913 gpu deep-learningcomminity-morpho_20160912-公開用rev2
20160913 gpu deep-learningcomminity-morpho_20160912-公開用rev2
 
Common Design of Deep Learning Frameworks
Common Design of Deep Learning FrameworksCommon Design of Deep Learning Frameworks
Common Design of Deep Learning Frameworks
 
Video Activity Recognition and NLP Q&A Model Example
Video Activity Recognition and NLP Q&A Model ExampleVideo Activity Recognition and NLP Q&A Model Example
Video Activity Recognition and NLP Q&A Model Example
 
Startup.Ml: Using neon for NLP and Localization Applications
Startup.Ml: Using neon for NLP and Localization Applications Startup.Ml: Using neon for NLP and Localization Applications
Startup.Ml: Using neon for NLP and Localization Applications
 
Using neon for pattern recognition in audio data
Using neon for pattern recognition in audio dataUsing neon for pattern recognition in audio data
Using neon for pattern recognition in audio data
 
Urs Köster Presenting at RE-Work DL Summit in Boston
Urs Köster Presenting at RE-Work DL Summit in BostonUrs Köster Presenting at RE-Work DL Summit in Boston
Urs Köster Presenting at RE-Work DL Summit in Boston
 
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
Andres Rodriguez at AI Frontiers: Catalyzing Deep Learning's Impact in the En...
 

Similaire à High-Performance GPU Programming for Deep Learning

Matrix glitcher tutorial
Matrix glitcher tutorialMatrix glitcher tutorial
Matrix glitcher tutorial
José Mota
 
Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016
Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016
Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016
Alex Vlachos
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
changehee lee
 
new_age_graphics_android_x86
new_age_graphics_android_x86new_age_graphics_android_x86
new_age_graphics_android_x86
Droidcon Berlin
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
Edge AI and Vision Alliance
 
Alex_Vlachos_Advanced_VR_Rendering_GDC2015
Alex_Vlachos_Advanced_VR_Rendering_GDC2015Alex_Vlachos_Advanced_VR_Rendering_GDC2015
Alex_Vlachos_Advanced_VR_Rendering_GDC2015
Alex Vlachos
 

Similaire à High-Performance GPU Programming for Deep Learning (20)

Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1
 
Matrix glitcher tutorial
Matrix glitcher tutorialMatrix glitcher tutorial
Matrix glitcher tutorial
 
Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016
Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016
Alex_Vlachos_Advanced_VR_Rendering_Performance_GDC2016
 
OpenGL for 2015
OpenGL for 2015OpenGL for 2015
OpenGL for 2015
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
 
new_age_graphics_android_x86
new_age_graphics_android_x86new_age_graphics_android_x86
new_age_graphics_android_x86
 
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe
 
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
 
Troubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer PerspectiveTroubleshooting MySQL from a MySQL Developer Perspective
Troubleshooting MySQL from a MySQL Developer Perspective
 
OBDPC 2022
OBDPC 2022OBDPC 2022
OBDPC 2022
 
DC GAN - GO GAME
DC GAN - GO GAMEDC GAN - GO GAME
DC GAN - GO GAME
 
WebRender (MadRust)
WebRender (MadRust)WebRender (MadRust)
WebRender (MadRust)
 
“Improving Power Efficiency for Edge Inferencing with Memory Management Optim...
“Improving Power Efficiency for Edge Inferencing with Memory Management Optim...“Improving Power Efficiency for Edge Inferencing with Memory Management Optim...
“Improving Power Efficiency for Edge Inferencing with Memory Management Optim...
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 
Volodymyr Lyubinets “Generative models for images”
Volodymyr Lyubinets  “Generative models for images”Volodymyr Lyubinets  “Generative models for images”
Volodymyr Lyubinets “Generative models for images”
 
Dissecting and fixing Vulkan rendering issues in drivers with RenderDoc
Dissecting and fixing Vulkan rendering issues in drivers with RenderDocDissecting and fixing Vulkan rendering issues in drivers with RenderDoc
Dissecting and fixing Vulkan rendering issues in drivers with RenderDoc
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
Alex_Vlachos_Advanced_VR_Rendering_GDC2015
Alex_Vlachos_Advanced_VR_Rendering_GDC2015Alex_Vlachos_Advanced_VR_Rendering_GDC2015
Alex_Vlachos_Advanced_VR_Rendering_GDC2015
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
Dissecting the Rendering of The Surge
Dissecting the Rendering of The SurgeDissecting the Rendering of The Surge
Dissecting the Rendering of The Surge
 

Plus de Intel Nervana

Plus de Intel Nervana (10)

Introduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at GalvanizeIntroduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at Galvanize
 
Women in AI kickoff
Women in AI kickoff Women in AI kickoff
Women in AI kickoff
 
ODSC West
ODSC WestODSC West
ODSC West
 
Deep Learning for Robotics
Deep Learning for RoboticsDeep Learning for Robotics
Deep Learning for Robotics
 
RE-Work Deep Learning Summit - September 2016
RE-Work Deep Learning Summit - September 2016RE-Work Deep Learning Summit - September 2016
RE-Work Deep Learning Summit - September 2016
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
 
Object Detection and Recognition
Object Detection and Recognition Object Detection and Recognition
Object Detection and Recognition
 
Introduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntroduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will Constable
 
Urs Köster - Convolutional and Recurrent Neural Networks
Urs Köster - Convolutional and Recurrent Neural NetworksUrs Köster - Convolutional and Recurrent Neural Networks
Urs Köster - Convolutional and Recurrent Neural Networks
 
Anil Thomas - Object recognition
Anil Thomas - Object recognitionAnil Thomas - Object recognition
Anil Thomas - Object recognition
 

Dernier

Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
pritamlangde
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 

Dernier (20)

HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 

High-Performance GPU Programming for Deep Learning

  • 1. High-Performance GPU Programming for Deep Learning 7 April 2016 Scott Gray Nervana Systems MAKING MACHINES SMARTER.™
  • 2. Proprietary and confidential. Do not distribute.ner va na High-Performance GPU kernels for deep learning 2 • Fast matrix multiply for small minibatches • Direct convolution leveraging GEMM advances • Even faster convolution with Winograd
  • 3. Proprietary and confidential. Do not distribute.ner va na GEMM: Basics 3 C = AB
  • 4. Proprietary and confidential. Do not distribute.ner va na GEMM: Memory Load 4 Outer product contiguous Outer product strided threads memory load single tile batched GEMM
  • 5. Proprietary and confidential. Do not distribute.ner va na Batched GEMM tiles 32 x 32 GEMM tile 32 x 64GEMM tile 32 x 32 GEMM: Tile sizes 5 threads shared memory load
  • 6. Proprietary and confidential. Do not distribute.ner va na hGEMM Results - NN 6 Nx3072x3072 NN op 0 1500 3000 4500 6000 32 64 96 128 Nervana 32x32 cuBLAS 128x64 Batch Size (N) GFLOPS
  • 7. Proprietary and confidential. Do not distribute.ner va na hGEMM Results - TN 7 GFLOPS Nx3072x3072 TN op 0 1500 3000 4500 6000 32 64 96 128 Nervana 32x32 cuBLAS 128x64 Batch Size (N)
  • 8. Proprietary and confidential. Do not distribute.ner va na Direct convolution is still relevant 8 • Striding • Odd-size filters • Placeholder until faster algo can be implemented • Often faster for single image or first small C layer
  • 9. Proprietary and confidential. Do not distribute.ner va na Direct convolution: implementation details 9 • Batched GEMM for efficient transpose and higher occupancy • Compound outer product block remapping • Square wave pattern for P,Q block mapping • Slicing: shared memory lookup + integer division • N vs C contiguous • Single P,Q vs tiled P,Q • Bprop as upside down fprop • Update specific optimizations
  • 10. Proprietary and confidential. Do not distribute.ner va na Winograd: input transform 10 Input Feature Map 4x4 stride 2 • Input transform • 2D Winograd is a nested product of 1D transforms • Transforms can be simplified to remove zeros
  • 11. Proprietary and confidential. Do not distribute.ner va na Winograd: filter transform 11 • Filter transform • Same as input but with different coefficients • Transform each feature map independently
  • 12. Proprietary and confidential. Do not distribute.ner va na Winograd: batched GEMM 12
  • 13. Proprietary and confidential. Do not distribute.ner va na Winograd: output transform 13 Output Feature Map • Output transform • Same as input and filter • Transform back to pixel space to obtain 2x2 output tile
  • 14. Proprietary and confidential. Do not distribute.ner va na 14 Performance: VGG VGG fp32 - Totals by operation 0 0.5 1 1.5 2 64 32 16 8 4 2 1 Winograd fp32 fprop Winograd fp32 bprop Winograd fp32 update cuDNN fp32 fprop cuDNN fp32 bprop cuDNN fp32 update AlgorithmicSpeedup Batch Size
  • 15. Proprietary and confidential. Do not distribute.ner va na Performance: Alexnet convolutional layers 15 Alexnet Totals 0 0.5 1 1.5 2 128 64 32 16 8 4 Nervana fp16 Nervana fp32 CuBLAS fp16 CuBLAS fp32 Batch Size AlgorithmicSpeedup
  • 16. Proprietary and confidential. Do not distribute.ner va na Compounding 16 • alpha / beta • bias • relu, prelu, tanh, … • bprop relu, … • bprop bias • batchnorm mean Compounding inside of GEMM and conv for free:
  • 17. Proprietary and confidential. Do not distribute.ner va na Summary 17 • Nervana has the fastest tools for deep learning • neon with state-of-the-art Maxwell kernels • Nervana Cloud with multi-GPU training • Watch for Nervana Engine, our deep learning processor