NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares the GPU roadmap. Primary topics also include the introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.
6. Takayuki Aoki
Global Scientific Information and Computing Center, Tokyo Institute of Technology
“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer”
8. INTRODUCING NVLINK
[Diagram: CPU-to-GPU interconnect, shown alongside PCIe]
Differential with embedded clock
PCIe programming model (w/ DMA+)
Unified Memory
Cache coherency in Gen 2.0
5 to 12X PCIe
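The "5 to 12X PCIe" claim can be sanity-checked with back-of-envelope arithmetic. This is my own sketch, not NVIDIA's numbers: it assumes the PCIe 3.0 x16 baseline of 8 GT/s per lane with 128b/130b encoding.

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding -> usable bits per lane
pcie3_lane_gbps = 8 * 128 / 130           # ~7.88 Gb/s per lane
pcie3_x16_gbs = pcie3_lane_gbps * 16 / 8  # ~15.75 GB/s per direction for x16

# NVLink at 5x to 12x that baseline (per the slide)
nvlink_low = 5 * pcie3_x16_gbs
nvlink_high = 12 * pcie3_x16_gbs
print(f"PCIe 3.0 x16: {pcie3_x16_gbs:.2f} GB/s")
print(f"NVLink range: {nvlink_low:.0f} to {nvlink_high:.0f} GB/s")
```

That puts NVLink somewhere around 80 to 190 GB/s per direction under these assumptions.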
9. 5X More Bandwidth for Multi-GPU Scaling
[Diagram: four GPUs attached to a CPU through a PCIe switch]
10. 3D MEMORY
3D Chip-on-Wafer integration
Many X bandwidth
2.5X capacity
4X energy efficiency
[Chart: Memory Bandwidth vs. year, 2008–2016]
11. Blaise Pascal
1623-1662
Mechanical Calculator
Probability Theory
Pascal’s Theorem
Pascal’s Law
12. PASCAL
NVLink
3D Memory
Module
5 to 12X PCIe 3.0
2 to 4X memory BW & size
1/3 size of PCIe card
13. GPU ROADMAP
SGEMM / W Normalized
[Chart, 2008–2016: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink)]
14. MACHINE LEARNING
Branch of Artificial Intelligence
Computers that learn from data
[Image-recognition example labels: person, car, helmet, motorcycle; bird, frog, person, dog, chair; person, hammer, flower pot, power drill]
16. Building High-level Features Using Large Scale Unsupervised Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng
Stanford / Google
1 billion connections
10 million 200x200 pixel images
1,000 machines (16,000 cores)
3 days
17. GOOGLE BRAIN
1,000 CPU Servers • 2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
Today’s Largest Networks: 1B connections • 10M images • ~3 days • ~30 ExaFLOPS
Human Brain: ~100B neurons × 1,000 connections • 500M images
5,000,000X “Google Brain” • ~150 YottaFLOPS • ~40,000 “Google Brain-Years”
SOURCE: Ian Goodfellow
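The slide's scale factors check out arithmetically. The numbers below come from the slide itself; the multiplication is mine:

```python
# Human brain vs. today's largest network (figures from the slide)
brain_connections = 100e9 * 1000   # ~100B neurons x 1,000 connections
net_connections = 1e9              # 1B connections

conn_ratio = brain_connections / net_connections  # 100,000x connections
image_ratio = 500e6 / 10e6                        # 50x images
scale = conn_ratio * image_ratio                  # ~5,000,000x "Google Brain"

total_flops = 30e18 * scale        # ~30 ExaFLOPS scaled up
years = scale * 3 / 365            # 3 days per "Google Brain" run

print(f"{scale:,.0f}x Google Brain")          # 5,000,000x
print(f"{total_flops / 1e24:.0f} YottaFLOPS") # ~150
print(f"~{years:,.0f} Google Brain-years")    # ~41,000
```

So the quoted "~40,000 Google Brain-Years" is just 5,000,000 runs of 3 days each, converted to years.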
18. Deep Learning with COTS HPC Systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
Stanford / NVIDIA • ICML 2013
STANFORD AI LAB: 3 GPU-Accelerated Servers • 12 GPUs • 18,432 cores • 4 kWatts • $33,000
GOOGLE BRAIN: 1,000 CPU Servers • 2,000 CPUs • 16,000 cores • 600 kWatts • $5,000,000
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired
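The cost and power gap between the two systems is easy to quantify from the slide's figures (the division is mine, the inputs are the slide's):

```python
# Google Brain (CPU cluster) vs. Stanford AI Lab (GPU servers), per the slides
cpu_cost, cpu_power_kw = 5_000_000, 600
gpu_cost, gpu_power_kw = 33_000, 4

print(f"cost ratio:  ~{cpu_cost / gpu_cost:.0f}x")        # ~152x cheaper
print(f"power ratio: {cpu_power_kw / gpu_power_kw:.0f}x") # 150x less power
```

Both ratios land around 150x, which is the substance of the Wired headline quoted above.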
41. “10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs
42. Unify GPU and Tegra Architecture
192 fully programmable CUDA cores
326 GFLOPS
4X energy efficiency over A15
TEGRA K1 Mobile Super Chip
GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell
MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1
[Diagram: the GPU and mobile architecture lines converge at Tegra K1]
43. Computer Vision on CUDA
Feature Detection / Tracking
~30 GFLOPS @ 30 Hz
Object Recognition / Tracking
~180 GFLOPS @ 30 Hz
3D Scene Interpretation
~280 GFLOPS @ 30 Hz
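Since the slide quotes sustained throughput at a 30 Hz frame rate, dividing gives the per-frame compute budget of each vision stage. The arithmetic below is mine; the GFLOPS figures are the slide's:

```python
# Sustained GFLOPS at 30 Hz -> GFLOP available per frame, per stage
stages = {
    "Feature detection / tracking": 30,
    "Object recognition / tracking": 180,
    "3D scene interpretation": 280,
}
fps = 30
for name, gflops in stages.items():
    print(f"{name}: {gflops / fps:.1f} GFLOP/frame")
```

The heaviest stage, 3D scene interpretation, needs roughly 9.3 GFLOP of work completed every 33 ms.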
44. JETSON TK1 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS
192 CUDA cores
326 GFLOPS
VisionWorks SDK
$192
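The 326 GFLOPS figure is consistent with peak single-precision throughput of cores × clock × 2 (a fused multiply-add counts as 2 FLOPs per cycle). Note the ~852 MHz GPU clock below is my assumption for the Tegra K1, not stated on the slide:

```python
# Peak single-precision FLOPS = CUDA cores x clock x 2 FLOPs per FMA
cuda_cores = 192
gpu_clock_hz = 852e6   # assumed Tegra K1 GPU clock, not on the slide
peak_gflops = cuda_cores * gpu_clock_hz * 2 / 1e9
print(f"~{peak_gflops:.0f} GFLOPS")  # close to the slide's 326 GFLOPS
```

At that clock the formula gives ~327 GFLOPS, matching the quoted 326 GFLOPS to rounding.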
45. VISIONWORKS
COMPUTER VISION ON CUDA
Driver Assistance Computational Photography
Augmented Reality Robotics
[Stack diagram, top to bottom: Your Code and Sample Pipelines (Object Detection / Tracking, Structure from Motion, …), VisionWorks Primitives (Classifier, Corner Detection, …), CUDA, Jetson TK1]
46. TEGRA ROADMAP
Single Precision GFLOPS / W Normalized
[Chart, 2011–2015: Tegra 2, Tegra 3, Tegra 4, Tegra K1 (Kepler GPU, CUDA), Erista (Maxwell GPU, 64b & 32b CPU)]