NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares the GPU roadmap. Primary topics also include the introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.
6. Takayuki Aoki
Global Scientific Information and Computing Center, Tokyo Institute of Technology
“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer”
8. INTRODUCING NVLINK
[Diagram: CPU-to-GPU interconnect, shown alongside PCIe]
Differential with embedded clock
PCIe programming model (w/ DMA+)
Unified Memory
Cache coherency in Gen 2.0
5 to 12X PCIe
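The "5 to 12X PCIe" claim can be sanity-checked with back-of-envelope arithmetic. This is my own sketch, not NVIDIA's numbers: it assumes the PCIe 3.0 x16 baseline of 8 GT/s per lane with 128b/130b encoding.

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding -> usable bits per lane
pcie3_lane_gbps = 8 * 128 / 130           # ~7.88 Gb/s per lane
pcie3_x16_gbs = pcie3_lane_gbps * 16 / 8  # ~15.75 GB/s per direction for x16

# NVLink at 5x to 12x that baseline (per the slide)
nvlink_low = 5 * pcie3_x16_gbs
nvlink_high = 12 * pcie3_x16_gbs
print(f"PCIe 3.0 x16: {pcie3_x16_gbs:.2f} GB/s")
print(f"NVLink range: {nvlink_low:.0f} to {nvlink_high:.0f} GB/s")
```

That puts NVLink somewhere around 80 to 190 GB/s per direction under these assumptions.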
9. 5X More Bandwidth for Multi-GPU Scaling
[Diagram: four GPUs attached to a CPU through a PCIe switch]
10. 3D MEMORY
3D Chip-on-Wafer integration
Many X bandwidth
2.5X capacity
4X energy efficiency
[Chart: Memory Bandwidth vs. year, 2008–2016]
11. Blaise Pascal
1623-1662
Mechanical Calculator
Probability Theory
Pascal’s Theorem
Pascal’s Law
12. PASCAL
NVLink
3D Memory
Module
5 to 12X PCIe 3.0
2 to 4X memory BW & size
1/3 size of PCIe card
13. GPU ROADMAP
SGEMM / W Normalized
[Chart, 2008–2016: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink)]
14. MACHINE LEARNING
Branch of Artificial Intelligence
Computers that learn from data
[Image-recognition example labels: person, car, helmet, motorcycle; bird, frog, person, dog, chair; person, hammer, flower pot, power drill]
16. Building High-level Features Using Large Scale Unsupervised Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng
Stanford / Google
1 billion connections
10 million 200x200 pixel images
1,000 machines (16,000 cores)
3 days
17. GOOGLE BRAIN
1,000 CPU Servers • 2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
Today’s Largest Networks: 1B connections • 10M images • ~3 days • ~30 ExaFLOPS
Human Brain: ~100B neurons × 1,000 connections • 500M images
5,000,000X “Google Brain” • ~150 YottaFLOPS • ~40,000 “Google Brain-Years”
SOURCE: Ian Goodfellow
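The slide's scale factors check out arithmetically. The numbers below come from the slide itself; the multiplication is mine:

```python
# Human brain vs. today's largest network (figures from the slide)
brain_connections = 100e9 * 1000   # ~100B neurons x 1,000 connections
net_connections = 1e9              # 1B connections

conn_ratio = brain_connections / net_connections  # 100,000x connections
image_ratio = 500e6 / 10e6                        # 50x images
scale = conn_ratio * image_ratio                  # ~5,000,000x "Google Brain"

total_flops = 30e18 * scale        # ~30 ExaFLOPS scaled up
years = scale * 3 / 365            # 3 days per "Google Brain" run

print(f"{scale:,.0f}x Google Brain")          # 5,000,000x
print(f"{total_flops / 1e24:.0f} YottaFLOPS") # ~150
print(f"~{years:,.0f} Google Brain-years")    # ~41,000
```

So the quoted "~40,000 Google Brain-Years" is just 5,000,000 runs of 3 days each, converted to years.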
18. Deep Learning with COTS HPC Systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
Stanford / NVIDIA • ICML 2013
STANFORD AI LAB: 3 GPU-Accelerated Servers • 12 GPUs • 18,432 cores • 4 kWatts • $33,000
GOOGLE BRAIN: 1,000 CPU Servers • 2,000 CPUs • 16,000 cores • 600 kWatts • $5,000,000
“Now You Can Build Google’s $1M Artificial Brain on the Cheap” -Wired
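The cost and power gap between the two systems is easy to quantify from the slide's figures (the division is mine, the inputs are the slide's):

```python
# Google Brain (CPU cluster) vs. Stanford AI Lab (GPU servers), per the slides
cpu_cost, cpu_power_kw = 5_000_000, 600
gpu_cost, gpu_power_kw = 33_000, 4

print(f"cost ratio:  ~{cpu_cost / gpu_cost:.0f}x")        # ~152x cheaper
print(f"power ratio: {cpu_power_kw / gpu_power_kw:.0f}x") # 150x less power
```

Both ratios land around 150x, which is the substance of the Wired headline quoted above.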
41. “10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs
42. Unify GPU and Tegra Architecture
192 fully programmable CUDA cores
326 GFLOPS
4X energy efficiency over A15
TEGRA K1 Mobile Super Chip
GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell
MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1
[Diagram: the GPU and mobile architecture lines converge at Tegra K1]
43. Computer Vision on CUDA
Feature Detection / Tracking
~30 GFLOPS @ 30 Hz
Object Recognition / Tracking
~180 GFLOPS @ 30 Hz
3D Scene Interpretation
~280 GFLOPS @ 30 Hz
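Since the slide quotes sustained throughput at a 30 Hz frame rate, dividing gives the per-frame compute budget of each vision stage. The arithmetic below is mine; the GFLOPS figures are the slide's:

```python
# Sustained GFLOPS at 30 Hz -> GFLOP available per frame, per stage
stages = {
    "Feature detection / tracking": 30,
    "Object recognition / tracking": 180,
    "3D scene interpretation": 280,
}
fps = 30
for name, gflops in stages.items():
    print(f"{name}: {gflops / fps:.1f} GFLOP/frame")
```

The heaviest stage, 3D scene interpretation, needs roughly 9.3 GFLOP of work completed every 33 ms.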
44. JETSON TK1 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS
192 CUDA cores
326 GFLOPS
VisionWorks SDK
$192
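The 326 GFLOPS figure is consistent with peak single-precision throughput of cores × clock × 2 (a fused multiply-add counts as 2 FLOPs per cycle). Note the ~852 MHz GPU clock below is my assumption for the Tegra K1, not stated on the slide:

```python
# Peak single-precision FLOPS = CUDA cores x clock x 2 FLOPs per FMA
cuda_cores = 192
gpu_clock_hz = 852e6   # assumed Tegra K1 GPU clock, not on the slide
peak_gflops = cuda_cores * gpu_clock_hz * 2 / 1e9
print(f"~{peak_gflops:.0f} GFLOPS")  # close to the slide's 326 GFLOPS
```

At that clock the formula gives ~327 GFLOPS, matching the quoted 326 GFLOPS to rounding.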
45. VISIONWORKS
COMPUTER VISION ON CUDA
Driver Assistance Computational Photography
Augmented Reality Robotics
[Stack diagram, top to bottom: Your Code and Sample Pipelines (Object Detection / Tracking, Structure from Motion, …), VisionWorks Primitives (Classifier, Corner Detection, …), CUDA, Jetson TK1]
46. TEGRA ROADMAP
Single Precision GFLOPS / W Normalized
[Chart, 2011–2015: Tegra 2, Tegra 3, Tegra 4, Tegra K1 (Kepler GPU, CUDA), Erista (Maxwell GPU, 64b & 32b CPU)]