Rethinking computation: A processor architecture for machine intelligence
Proprietary and confidential. Do not distribute.
17 May 2016
Amir Khosrowshahi
Co-founder and CTO, Nervana
MAKING MACHINES SMARTER.™
About nervana
• A platform for machine intelligence
• Enables deep learning at scale
• Optimized from algorithms to silicon
Model and substrate for computation
[Figure: mapping a functional model of the mammalian cortex onto a machine learning model. Hard!]
Model and substrate for computation
Do this instead: map a deep learning model onto a custom ASIC via:
• Model description language
• Hardware abstraction layer
• Distributed primitives
• Compilers, drivers
Feasible, but still hard.
Application areas
Healthcare Agriculture Finance
Online Services Automotive Energy
nervana cloud
Data: images, text, tabular, speech, time series, video
Cloud pipeline: import → build → train → deploy
Deep learning as a core technology
'Google Brain' model: deep learning at the core of Photos, Maps, Voice Search, self-driving cars, ad targeting, and machine translation.
Nervana Platform: deep learning at the core of image classification, object localization, video indexing, speech recognition, and natural language.
nervana neon
• Fastest library
• Model support
  • Models: ConvNet, RNN/LSTM, MLP, DQN, NTM
  • Domains: images, video, speech, text, time series
Running locally:
% python rnn.py                      # or: neon rnn.yaml
Running in nervana cloud:
% ncloud submit --py rnn.py          # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>   # or use the REST API
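The deck never shows the contents of rnn.py itself, so here is a hypothetical sketch, in plain NumPy rather than the actual neon API, of the recurrent forward pass such a script would train. Every function name, weight name, and dimension below is invented for illustration.

```python
# Hypothetical sketch of the recurrence a script like rnn.py computes.
# Plain NumPy, not the neon API; all names/sizes are illustrative.
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a sequence.

    x_seq: (T, input_dim) inputs; returns (T, hidden_dim) hidden states.
    """
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)
    states = []
    for x in x_seq:
        # Each step mixes the current input with the previous hidden state.
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
states = rnn_forward(
    rng.standard_normal((T, input_dim)),
    rng.standard_normal((input_dim, hidden_dim)) * 0.1,
    rng.standard_normal((hidden_dim, hidden_dim)) * 0.1,
    np.zeros(hidden_dim),
)
print(states.shape)  # (5, 4)
```

Training (backprop through time, optimizers, batching) is what a framework like neon layers on top of this loop, along with the backend dispatch described on the next slides.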
Backends
• CPU
• GPU
• Multiple GPUs
• Parameter server
• (Xeon Phi)
• nervana TPU
nervana neon
• Fastest library
• Model support
• Cloud integration
• Multiple backends
• Optimized at assembler level
nervana tensor processing unit (TPU)
• Unprecedented compute density: 1 nervana engine @ 200 W ≈ 10 GPUs @ 2,000 W ≈ 200 CPUs @ 20,000 W
• Scalable distributed architecture
[Figure: a CPU couples a single ALU and control unit to a shared instruction-and-data memory; the Nervana design places data memory and control next to each compute element]
• Memory near computation
• Learning and inference
• Exploit limited precision
• Incorporate latest advances
• Power efficiency
• 10-100x gain
• Architecture optimized for the algorithm
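A minimal NumPy sketch of what "exploit limited precision" buys, assuming plain float16 storage as the reduced format (the deck does not specify Nervana's actual number format, so this is only illustrative of the mechanism): halving the bits per operand roughly halves memory traffic, while the result of a matrix multiply stays close to the full-precision answer.

```python
# Illustrative only: float16 storage vs. float32 reference for a GEMM.
# Not Nervana's actual numerics -- the deck leaves the format unspecified.
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((128, 128)).astype(np.float32)
b = rng.standard_normal((128, 128)).astype(np.float32)

full = a @ b                                            # float32 reference
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# Relative error introduced by 16-bit operands: small for well-scaled data
rel_err = float(np.abs(full - half).max() / np.abs(full).max())
print(rel_err)  # small -- well under 5% for this well-conditioned case
```

Deep learning training tolerates this noise, which is why reduced-precision arithmetic is a lever for compute density and power efficiency that general-purpose FP64/FP32 hardware leaves on the table.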
General purpose computation
2000s: SoC
Motivation: reduce power and cost; fungible computing. Enabled inexpensive mobile devices.
Dennard scaling has ended
[Chart: transistor counts continue to rise, while clock speed, power, and performance per clock have flattened]
What's next?
Many-core tiled architectures
From the Tile Processor Architecture Overview for the TILEPro Series: the iMesh interconnect provides high-bandwidth, extremely low-latency communication among tiles and connects the on-chip external memory and I/O interfaces to the tiles, making the Tile Processor a complete programmable multicore processor. Figure 2-1 shows the 64-core TILEPro64 with details of an individual tile's structure. Each tile is a powerful, full-featured computing system that can independently run an entire operating system, such as Linux: it implements a 32-bit integer processor engine using a three-way Very Long Instruction Word (VLIW) architecture with its own program counter, cache, and DMA subsystem, and can execute up to three operations per cycle.
[Figure 2-1: the TILEPro64 tile grid; each tile contains a processor engine, cache engine, and switch engine, linked by six on-chip networks (UDN, STN, MDN, IDN, TDN, CDN) to DDR2 memory controllers and I/O interfaces (XAUI 10GbE, RGMII GbE, PCIe, FlexI/O)]
2010s: multi-core, GPGPU
Motivation: increased performance without clock rate increases or smaller devices. Requires changes in the programming paradigm.
Examples: Tilera, NVIDIA GM204, Intel Xeon Phi (Knights Landing)
Special purpose computation: Anton
[Figure (Shaw et al., 2014): (a) Anton 2 ASICs connected directly by high-speed channels into a three-dimensional torus; (b) schematic view of an Anton 2 ASIC, with two connections to each torus neighbor, 16 flexible-subsystem ("flex") tiles, two high-throughput interaction subsystem (HTIS) tiles, a host interface (HOST), and an on-die logic analyzer (LA); (c) physical layout of the 20.4 mm × 20 mm Anton 2 ASIC in 40-nm technology]
Computational motifs
Motif                            Examples
1  Dense linear algebra          Matrix multiply (GEMM)
2  Sparse linear algebra         SpMV
3  Spectral methods              FFT
4  N-body methods                Molecular dynamics
5  Structured grids              Lattice Boltzmann
6  Unstructured grids            CFD
7  Map-reduce                    Expectation maximization
8  Combinational logic           Encryption, hashing
9  Graph traversal               Decision trees, quicksort
10 Dynamic programming           Forward-backward
11 Backtrack, branch and bound   Constraint satisfaction
12 Graphical models              HMM, Bayesian networks
13 Finite state machines         Compilers
(Asanovic et al., 2006)
Can be implemented using:
• Silicon
• Software
• Neural network architectures!
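The mapping also runs the other way: deep learning workloads reduce almost entirely to motif 1, dense linear algebra. A small NumPy sketch (layer sizes invented for illustration) showing that a fully connected layer's forward pass over a whole minibatch is one GEMM plus elementwise work:

```python
# Motif 1 in practice: a fully connected layer for a minibatch is one GEMM.
# Sizes are illustrative, not from the deck.
import numpy as np

rng = np.random.default_rng(1)
batch, in_dim, out_dim = 64, 256, 128
x = rng.standard_normal((batch, in_dim))        # minibatch of activations
w = rng.standard_normal((in_dim, out_dim)) * 0.1  # layer weights
bias = np.zeros(out_dim)

# One matrix multiply, a broadcast add, and an elementwise ReLU.
y = np.maximum(x @ w + bias, 0.0)
print(y.shape)  # (64, 128)
```

This is why an architecture specialized for dense tensor operations can credibly claim large gains over general-purpose hardware on deep learning workloads.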
Summary
• Computers are tools for solving problems of their time
• Was: coding, calculation, graphics, web
• Today: learning and inference on data
• Deep learning as a computational paradigm
• Custom architecture can do vastly better
• We are hiring! Summer interns and full time.