The first ARM processor prototype ran at Acorn in 1985, and ARM was founded as a separate company in October 1990. Key developments include Acorn's commercial ARM2 processor in the 1980s, the ARM7TDMI used in Nokia phones from 1994, and the Cortex-A9, which improves performance and power efficiency through features such as out-of-order execution, register renaming, and a multi-level cache hierarchy. NEON is ARM's SIMD extension to the instruction set for media and signal processing; it is used in mobile and open-source software, including Android (Skia) and ffmpeg, to accelerate tasks such as video codecs and FFTs.
1. A brief history of ARM
First ARM prototype came alive on 26-Apr-1985: 3um technology, 24800 transistors, 50mm2, consumed 120mW of power
Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline
ARM founded in October 1990, separate company (Apple had 43% stake)
ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994

ARM Architecture & NEON
Ian Rickards
Stanford University 28 Apr 2010
ARM 25 years later: Cortex-A9 MP
1-4 way MP with optimized MESI
16KB, 32KB, 64KB I & D caches
128KB-8MB L2
Multi-issue, Speculation, Renaming, OOO
High performance FPU option
NEON SIMD option
Thumb-2
AXI bus
Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
40G “Low Power” macro: ~5mm2, 800MHz, 0.5W
40G “High Performance” macro: ~7mm2, 2GHz (typ), 2W

Cortex-A9 Processor Microarchitecture
Introduces out-of-order instruction issue and completion
Register renaming to enable execution speculation
Non-blocking memory system with load-store forwarding
Fast loop mode in instruction pre-fetch to lower power consumption
2. Cortex-A9 MPCore Multicore Structure
Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
Hardware coherence for cache, MMU and TLB maintenance operations
Flexible configuration and power-aware interrupt controller
Coherent access to processor caches from accelerators and DMA
Design flexibility over memory throughput and latency
Secure and virtualization aware interrupt and IPI communications
[Block diagram: four Cortex-A9 CPUs, each with FPU/NEON, TRACE, Instruction Cache and Data Cache; Snoop Control Unit (SCU) with Generalized Interrupt Control and Distribution, Accelerator Coherence Port, Cache-2-Cache Transfers, Snoop Filtering and Timers; Advanced Bus Interface Unit; L2 Cache Controller (PL310) 128K-8MB; Primary AMBA 3 64bit Interface; Optional 2nd I/F with Address Filtering]

Hard Macro Configuration and Floorplan
Osprey configuration includes level 2 cache controller and Cortex A9 integration level
Top level includes Coresight PTM, CTI and CTM
Implementation using r1p1 version of Cortex A9
  Dual core
  32k I$ and D$
  NEON present on both cores
  PTM interface present
  128 interrupts
  ACP present
  Two AXI master ports
Level 2 cache memories external (interface exposed)
[Figures: falcon_cpu floorplan; Elba top level floorplan]
Why is ARM looking at “G” processes?
“G” can achieve around double the MHz of “LP”
Active power is lower on “G” than “LP”
Example: push 40LP to 800MHz, to compare with 800MHz MID macro
The estimated LP numbers correlate to an accelerated implementation of an A8
G is close in terms of power if lowered to same performance as on LP.
G can scale much higher in terms of performance than LP.
[Chart: Power, Osprey]

Understanding power
Fundamental power parameters
  Average power => battery life
  Thermal power: sustained power @ max performance
Key requirement is “run and power off” quickly
[Figure: power over time for GUI updates, web page render and music; traditional LP process compared with 40G process running a 2-3x faster clock and powering off between bursts]
3. Power Domains
HiP and MID macros have same power domains
Both use distributed coarse grain power switches
Power plan for CPUs is symmetric
  A9 core and its L1 is power gated in lockstep
Note that all power domains are only ON or OFF, there is no hardware retention mode
  Software routine enables retention to RAM
[Diagram: “Osprey macro” with domains A9_PL310 and A9_PL310_noram; blocks Data Engine 0, Data Engine 1, PTM/Debug, A9 CORE 0 + 32K I/D, A9 CORE 1 + 32K I/D, SCU + PL310_noram, L2 Cache RAM 512/1024KB]

Single-thread Coremarks/MHz
Single-thread performance is key for GUI based applications
  Processor    Coremarks/MHz
  Atom         1.85
  Cortex-A9    2.95
  Cortex-A8    2.72
  1004K        2.33
  74K          2.30
Floating Point Performance
Higher Flash Actionscript from A9
[Charts not recovered; the only surviving label is “Intel”]
4. ARM Architecture evolution
Some not-entirely-RISC features
  LDM / STM
  Full predicated execution (ADDEQ r0, r1, r2)
  Carefully designed with customer/partner input considering gatecount
Thumb
  16-bit instruction set (mostly using r0-r7) selected for compiler requirements
  Design goals: performance from 16-bit wide ROM, codesize
  Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
  Beneficial today – better performance from small caches
Jazelle
  CPU mode allows direct execution of Java bytecodes
  ~60% of Java bytecodes directly executed by datapath
  Top of Java stack stored in registers
  Widely used in Nokia & DoCoMo handsets
…

Dummies’ guide to Si implementation
Basic Fab tech
  65nm, 40nm, 32nm, 28nm, etc.
G vs. LP technology
  40G is 0.9V process, 40LP is 1.1V process
  Much lower leakage with LP, but half the performance
  Intermediate “LPG” from TSMC too! Island of G within LP
Vt’s – each Vt requires additional mask step
  HVt – lower leakage, but slower
  RVt – regular Vt
  LVt – faster, but high leakage esp. at high temperature
Cell library track size
  9-track, 12-track, 15-track (bigger => more powerful)
Backed off implementation vs. pushed implementation
High-K metal Gate
Clock gating
Well biasing…
ARM Architecture Evolution
[Figure: key technology additions by architecture generation (V5, V6, V7 A&R, V7 M). Labels: ARM9, ARM10, ARM11; Jazelle®, VFPv2, SIMD, TrustZone™, Thumb®-2, Thumb-EE (execution environments: improved memory use), VFPv3, NEON™ Adv SIMD (improved media and DSP), Thumb-2 Only (low cost MCU)]

What is NEON?
NEON is a wide SIMD data processing architecture
  Extension of the ARM instruction set
  32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
NEON Instructions perform “Packed SIMD” processing
  Registers are considered as vectors of elements of the same data type
  Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
  Instructions perform the same operation in all lanes
[Diagram: source registers Dn and Dm are split into elements; the operation is applied to each lane and the results are written to destination register Dd]
5. Data Types
NEON natively supports a set of common data types
  Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
  32-bit Single-precision Floating-point
Data types are represented using a bit-size and format letter
  Size   Suffixes                    Meaning
  .8     .I8 (.S8, .U8), .P8         8-bit signed, unsigned integers; polynomials
  .16    .I16 (.S16, .U16), .P16     16-bit signed, unsigned integers; polynomials
  .32    .I32 (.S32, .U32), .F32     32-bit signed, unsigned integers; floats
  .64    .I64 (.S64, .U64)           64-bit signed, unsigned integers

Registers
NEON provides a 256-byte register file
  Distinct from the core registers
  Extension to the VFPv2 register file (VFPv3)
Two explicitly aliased views
  32 x 64-bit registers (D0-D31)
  16 x 128-bit registers (Q0-Q15)
  [Diagram: Q0 = D0, D1; Q1 = D2, D3; ... ; Q15 = D30, D31]
Enables register trade-off
  Vector length
  Available registers
Also uses the summary flags in the VFP FPSCR
  Adds a QC integer saturation summary flag
  No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit -> 32-bit)
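The two aliased register views described above can be illustrated with a small, portable C model. This is only a sketch of the aliasing rule (Qn overlays D(2n) and D(2n+1)); the type `neon_regfile_model` and the helper names are hypothetical and do not correspond to any ARM header or API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the 256-byte NEON register file.
   The same storage is visible as 32 D registers or 16 Q registers. */
typedef union {
    uint64_t d[32];                      /* D0..D31, 64 bits each        */
    struct { uint64_t lo, hi; } q[16];   /* Qn = { D(2n), D(2n+1) }      */
} neon_regfile_model;

/* Write a 64-bit value to Dn. */
static void write_d(neon_regfile_model *rf, unsigned n, uint64_t v)
{
    assert(n < 32);
    rf->d[n] = v;
}

/* Read the low 64 bits of Qn (this is D(2n)). */
static uint64_t read_q_lo(const neon_regfile_model *rf, unsigned n)
{
    assert(n < 16);
    return rf->q[n].lo;
}

/* Read the high 64 bits of Qn (this is D(2n+1)). */
static uint64_t read_q_hi(const neon_regfile_model *rf, unsigned n)
{
    assert(n < 16);
    return rf->q[n].hi;
}
```

In this model, writing D2 and D3 changes what is read back through Q1, which is the trade-off the slide refers to: using a Q register doubles the vector length but consumes two D registers.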
Vectors and Scalars
Registers hold one or more elements of the same data type
  Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
  A register, data type combination describes a vector of elements
  [Diagram: 64-bit Dn (bits 63..0), e.g. D0 as one I64, D7 as two S32; 128-bit Qn (bits 127..0), e.g. Q0 as four F32, Q7 as sixteen S8]
Some instructions can reference individual scalar elements
  Scalar elements are referenced using the array notation Vn[x]
  [Diagram: Q0 as four F32 elements, labelled Q0[3] Q0[2] Q0[1] Q0[0] from most to least significant]
Array ordering is always from the least significant bit

NEON in Audio
FFT: 256-point, 16-bit signed complex numbers
FFT is a key component of AAC, Voice/pattern recognition etc.
Hand optimized assembler in both cases
  FFT time                            No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
  Cortex-A8 500MHz, actual silicon    15.2 us                 3.8 us (x 4.0 performance)
Extreme example: FFT in ffmpeg: 12x faster
  C code -> handwritten asm
  Scalar -> vector processing
  Unpipelined FPU -> pipelined NEON single precision FPU
6. How to use NEON
OpenMAX DL library
  Library of common codec components and signal processing routines
  Status: Released on http://www.arm.com/products/esd/openmax_home.html
Vectorizing Compilers
  Exploits NEON SIMD automatically with existing source code
  Status: Released (in RVDS 3.1 Professional and later)
  Status: Codesourcery 2007q3 gcc and later
C Intrinsics
  C function call interface to NEON operations
  Supports all data types and operations supported by NEON
  Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
Assembler
  For those who really want to optimize at the lowest level
  Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

For NEON instruction reference
Official NEON instruction set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
Available to partners & www.arm.com request system
ARM RVDS & gcc vectorising compiler
Source (test.c):
    int a[256], b[256], c[256];
    void foo(void) {
        int i;
        for (i=0; i<256; i++){
            a[i] = b[i] + c[i];
        }
    }

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c
    |L1.16|
        VLD1.32 {d0,d1},[r0]!
        SUBS r3,r3,#1
        VLD1.32 {d2,d3},[r1]!
        VADD.I32 q0,q0,q1
        VST1.32 {d0,d1},[r2]!
        BNE |L1.16|

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c
    .L2:
        add r1, r0, ip
        add r3, r0, lr
        add r2, r0, r4
        add r0, r0, #8
        cmp r0, #1024
        fldd d7, [r3, #0]
        fldd d6, [r2, #0]
        vadd.i32 d7, d7, d6
        fstd d7, [r1, #0]
        bne .L2

armcc generates better NEON code
(gcc can use Q-regs with ‘-mvectorize-with-neon-quad’ )

Intrinsics
Include intrinsics header file
    #include <arm_neon.h>
Use special NEON data types which correspond to D and Q registers, e.g.
    int8x8_t   D-register containing 8x 8-bit elements
    int16x4_t  D-register containing 4x 16-bit elements
    int32x4_t  Q-register containing 4x 32-bit elements
Use special intrinsics versions of NEON instructions
    vin1 = vld1q_s32(ptr);
    vout = vaddq_s32(vin1, vin2);
    vst1q_s32(ptr, vout);    /* pointer first, then the vector to store */
Strongly typed!
    Use vreinterpret_s16_s32( ) to change the type
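As a rough illustration of the intrinsics style described above, the array addition from the vectorising-compiler slide could be written by hand as follows. This is a minimal sketch, not the code from the talk: the function name `add_i32` is hypothetical, and a scalar fallback is included (an assumption made here so the file also builds on non-NEON targets).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* a[i] = b[i] + c[i] for n elements; n must be a multiple of 4.
   add_i32 is a hypothetical helper name used only for this sketch. */
void add_i32(int32_t *a, const int32_t *b, const int32_t *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        int32x4_t vb = vld1q_s32(b + i);      /* load 4 lanes from b   */
        int32x4_t vc = vld1q_s32(c + i);      /* load 4 lanes from c   */
        vst1q_s32(a + i, vaddq_s32(vb, vc));  /* add and store 4 lanes */
#else
        /* Scalar fallback: same wrap-around result, one lane at a time. */
        for (size_t lane = 0; lane < 4; lane++)
            a[i + lane] = (int32_t)((uint32_t)b[i + lane] +
                                    (uint32_t)c[i + lane]);
#endif
    }
}
```

With gcc on an ARMv7 target this would typically be built with `-mfpu=neon`, as in the gcc command line shown on the slide.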
7. NEON in opensource
Bluez – official Linux Bluetooth protocol stack
  NEON sbc audio encoder
Pixman (part of cairo 2D graphics library)
  Compositing/alpha blending
  X.Org, Mozilla Firefox, fennec, & Webkit browsers
  e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
ffmpeg – libavcodec
  LGPL media player used in many Linux distros
  NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
  NEON Audio: AAC, Vorbis, WMA
x264 – Google Summer Of Code 2009
  GPL H.264 encoder – e.g. for video conferencing
Android – NEON optimizations
  Skia library, S32A_D565_Opaque 5x faster using NEON
  Available in Google Skia tree from 03-Aug-2009
Eigen2 linear algebra library
Ubuntu 09.04 – supports NEON
  NEON versions of critical shared-libraries

Many different levels of parallelism
Multi-issue parallelism
NEON SIMD parallelism
Multi-core parallelism
ffmpeg (libavcodec) performance
git.ffmpeg.org snapshot 21-Sep-09
YouTube HQ video decode
  480x270, 30fps
  Including AAC audio
Real silicon measurements
  OMAP3 Beagleboard
  ARM A9TC
NEON ~2x overall performance

Scalability with SMP on Cortex-A9
[Chart not recovered]
9. Multiple 1-Element Structure Access
VLD1, VST1 provide standard array access
  An array of structures containing a single component is a basic array
  List can contain 1, 2, 3 or 4 consecutive registers
  Transfer multiple consecutive 8, 16, 32 or 64-bit elements
VLD1.16 {D7}, [R4], R3
  Memory: [R4] x0, +2 x1, +4 x2, +6 x3; R4 is then advanced by R3
  Result: D7 = x3 x2 x1 x0
VST1.16 {D3,D4}, [R1]
  Registers: D3 = x3 x2 x1 x0, D4 = x7 x6 x5 x4
  Memory: [R1] x0, +2 x1, +4 x2, +6 x3, +8 x4, +10 x5, +12 x6, +14 x7

Quick review of NEON instructions
Addition: Basic
NEON supports various useful forms of basic addition
Normal Addition - VADD, VSUB
  Floating-point
  Integer (8-bit to 64-bit elements)
  64-bit and 128-bit registers
    VADD.I16 D0, D1, D2
    VSUB.F32 Q7, Q1, Q4
    VADD.I8 Q15, Q14, Q15
    VSUB.I64 D0, D30, D5
Long Addition - VADDL, VSUBL
  Promotes both inputs before operation
  Signed/unsigned (8-bit to 32-bit source elements)
    VADDL.U16 Q1, D7, D8
    VSUBL.S32 Q8, D1, D5
Wide Addition - VADDW, VSUBW
  Promotes one input before operation
  Signed/unsigned (8-bit to 32-bit source elements)
    VADDW.U8 Q1, Q7, D8
    VSUBW.S16 Q8, Q1, D5

Example – adding all lanes
Input in Q0 (D0 and D1): u16 input values
VPADDL.U16 Q0, Q0
  Now Q0 contains 4x u32 values (with 15 headroom bits)
VPADD.U32 D0, D0, D1
  Reducing/folding operation needs 1 bit of headroom
VPADDL.U32 D0, D0
  D0 now holds the total
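The three-instruction reduction above can be followed lane by lane with a portable C model. This is a sketch of the arithmetic only, assuming the lane layout described on the slide (Q0 = D0 in the low half, D1 in the high half); the `_model` function names are hypothetical and are not NEON intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of VPADDL.U16 Qd, Qm: add adjacent pairs of eight u16 lanes,
   widening each result to u32 (so no carry is lost). */
static void vpaddl_u16_model(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
}

/* Model of VPADD.U32 Dd, Dn, Dm: lane 0 = n[0]+n[1], lane 1 = m[0]+m[1]. */
static void vpadd_u32_model(const uint32_t n[2], const uint32_t m[2],
                            uint32_t out[2])
{
    out[0] = n[0] + n[1];
    out[1] = m[0] + m[1];
}

/* Model of VPADDL.U32 Dd, Dm: add the two u32 lanes into one u64. */
static uint64_t vpaddl_u32_model(const uint32_t in[2])
{
    return (uint64_t)in[0] + in[1];
}

/* Sum all eight u16 lanes of a Q register using the slide's sequence. */
uint64_t sum_u16x8(const uint16_t q0[8])
{
    assert(q0 != NULL);
    uint32_t q[4];
    uint32_t d[2];
    vpaddl_u16_model(q0, q);          /* VPADDL.U16 Q0, Q0     */
    vpadd_u32_model(&q[0], &q[2], d); /* VPADD.U32  D0, D0, D1 */
    return vpaddl_u32_model(d);       /* VPADDL.U32 D0, D0     */
}
```

The widening steps are what the Registers slide means by handling 'carry' with a wider result: each u16 pair sum needs at most 17 bits, which leaves 15 bits of headroom in a u32 lane.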
10. Exercise 2 - summing a vector
[Diagram: tree of pairwise additions folding the lanes of D0 and D1 into a single sum in D0]

Some NEON clever features
Data Movement: Table Lookup
Uses byte indexes to control byte look up in a table
Table is a list of 1,2,3 or 4 adjacent registers
VTBL.8 D0, {D1, D2}, D3
  D3 (indexes):     11  4  8 13 26  8  0  3
  {D1,D2} (table):  p o n m l k j i h g f e d c b a
  D0 (result):      l  e  i  n  0  i  a  d
VTBL : out of range indexes generate 0 result
VTBX : out of range indexes leave destination unchanged

Element Load Store Instructions
All treat memory as an array of structures (AoS)
SIMD registers are treated as structure of arrays (SoA)
Enables interleaving/de-interleaving for efficient SIMD processing
Transfer up to 256-bits in a single instruction
[Diagram: memory holding x3 z2 y2 x2 z1 y1 x1 z0 y0 x0, with one element and one 3-element structure marked]
Three forms of Element Load Store instructions are provided
Forms distinguished by type of register list provided
  Multiple Structure Access e.g. {D0, D1}
  Single Structure Access e.g. {D0[2], D1[2]}
  Single Structure Load to all lanes e.g. {D0[], D1[]}
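The table-lookup behaviour, including the difference between VTBL and VTBX for out-of-range indexes, can be sketched in portable C. The function names below are hypothetical models written for this note; they are not NEON intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of VTBL.8 Dd, {table}, Dm.
   table_len is 8, 16, 24 or 32 bytes (1 to 4 D registers).
   Out-of-range indexes produce 0. */
void vtbl8_model(uint8_t dst[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    assert(table_len <= 32 && table_len % 8 == 0);
    for (int lane = 0; lane < 8; lane++)
        dst[lane] = (idx[lane] < table_len) ? table[idx[lane]] : 0;
}

/* Model of VTBX.8 Dd, {table}, Dm.
   Out-of-range indexes leave the destination lane unchanged. */
void vtbx8_model(uint8_t dst[8], const uint8_t *table, size_t table_len,
                 const uint8_t idx[8])
{
    assert(table_len <= 32 && table_len % 8 == 0);
    for (int lane = 0; lane < 8; lane++)
        if (idx[lane] < table_len)
            dst[lane] = table[idx[lane]];
}
```

Using the slide's values with lane 0 listed first, the table is the bytes 'a' to 'p' and the indexes are {3, 0, 8, 26, 13, 8, 4, 11}; index 26 is beyond the 16-byte table, so that lane becomes 0 with VTBL and keeps its old value with VTBX.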
11. Multiple 2-Element Structure Access
VLD2, VST2 provide access to multiple 2-element structures
  List can contain 2 or 4 registers
  Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
VLD2.16 {D2,D3}, [R1]
  Memory: [R1] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, +14 y3
  Result: D2 = x3 x2 x1 x0, D3 = y3 y2 y1 y0
VLD2.16 {D0,D1,D2,D3}, [R3]!
  Memory: [R3] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, ... , +28 x7, +30 y7
  Result: D0 = x3 x2 x1 x0, D1 = x7 x6 x5 x4, D2 = y3 y2 y1 y0, D3 = y7 y6 y5 y4

Multiple 3/4-Element Structure Access
VLD3/4, VST3/4 provide access to 3 or 4-element structures
  Lists contain 3/4 registers; optional space for building 128-bit vectors
  Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures
VLD3.16 {D0,D2,D4}, [R1]!
  Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ... , +20 y3, +22 z3
  Result: D0 = x3 x2 x1 x0, D2 = y3 y2 y1 y0, D4 = z3 z2 z1 z0 (D1 and D3 are skipped)
VST3.16 {D3,D4,D5}, [R1]
  Registers: D3 = x3 x2 x1 x0, D4 = y3 y2 y1 y0, D5 = z3 z2 z1 z0
  Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ... , +20 y3, +22 z3
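The array-of-structures to structure-of-arrays conversion performed by the 3-element forms can be modelled in a few lines of portable C. This is an illustrative sketch with hypothetical function names; it mirrors the data movement of VLD3.16 and VST3.16 on three D registers, with the returned pointer standing in for the "!" writeback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of VLD3.16 {Dx,Dy,Dz}, [Rn]!: read four 3-element structures
   (x,y,z) from memory and de-interleave them into three 4-lane vectors.
   Returns the advanced pointer. */
const uint16_t *vld3_16_model(const uint16_t *mem,
                              uint16_t x[4], uint16_t y[4], uint16_t z[4])
{
    assert(mem != NULL);
    for (int i = 0; i < 4; i++) {
        x[i] = mem[3 * i];       /* first member of structure i  */
        y[i] = mem[3 * i + 1];   /* second member of structure i */
        z[i] = mem[3 * i + 2];   /* third member of structure i  */
    }
    return mem + 12;
}

/* Model of VST3.16 {Dx,Dy,Dz}, [Rn]!: interleave three 4-lane vectors
   back into four (x,y,z) structures. Returns the advanced pointer. */
uint16_t *vst3_16_model(uint16_t *mem, const uint16_t x[4],
                        const uint16_t y[4], const uint16_t z[4])
{
    assert(mem != NULL);
    for (int i = 0; i < 4; i++) {
        mem[3 * i]     = x[i];
        mem[3 * i + 1] = y[i];
        mem[3 * i + 2] = z[i];
    }
    return mem + 12;
}
```

After the load, each vector holds one component for four structures, so a single SIMD operation can process, for example, all four x values at once.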
Logical
NEON supports bitwise logical operations
VAND, VBIC, VEOR, VORN, VORR
  Bitwise logical operation
  Independent of data type
  64-bit and 128-bit registers
    VAND D0, D0, D1
    VORR Q0, Q1, Q15
    VEOR Q7, Q1, Q15
    VORN D15, D14, D1
    VBIC D0, D30, D2
VBIT, VBIF, VBSL
  Bitwise multiplex operations
  Insert True, Insert False, Select
  3 versions overwrite different registers
  64-bit and 128-bit registers
  Used with masks to provide selection
    VBIT D1, D0, D2
  [Diagram: bits of D0 are inserted into D1 wherever the mask in D2 is 1]

Alignment hints on NEON load/store
NEON data load/store: VLDn/VSTn
Full unaligned support for NEON data access
Instruction contains ‘alignment hint’ which permits implementations to be faster when address is aligned and hint is specified.
  Usage: base address specified as [<Rn>:<align>]
  Note it is a programming error to specify hint, but use incorrectly aligned address
Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs
    VLD1.8 {D0}, [R1:64]
    VLD1.8 {D0,D1}, [R4:128]!
    VLD1.8 {D0,D1,D2,D3}, [R7:256], R2
ARM ARM uses “@” but this is not recommended in source code
GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
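The three bitwise multiplex instructions differ only in which operand acts as the mask and which register is overwritten. The following portable C sketch models them on 64-bit (D register) values; the `_model` names are hypothetical, and each function returns the new value of the destination register Dd.

```c
#include <stdint.h>

/* VBSL Dd, Dn, Dm: Dd holds the mask. Select bits from Dn where the
   mask bit is 1, otherwise from Dm. */
uint64_t vbsl_model(uint64_t d, uint64_t n, uint64_t m)
{
    return (d & n) | (~d & m);
}

/* VBIT Dd, Dn, Dm: "insert if true". Copy bits of Dn into Dd where the
   corresponding bit of Dm is 1; other bits of Dd are kept. */
uint64_t vbit_model(uint64_t d, uint64_t n, uint64_t m)
{
    return (n & m) | (d & ~m);
}

/* VBIF Dd, Dn, Dm: "insert if false". Copy bits of Dn into Dd where the
   corresponding bit of Dm is 0; other bits of Dd are kept. */
uint64_t vbif_model(uint64_t d, uint64_t n, uint64_t m)
{
    return (d & m) | (n & ~m);
}
```

For the slide's example, VBIT D1, D0, D2 corresponds to `d1 = vbit_model(d1, d0, d2)`: D2 is the mask, D0 supplies the inserted bits, and D1 is both an input and the result.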
12. Dual issue [Cortex-A8 only]
NEON can dual issue NEON in the following circumstances
  No register operand/result dependencies
  NEON data processing (ALU) instruction
  NEON load/store or NEON byte permute instruction or MRC/MCR
  VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX
    VLD1.8 {D0}, [R1]!
    VMLAL.S8 Q2, D3, D2
    VEXT.8 D0, D1, D2, #1
    SUBS r12, r12, #1
Also can dual-issue NEON with ARM instructions
    VLD1.8 {D0}, [R1]!
    SUBS r12, r12, #1

Thank you!
ARM Architecture has evolved with a balance of pure RISC and customer driven input
NEON offers a clean architecture targeted at compiler code generation, offering
  Unaligned access
  Structure load/store operations
  Dual D-register/Q-register view to optimize register bank
  Balance of performance vs. gatecount
Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications