ARM Architecture & NEON
    Ian Rickards
    Stanford University 28 Apr 2010

A brief history of ARM
 First ARM prototype came alive on 26-Apr-1985: 3um technology, 24800 transistors, 50mm2, consumed 120mW of power
 Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline
 ARM founded in October 1990 as a separate company (Apple had a 43% stake)
 ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994




1                                                              2




ARM 25 years later: Cortex-A9 MP
 1-4 way MP with optimized MESI
 16KB, 32KB, 64KB I & D caches
 128KB-8MB L2
 Multi-issue, Speculation, Renaming, OOO
 High performance FPU option
 NEON SIMD option
 Thumb-2
 AXI bus
 Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
 40G “Low Power” macro: ~5mm2, 800MHz, 0.5W
 40G “High Performance” macro: ~7mm2, 2GHz (typ), 2W

Cortex-A9 Processor Microarchitecture
 Introduces out-of-order instruction issue and completion
 Register renaming to enable execution speculation
 Non-blocking memory system with load-store forwarding
 Fast loop mode in instruction pre-fetch to lower power consumption

3                                                              4
Cortex-A9 MPCore Multicore Structure
 Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
 Hardware coherence for cache, MMU and TLB maintenance operations
 Coherent access to processor caches from accelerators and DMA
 Flexible configuration and power-aware interrupt controller
 Secure and virtualization-aware interrupt and IPI communications
 Design flexibility over memory throughput and latency
 [Block diagram: four Cortex-A9 CPUs, each with FPU/NEON, TRACE, Instruction Cache and Data Cache; Snoop Control Unit (SCU) with Cache-2-Cache Transfers, Snoop Filtering, Timers, Generalized Interrupt Control and Distribution, and Accelerator Coherence Port; Advanced Bus Interface Unit; L2 Cache Controller (PL310) 128K-8MB; Primary AMBA 3 64bit Interface; Optional 2nd I/F with Address Filtering]

Hard Macro Configuration and Floorplan
 Osprey configuration includes level 2 cache controller and Cortex A9 integration level
 Top level includes Coresight PTM, CTI and CTM
 Implementation using r1p1 version of Cortex A9
     Dual core
     32k I$ and D$
     NEON present on both cores
     PTM interface present
     128 interrupts
     ACP present
     Two AXI master ports
     Level 2 cache memories external (interface exposed)
 [Figures: falcon_cpu floorplan; Elba top level floorplan]




   5                                                                                                                                                                       6




Why is ARM looking at “G” processes?
 “G” can achieve around double the MHz of “LP”
 Active power is lower on “G” than on “LP”
 Example: push 40LP to 800MHz, to compare with the 800MHz MID macro
     The estimated LP numbers correlate to an accelerated implementation of an A8
     G is close in terms of power if lowered to the same performance as on LP
     G can scale much higher in terms of performance than LP
     Key requirement is “run and power off” quickly

Understanding power
 Fundamental power parameters
     Average power => battery life
     Thermal power => sustained power @ max performance
 [Figure: three power-versus-time plots labelled “Traditional LP process”, “40G process” and “Osprey”; annotations: GUI updates, music, web page render, “2-3x faster clock”, “power off”]
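The “run and power off” argument can be put into numbers with a very simple model. The sketch below is only an illustration: the function names and any values passed in are hypothetical, and the model ignores leakage while powered off and the cost of waking up.

```c
/* Average power of a core that is active for a fraction `duty` of the time
   and otherwise sits in a low-power (ideally powered-off) state.
   Hypothetical helper; all arguments are in watts except `duty` (0..1). */
static double average_power_w(double active_w, double idle_w, double duty) {
    return duty * active_w + (1.0 - duty) * idle_w;
}

/* Energy in joules needed to finish a fixed job of `cycles` clock cycles
   at `freq_hz` while drawing `active_w` watts. */
static double job_energy_j(double cycles, double freq_hz, double active_w) {
    return active_w * (cycles / freq_hz);
}
```

Under this model, a faster part that finishes the job sooner and then powers off can match the average power of a slower part, which is the sense in which average power (battery life) and thermal power (sustained maximum) are separate parameters.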




   7                                                                                                                                                                       8
Power Domains
 HiP and MID macros have same power domains
     Both use distributed coarse grain power switches
     Power plan for CPUs is symmetric
 A9 core and its L1 are power gated in lockstep
 Note that all power domains are only ON or OFF, there is no hardware retention mode
     Software routine enables retention to RAM
 [Diagram: “Osprey macro” hierarchy A9_PL310 / A9_PL310_noram, containing Data Engine 0, Data Engine 1, A9 CORE 0 + 32K I/D, A9 CORE 1 + 32K I/D, PTM/Debug, SCU + PL310_noram, and L2 Cache RAM 512/1024KB]

Single-thread Coremarks/MHz
 Single-thread performance is key for GUI based applications

 Processor     Coremarks/MHz
 Atom          1.85
 Cortex-A9     2.95
 Cortex-A8     2.72
 1004K         2.33
 74K           2.30




9                                                                                          10




Floating Point Performance

Higher Flash Actionscript from A9
 [Charts not recovered in the text extraction; the only surviving label is “Intel”]




11                                                                                         12
ARM Architecture evolution
 Some not-entirely-RISC features
         LDM / STM
         Full predicated execution (ADDEQ r0, r1, r2)
         Carefully designed with customer/partner input considering gatecount
 Thumb
         16-bit instruction set (mostly using r0-r7) selected for compiler requirements
         Design goals: performance from 16-bit wide ROM, codesize
         Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
             Beneficial today – better performance from small caches
 Jazelle
         CPU mode allows direct execution of Java bytecodes
         ~60% of Java bytecodes directly executed by datapath
         Top of Java stack stored in registers
         Widely used in Nokia & DoCoMo handsets
     …

Dummies’ guide to Si implementation
 Basic Fab tech
     65nm, 40nm, 32nm, 28nm, etc.
 G vs. LP technology
     40G is 0.9V process, 40LP is 1.1V process
     Much lower leakage with LP, but half the performance
     Intermediate “LPG” from TSMC too! Island of G within LP
 Vt’s – each Vt requires additional mask step
     HVt – lower leakage, but slower
     RVt – regular Vt
     LVt – faster, but high leakage esp. at high temperature
 Cell library track size
     9-track, 12-track, 15-track (bigger => more powerful)
 Backed off implementation vs. pushed implementation
 High-K metal Gate
 Clock gating
 Well biasing…


13                                                                                        14




ARM Architecture Evolution
 Key technology additions by architecture generation
     V5 (ARM9, ARM10): Jazelle®, VFPv2
     V6 (ARM11): Thumb®-2, TrustZone™, SIMD
     V7 A&R: Thumb-EE, VFPv3, NEON™ Adv SIMD (execution environments: improved memory use; improved media and DSP)
     V7 M: Thumb-2 only (low cost MCU)

What is NEON?
 NEON is a wide SIMD data processing architecture
     Extension of the ARM instruction set
     32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
 NEON Instructions perform “Packed SIMD” processing
     Registers are considered as vectors of elements of the same data type
     Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
     Instructions perform the same operation in all lanes
 [Diagram: source registers Dn and Dm are divided into elements; the operation is applied lane by lane and the results are written to destination register Dd]


15                                                                                        16
Data Types
  NEON natively supports a set of common data types
         Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
         32-bit Single-precision Floating-point

         Size    Type suffixes                       Meaning
         .8      .I8 (.S8, .U8), .P8                 8-bit signed, unsigned integers; polynomials
         .16     .I16 (.S16, .U16), .P16             16-bit signed, unsigned integers; polynomials
         .32     .I32 (.S32, .U32), .F32             32-bit signed, unsigned integers; floats
         .64     .I64 (.S64, .U64)                   64-bit signed, unsigned integers

  Data types are represented using a bit-size and format letter

Registers
  NEON provides a 256-byte register file
         Distinct from the core registers
         Extension to the VFPv2 register file (VFPv3)
  Two explicitly aliased views
         32 x 64-bit registers (D0-D31)
         16 x 128-bit registers (Q0-Q15)
         [Diagram: D0 and D1 alias Q0, D2 and D3 alias Q1, ..., D30 and D31 alias Q15]
  Enables register trade-off
         Vector length
         Available registers
  Also uses the summary flags in the VFP FPSCR
         Adds a QC integer saturation summary flag
         No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit -> 32-bit)
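The two ways of dealing with overflow mentioned on the Registers slide (a wider result, or saturation with a summary flag) can be modelled per lane in plain C. This is a scalar sketch only: `qc_flag`, `widening_add_s16` and `saturating_add_s16` are hypothetical names that stand in for the FPSCR QC bit and for the per-lane behaviour of instructions such as VADDL.S16 and VQADD.S16.

```c
#include <stdint.h>

/* Hypothetical stand-in for the sticky QC (cumulative saturation) flag. */
static int qc_flag = 0;

/* Widening add: 16-bit + 16-bit -> 32-bit, so the carry is kept in the
   wider result instead of in a per-lane flag. */
static int32_t widening_add_s16(int16_t a, int16_t b) {
    return (int32_t)a + (int32_t)b;
}

/* Saturating add: the result is clamped to the int16_t range and the
   sticky flag is set whenever clamping happened. */
static int16_t saturating_add_s16(int16_t a, int16_t b) {
    int32_t wide = (int32_t)a + (int32_t)b;
    if (wide > INT16_MAX) { qc_flag = 1; return INT16_MAX; }
    if (wide < INT16_MIN) { qc_flag = 1; return INT16_MIN; }
    return (int16_t)wide;
}
```

For example, `widening_add_s16(30000, 30000)` gives 60000, while `saturating_add_s16(30000, 30000)` gives 32767 and leaves `qc_flag` set.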

 17                                                                                                         18




Vectors and Scalars
  Registers hold one or more elements of the same data type
         Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
         A register, data type combination describes a vector of elements
         [Diagram: 64-bit Dn (bits 63..0), e.g. D0 as one I64, D7 as two S32; 128-bit Qn (bits 127..0), e.g. Q0 as four F32, Q7 as sixteen S8]
  Some instructions can reference individual scalar elements
         Scalar elements are referenced using the array notation Vn[x]
         [Diagram: Q0 as four F32 elements, Q0[3] Q0[2] Q0[1] Q0[0] from most to least significant]
  Array ordering is always from the least significant bit

NEON in Audio
  FFT: 256-point, 16-bit signed complex numbers
         FFT is a key component of AAC, Voice/pattern recognition etc.
         Hand optimized assembler in both cases

         FFT time                              No NEON (v6 SIMD asm)    With NEON (v7 NEON asm)
         Cortex-A8 500MHz, actual silicon      15.2 us                  3.8 us (x 4.0 performance)

  Extreme example: FFT in ffmpeg: 12x faster
         C code -> handwritten asm
         Scalar -> vector processing
         Unpipelined FPU -> pipelined NEON single precision FPU
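The Vn[x] notation, with element 0 at the least significant end, can be illustrated with shifts on ordinary integers. The sketch below is a portable model, not NEON code: `qreg_t`, `qreg_lane_u32` and `dreg_lane_u8` are hypothetical names, and a Q register is modelled as two 64-bit halves.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit Q register as two 64-bit halves:
   lo holds bits 63..0, hi holds bits 127..64. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} qreg_t;

/* Return 32-bit element x (0..3) of q, counting from the least
   significant bit, i.e. the element written Qn[x]. */
static uint32_t qreg_lane_u32(qreg_t q, unsigned x) {
    uint64_t half = (x < 2u) ? q.lo : q.hi;  /* elements 0-1 live in the low half */
    unsigned shift = (x & 1u) * 32u;          /* odd elements sit in the upper 32 bits */
    return (uint32_t)(half >> shift);
}

/* Return 8-bit element x (0..7) of a 64-bit D register value. */
static uint8_t dreg_lane_u8(uint64_t d, unsigned x) {
    return (uint8_t)(d >> (x * 8u));
}
```

With `q.lo = 0x2222222211111111` and `q.hi = 0x4444444433333333`, element 0 is `0x11111111` and element 3 is `0x44444444`.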


 19                                                                                                         20
How to use NEON
OpenMAX DL library
         Library of common codec components and signal processing routines
         Status: Released on http://www.arm.com/products/esd/openmax_home.html

Vectorizing Compilers
         Exploits NEON SIMD automatically with existing source code
         Status: Released (in RVDS 3.1 Professional and later)
         Status: Codesourcery 2007q3 gcc and later

C Intrinsics
         C function call interface to NEON operations
         Supports all data types and operations supported by NEON
         Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)

Assembler
         For those who really want to optimize at the lowest level
         Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

For NEON instruction reference
         Official NEON instruction set reference is “Advanced SIMD” in the ARM Architecture Reference Manual v7 A & R edition
         Available to partners & www.arm.com request system



 21                                                                                             22




ARM RVDS & gcc vectorising compiler
    Source (test.c):
        int a[256], b[256], c[256];
        void foo(void) {
          int i;
          for (i=0; i<256; i++){
            a[i] = b[i] + c[i];
          }
        }

    armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c
        |L1.16|
             VLD1.32 {d0,d1},[r0]!
             SUBS    r3,r3,#1
             VLD1.32 {d2,d3},[r1]!
             VADD.I32 q0,q0,q1
             VST1.32 {d0,d1},[r2]!
             BNE |L1.16|

    gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c
        .L2:
             add r1, r0, ip
             add r3, r0, lr
             add r2, r0, r4
             add r0, r0, #8
             cmp r0, #1024
             fldd d7, [r3, #0]
             fldd d6, [r2, #0]
             vadd.i32      d7, d7, d6
             fstd d7, [r1, #0]
             bne .L2

    armcc generates better NEON code
    (gcc can use Q-regs with ‘-mvectorize-with-neon-quad’)

Intrinsics
    Include intrinsics header file
        #include <arm_neon.h>

    Use special NEON data types which correspond to D and Q registers, e.g.
        int8x8_t      D-register containing 8x 8-bit elements
        int16x4_t     D-register containing 4x 16-bit elements
        int32x4_t     Q-register containing 4x 32-bit elements

    Use special intrinsics versions of NEON instructions
        vin1 = vld1q_s32(ptr);
        vout = vaddq_s32(vin1, vin2);
        vst1q_s32(ptr, vout);

    Strongly typed!
        Use vreinterpret_s16_s32( ) to change the type
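The array-add loop from the vectorising compiler slide can also be written by hand with intrinsics. The following is a minimal sketch, assuming a compiler that defines `__ARM_NEON` (or the older `__ARM_NEON__`) when NEON is enabled; `add_s32_arrays` is a hypothetical helper name, and on other targets the code falls back to the plain scalar loop so that it still builds and runs.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_SKETCH 1
#endif

/* dst[i] = x[i] + y[i] for i in [0, n). Hypothetical helper, not a library API. */
static void add_s32_arrays(int32_t *dst, const int32_t *x, const int32_t *y, int n) {
    int i = 0;
#ifdef HAVE_NEON_SKETCH
    /* Main loop: four 32-bit lanes per iteration in Q registers. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t vx = vld1q_s32(x + i);
        int32x4_t vy = vld1q_s32(y + i);
        vst1q_s32(dst + i, vaddq_s32(vx, vy));  /* store takes the pointer first, then the value */
    }
#endif
    /* Scalar tail; this is the whole loop when NEON is not available. */
    for (; i < n; i++) {
        dst[i] = x[i] + y[i];
    }
}
```

Handling the tail separately keeps the sketch correct for lengths that are not a multiple of four; the 256-element arrays in the slide would never reach the scalar loop on a NEON build.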
 23                                                                                             24
NEON in opensource
 Bluez – official Linux Bluetooth protocol stack
      NEON sbc audio encoder
 Pixman (part of cairo 2D graphics library)
      Compositing/alpha blending
      X.Org, Mozilla Firefox, fennec, & Webkit browsers
      e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
 ffmpeg – libavcodec
      LGPL media player used in many Linux distros
      NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
      NEON Audio: AAC, Vorbis, WMA
 x264 – Google Summer Of Code 2009
      GPL H.264 encoder – e.g. for video conferencing
 Android – NEON optimizations
      Skia library, S32A_D565_Opaque 5x faster using NEON
      Available in Google Skia tree from 03-Aug-2009
 Eigen2 linear algebra library
 Ubuntu 09.04 – supports NEON
      NEON versions of critical shared-libraries

Many different levels of parallelism
 Multi-issue parallelism
 NEON SIMD parallelism
 Multi-core parallelism



25                                                                     26




ffmpeg (libavcodec) performance
 git.ffmpeg.org snapshot 21-Sep-09
 YouTube HQ video decode
      480x270, 30fps
      Including AAC audio
 Real silicon measurements
      OMAP3 Beagleboard
      ARM A9TC
 NEON ~2x overall performance

Scalability with SMP on Cortex-A9


27                                                                     28
NEON optimization example

Skia library S32A_D565_Opaque

 Size   Reference C   Google v6 asm   NEON asm   RVDS
 60     100%          128%            24%        64%
 64     100%          128%            22%        68%
 68     100%          127%            23%        63%
 980    100%          73%             23%        58%
 986    100%          73%             23%        58%




 29                                                 30




Processing code
vmovn.u16 d4, q12
vshr.u16 q11, q12, #5
vshr.u16 q10, q12, #6+5
vmovn.u16 d5, q11
vmovn.u16 d6, q10
vshl.u8  d4, d4, #3
vshl.u8  d5, d5, #2
vshl.u8  d6, d6, #3

vmovl.u8   q14, d31
vmovl.u8   q13, d31
vmovl.u8   q12, d31

vmvn.8     d30, d3
vmlal.u8   q14, d30, d6
vmlal.u8   q13, d30, d5
vmlal.u8   q12, d30, d4

vshr.u16 q8, q14, #5
vshr.u16 q9, q13, #6
vaddhn.u16 d6, q14, q8
vshr.u16 q8, q12, #5
vaddhn.u16 d5, q13, q9
vqadd.u8 d6, d6, d0
vaddhn.u16 d4, q12, q8

vqadd.u8    d6, d6, d0
vqadd.u8    d5, d5, d1
vqadd.u8    d4, d4, d2

vshll.u8   q10, d6, #8
vshll.u8   q3, d5, #8
vshll.u8   q2, d4, #8
vsri.u16   q10, q3, #5
vsri.u16   q10, q2, #11

Cortex-A8 TRM


 31                                                 32
Quick review of NEON instructions

Multiple 1-Element Structure Access
 VLD1, VST1 provide standard array access
      An array of structures containing a single component is a basic array
      List can contain 1, 2, 3 or 4 consecutive registers
      Transfer multiple consecutive 8, 16, 32 or 64-bit elements

 [Figure: VLD1.16 {D7}, [R4], R3 reads the 16-bit elements x0 x1 x2 x3 from [R4] .. [R4+6] into D7 (x0 in the lowest lane), then post-increments R4 by R3]
 [Figure: VST1.16 {D3,D4}, [R1] writes x0 .. x7 to [R1] .. [R1+14], taking x0..x3 from D3 and x4..x7 from D4]




33                                                                                  34




Addition: Basic
 NEON supports various useful forms of basic addition

 Normal Addition - VADD, VSUB
      Floating-point
      Integer (8-bit to 64-bit elements)
      64-bit and 128-bit registers
           VADD.I16    D0,  D1,  D2
           VSUB.F32    Q7,  Q1,  Q4
           VADD.I8     Q15, Q14, Q15
           VSUB.I64    D0,  D30, D5

 Long Addition - VADDL, VSUBL
      Promotes both inputs before operation
      Signed/unsigned (8-bit to 32-bit source elements)
           VADDL.U16   Q1, D7, D8
           VSUBL.S32   Q8, D1, D5

 Wide Addition - VADDW, VSUBW
      Promotes one input before operation
      Signed/unsigned (8-bit to 32-bit source elements)
           VADDW.U8    Q1, Q7, D8
           VSUBW.S16   Q8, Q1, D5

Example – adding all lanes
 Input in Q0 (D0 and D1): u16 input values
           VPADDL.U16  Q0, Q0
 Now Q0 contains 4x u32 values (with 15 headroom bits)
           VPADD.U32   D0, D0, D1
 Reducing/folding operation needs 1 bit of headroom
           VPADDL.U32  D0, D0
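
The "adding all lanes" sequence (VPADDL.U16, VPADD.U32, VPADDL.U32) can be followed step by step with a scalar model. This is a sketch in plain C rather than NEON code; the function name `sum_u16x8` and the array layout (lane 0 first) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the three-instruction reduction on the slide.
   q0 holds the eight u16 lanes of Q0, lowest lane first. */
uint64_t sum_u16x8(const uint16_t q0[8])
{
    uint32_t q[4];  /* Q0 after VPADDL.U16 Q0, Q0    : 4 x u32 */
    uint32_t d[2];  /* D0 after VPADD.U32 D0, D0, D1 : 2 x u32 */

    assert(q0 != NULL);

    /* VPADDL: widen each adjacent pair to 32 bits and add it. */
    for (int i = 0; i < 4; ++i)
        q[i] = (uint32_t)q0[2 * i] + q0[2 * i + 1];

    /* VPADD: same-width pairwise add, so it consumes 1 bit of headroom. */
    d[0] = q[0] + q[1];  /* pair taken from D0 */
    d[1] = q[2] + q[3];  /* pair taken from D1 */

    /* VPADDL.U32 D0, D0: widen to 64 bits and add the last pair. */
    return (uint64_t)d[0] + d[1];
}
```

Each 32-bit lane after the first step holds at most 2 x 65535, which fits in 17 bits; that is where the 15 headroom bits quoted on the slide come from.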



35                                                                                  36
Exercise 2 - summing a vector
 [Figure: tree of pairwise additions folding the lanes of D0 and D1 down into D0]

Some NEON clever features
+


37                                                                                                                      38




Data Movement: Table Lookup
    Uses byte indexes to control byte look up in a table
        Table is a list of 1,2,3 or 4 adjacent registers

        Indexes      D3:          11   4   8  13  26   8   0   3
        Table   {D1,D2}:   0   p o n m l k j i h g f e d c b a
        Result       D0:           l   e   i   n   0   i   a   d

        VTBL.8 D0, {D1, D2}, D3

    VTBL : out of range indexes generate 0 result
    VTBX : out of range indexes leave destination unchanged

Element Load Store Instructions
 All treat memory as an array of structures (AoS)
      SIMD registers are treated as structure of arrays (SoA)
      Enables interleaving/de-interleaving for efficient SIMD processing
      Transfer up to 256-bits in a single instruction

      [Figure: memory holding x3 z2 y2 x2 z1 y1 x1 z0 y0 x0; each value is an element, each x/y/z group is a 3-element structure]

 Three forms of Element Load Store instructions are provided
 Forms distinguished by type of register list provided
      Multiple Structure Access e.g. {D0, D1}
      Single Structure Access e.g. {D0[2], D1[2]}
      Single Structure Load to all lanes e.g. {D0[], D1[]}
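
The VTBL/VTBX out-of-range rule from the table lookup slide can be written as a scalar model. This is a sketch in plain C, not NEON code; the names `vtbl8` and `vtbx8` and the byte-array representation of the registers (lane 0 first) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of VTBL.8 on one D register (8 byte lanes).
   table_len is 8, 16, 24 or 32 bytes, i.e. 1 to 4 D registers. */
void vtbl8(uint8_t dst[8], const uint8_t *table, size_t table_len,
           const uint8_t idx[8])
{
    assert(table_len >= 8 && table_len <= 32 && table_len % 8 == 0);
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = idx[lane] < table_len ? table[idx[lane]] : 0;
}

/* Scalar model of VTBX.8: identical, except that an out-of-range
   index leaves the destination lane unchanged. */
void vtbx8(uint8_t dst[8], const uint8_t *table, size_t table_len,
           const uint8_t idx[8])
{
    assert(table_len >= 8 && table_len <= 32 && table_len % 8 == 0);
    for (int lane = 0; lane < 8; ++lane)
        if (idx[lane] < table_len)
            dst[lane] = table[idx[lane]];
}
```

With the 16-byte table a..p from the slide, index 26 is out of range: the VTBL model writes 0 into that lane, while the VTBX model keeps whatever the lane held before.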

39                                                                                                                      40
Multiple 2-Element Structure Access
 VLD2, VST2 provide access to multiple 2-element structures
       List can contain 2 or 4 registers
       Transfer multiple consecutive 8, 16, or 32-bit 2-element structures

 [Figure: VLD2.16 {D2,D3}, [R1] reads x0 y0 x1 y1 x2 y2 x3 y3 from [R1] .. [R1+14]; D2 receives x3 x2 x1 x0 and D3 receives y3 y2 y1 y0]
 [Figure: VLD2.16 {D0,D1,D2,D3}, [R3]! reads x0 y0 .. x7 y7 from [R3] .. [R3+30] with writeback (!); D0 = x3..x0, D1 = x7..x4, D2 = y3..y0, D3 = y7..y4]

Multiple 3/4-Element Structure Access
 VLD3/4, VST3/4 provide access to 3 or 4-element structures
       Lists contain 3/4 registers; optional space for building 128-bit vectors
       Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures

 [Figure: VST3.16 {D3,D4,D5}, [R1] writes x0 y0 z0 .. x3 y3 z3 to [R1] .. [R1+22], taking x from D3, y from D4 and z from D5]
 [Figure: VLD3.16 {D0,D2,D4}, [R1]! reads the same layout with writeback (!); D0 = x3..x0, D2 = y3..y0, D4 = z3..z0, with D1 and D3 left untouched]
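
The de-interleaving done by VLD2 and the interleaving done by VST3 can be modelled in scalar C. This is a sketch under stated assumptions: the function names `vld2_16` and `vst3_16` are hypothetical, each array stands for one D register with lane 0 first, and address writeback (`!`) is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of VLD2.16 {Dx,Dy}, [Rn]: memory is x0 y0 x1 y1 x2 y2 x3 y3;
   the x components go to one register and the y components to the other. */
void vld2_16(uint16_t dx[4], uint16_t dy[4], const uint16_t *mem)
{
    assert(mem != NULL);
    for (int i = 0; i < 4; ++i) {
        dx[i] = mem[2 * i];      /* element 0 of structure i */
        dy[i] = mem[2 * i + 1];  /* element 1 of structure i */
    }
}

/* Model of VST3.16 {Dx,Dy,Dz}, [Rn]: the reverse direction, writing
   four 3-element structures x0 y0 z0 x1 y1 z1 ... back to memory. */
void vst3_16(uint16_t *mem, const uint16_t dx[4], const uint16_t dy[4],
             const uint16_t dz[4])
{
    assert(mem != NULL);
    for (int i = 0; i < 4; ++i) {
        mem[3 * i]     = dx[i];
        mem[3 * i + 1] = dy[i];
        mem[3 * i + 2] = dz[i];
    }
}
```

The point of the structure loads is visible in the model: after `vld2_16`, all x values sit in one register and all y values in another, so a single SIMD operation can process four structures at once.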



41                                                                                             42




Logical
 NEON supports bitwise logical operations

 VAND, VBIC, VEOR, VORN, VORR
       Bitwise logical operation
       Independent of data type
       64-bit and 128-bit registers
            VAND         D0,   D0,    D1
            VORR         Q0,   Q1,    Q15
            VEOR         Q7,   Q1,    Q15
            VORN         D15, D14, D1
            VBIC         D0,   D30, D2

 VBIT, VBIF, VBSL
       Bitwise multiplex operations
       Insert True, Insert False, Select
       3 versions overwrite different registers
       64-bit and 128-bit registers
       Used with masks to provide selection

       [Figure: VBIT D1, D0, D2 with mask bits 0 1 0 1 1 0 in D2 choosing, bit by bit, whether D1 takes the bit from D0 or keeps its own]

Alignment hints on NEON load/store
 NEON data load/store: VLDn/VSTn
       Full unaligned support for NEON data access
       Instruction contains ‘alignment hint’ which permits implementations to be faster when address is aligned and hint is specified.
            Usage: base address specified as [<Rn>:<align>]
            Note it is a programming error to specify hint, but use incorrectly aligned address
            Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs
                 VLD1.8        {D0},          [R1:64]
                 VLD1.8        {D0,D1},       [R4:128]!
                 VLD1.8        {D0,D1,D2,D3}, [R7:256]!, R2
            ARM ARM uses “@” but this is not recommended in source code
            GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
       Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
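
The three bitwise multiplex operations from the Logical slide (VBSL, VBIT, VBIF) differ only in which register is overwritten and which one acts as the mask. The following is a scalar sketch on 64-bit values standing in for D registers; the function names and the `mux_demo` helper are assumptions for illustration, and each function returns the new value of the destination Dd.

```c
#include <assert.h>
#include <stdint.h>

/* VBSL Dd, Dn, Dm: Dd is the mask; 1 bits select Dn, 0 bits select Dm. */
uint64_t vbsl(uint64_t d, uint64_t n, uint64_t m)
{
    return (n & d) | (m & ~d);
}

/* VBIT Dd, Dn, Dm: insert bits of Dn into Dd where the mask Dm is 1. */
uint64_t vbit(uint64_t d, uint64_t n, uint64_t m)
{
    return (n & m) | (d & ~m);
}

/* VBIF Dd, Dn, Dm: insert bits of Dn into Dd where the mask Dm is 0. */
uint64_t vbif(uint64_t d, uint64_t n, uint64_t m)
{
    return (d & m) | (n & ~m);
}

/* Hypothetical usage: a per-bit merge of two values under a mask. */
void mux_demo(void)
{
    assert(vbit(0x00, 0xFF, 0x0F) == 0x0F);  /* low nibble inserted  */
    assert(vbif(0x00, 0xFF, 0x0F) == 0xF0);  /* high nibble inserted */
}
```

A typical use is to build the mask with a NEON compare, which sets each lane to all ones or all zeros, and then select between two candidate results without a branch.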


43                                                                                             44
Dual issue [Cortex-A8 only]
 NEON can dual-issue two NEON instructions in the following circumstances
      No register operand/result dependencies
      One is a NEON data processing (ALU) instruction
      The other is a NEON load/store or NEON byte permute instruction or MRC/MCR
           VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX

         VLD1.8        {D0}, [R1]!
         VMLAL.S8      Q2, D3, D2

         VEXT.8        D0, D1, D2, #1
         SUBS          r12, r12, #1

 Also can dual-issue NEON with ARM instructions
         VLD1.8        {D0}, [R1]!
         SUBS          r12, r12, #1

Thank you!
 ARM Architecture has evolved with a balance of pure RISC and customer driven input

 NEON offers a clean architecture targeted at compiler code generation, offering
      Unaligned access
      Structure load/store operations
      Dual D-register/Q-register view to optimize register bank
      Balance of performance vs. gatecount

 Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications

45                                                                           46

Lect.10.arm soc.4 neon

  • 1. A brief history of ARM First ARM prototype came alive on 26-Apr-1985, 3um technology, 24800 transistors 50mm2, consumed 120mW of power ARM Architecture & NEON Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline ARM founded in October 1990, separate company (Apple had 43% stake) ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994 Ian Rickards Stanford University 28 Apr 2010 1 2 ARM 25 years later: Cortex-A9 MP Cortex-A9 Processor Microarchitecture 1-4 way MP with optimized MESI Introduces out-of- order instruction 16KB, 32KB, 64KB I & D caches issue and completion 128KB-8MB L2 Multi-issue, Speculation, Renaming, OOO Register renaming to High performance FPU option enable execution speculation NEON SIMD option Thumb-2 Non-blocking memory system AXI bus with load-store forwarding Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON) Fast loop mode in instruction pre- 40G “Low Power” macro: ~5mm2, 800MHz, 0.5W fetch to lower power 40G “High Performance” macro: ~7mm2 2GHz (typ), 2W consumption 3 4
  • 2. Cortex-A9 MPCore Multicore Structure
         Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
         Hardware coherence for cache, MMU and TLB maintenance operations
         Coherent access to processor caches from accelerators and DMA
         Flexible configuration and power-aware interrupt controller
         Design flexibility over memory throughput and latency
         Secure and Virtualization aware interrupt and IPI communications
         Block diagram labels: 4x Cortex-A9 CPU, each with FPU/NEON, TRACE, Instruction Cache and Data Cache; Snoop Control Unit (SCU); Generalized Interrupt Control and Distribution; Accelerator Coherence Port; Cache-2-Cache Transfers; Snoop Filtering; Timers; Advanced Bus Interface Unit; L2 Cache Controller (PL310) 128K-8MB; Primary AMBA 3 64bit Interface; Optional 2nd I/F with Address Filtering
       Hard Macro Configuration and Floorplan
         Osprey configuration includes level 2 cache controller and Cortex A9 integration level
         Top level includes Coresight PTM, CTI and CTM
         Implementation using r1p1 version of Cortex A9
         Dual core
         32k I$ and D$
         NEON present on both cores
         PTM interface present
         128 interrupts
         ACP present
         Two AXI master ports
         Level 2 cache memories external (interface exposed)
         Figures: falcon_cpu floorplan; Elba top level floorplan
       Why is ARM looking at “G” processes?
         “G” can achieve around double the MHz of “LP”
         Active power is lower on “G” than “LP”
         Example: push 40LP to 800MHz, to compare with 800MHz MID macro
         The estimated LP numbers correlate to an accelerated implementation of an A8
         G is close in terms of power if lowered to same performance as on LP
         G can scale much higher in terms of performance than LP
       Understanding power
         Fundamental power parameters
         Average power => battery life
         Thermal power: sustained power @ max performance
         Key requirement is “run and power off” quickly
         Figure: power over time for GUI updates, web page render and music, comparing a traditional LP process with the 40G process (Osprey), which runs with a 2-3x faster clock and then powers off
  • 3. Power Domains
         HiP and MID macros have same power domains
         Both use distributed coarse grain power switches
         Power plan for CPUs is symmetric
         A9 core and its L1 is power gated in lockstep
         Note that all power domains are only ON or OFF, there is no hardware retention mode
         Software routine enables retention to RAM
         “Osprey macro” diagram labels: A9_PL310; A9_PL310_noram; Data Engine 0; Data Engine 1; PTM/Debug; A9 CORE 1 + 32K I/D; A9 CORE 0 + 32K I/D; SCU + PL310_noram; L2 Cache RAM 512/1024KB
       Single-thread Coremarks/MHz
         Single-thread performance is key for GUI based applications
         Chart values (Coremarks/MHz):
           Atom        1.85
           Cortex-A9   2.95
           Cortex-A8   2.72
           1004K       2.33
           74K         2.30
       Floating Point Performance
         (chart only)
       Higher Flash Actionscript from A9
         (chart only; label: Intel)
  • 4. ARM Architecture evolution
         Some not-entirely-RISC features
           LDM / STM
           Full predicated execution (ADDEQ r0, r1, r2)
           Carefully designed with customer/partner input considering gatecount
         Thumb
           16-bit instruction set (mostly using r0-r7) selected for compiler requirements
           Design goals: performance from 16-bit wide ROM, codesize
           Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
           Beneficial today – better performance from small caches
         Jazelle
           CPU mode allows direct execution of Java bytecodes
           ~60% of Java bytecodes directly executed by datapath
           Top of Java stack stored in registers
           Widely used in Nokia & DoCoMo handsets
       Dummies’ guide to Si implementation
         Basic Fab tech
           65nm, 40nm, 32nm, 28nm, etc.
         G vs. LP technology
           40G is 0.9V process, 40LP is 1.1V process
           Much lower leakage with LP, but half the performance
           Intermediate “LPG” from TSMC too! Island of G within LP
         Vt’s – each Vt requires additional mask step
           HVt – lower leakage, but slower
           RVt – regular Vt
           LVt – faster, but high leakage esp. at high temperature
         Cell library track size
           9-track, 12-track, 15-track (bigger => more powerful)
         Backed off implementation vs. pushed implementation
         High-K metal Gate
         Clock gating … Well biasing…
       ARM Architecture Evolution
         Key Technology Additions by Architecture Generation
         Diagram labels: architecture generations V5, V6, V7 A&R, V7 M; cores ARM9, ARM10, ARM11; additions Jazelle®, VFPv2, SIMD, TrustZone™, Thumb®-2, Improved Media and DSP, NEON™ Adv SIMD, VFPv3, Thumb-EE, Execution Environments: Improved memory use, Low Cost MCU, Thumb-2 Only
       What is NEON?
         NEON is a wide SIMD data processing architecture
           Extension of the ARM instruction set
           32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
         NEON instructions perform “Packed SIMD” processing
           Registers are considered as vectors of elements of the same data type
           Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
           Instructions perform the same operation in all lanes
         Diagram labels: source registers Dn and Dm, split into elements; one operation per lane; destination register Dd
  • 5. Data Types
         NEON natively supports a set of common data types
           Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
           32-bit Single-precision Floating-point
         Data types are represented using a bit-size and format letter
           .8:  .I8 (.S8, .U8), .P8      8-bit signed, unsigned integers; polynomials
           .16: .I16 (.S16, .U16), .P16  16-bit signed, unsigned integers; polynomials
           .32: .I32 (.S32, .U32), .F32  32-bit signed, unsigned integers; floats
           .64: .I64 (.S64, .U64)        64-bit signed, unsigned integers
       Registers
         NEON provides a 256-byte register file
           Distinct from the core registers
           Extension to the VFPv2 register file (VFPv3)
         Two explicitly aliased views
           32 x 64-bit registers (D0-D31)
           16 x 128-bit registers (Q0-Q15)
           Q0 aliases D0 and D1, Q1 aliases D2 and D3, ..., Q15 aliases D30 and D31
         Enables register trade-off
           Vector length
           Available registers
         Also uses the summary flags in the VFP FPSCR
           Adds a QC integer saturation summary flag
           No per-lane flags, so ‘carry’ handled using wider result (16-bit + 16-bit -> 32-bit)
       Vectors and Scalars
         Registers hold one or more elements of the same data type
           Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
           A register, data type combination describes a vector of elements
           Diagram examples: a 64-bit Dn holding 1x I64, 2x S32 or 8x S8; a 128-bit Qn holding 4x F32 or 16x S8
         Some instructions can reference individual scalar elements
           Scalar elements are referenced using the array notation Vn[x]
           Example: Q0 as 4x F32 is Q0[3], Q0[2], Q0[1], Q0[0]
         Array ordering is always from the least significant bit
       NEON in Audio
         FFT: 256-point, 16-bit signed complex numbers
           FFT is a key component of AAC, voice/pattern recognition etc.
           Hand optimized assembler in both cases
         FFT time                            No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
         Cortex-A8 500MHz, actual silicon    15.2 us                 3.8 us (x 4.0 performance)
         Extreme example: FFT in ffmpeg: 12x faster
           C code -> handwritten asm
           Scalar -> vector processing
           Unpipelined FPU -> pipelined NEON single precision FPU
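The two aliased register views can be pictured with a small portable C model. This is only a sketch of the layout described on the Registers slide, not NEON code; the type `neon_regfile` and the helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the 256-byte NEON register file. */
typedef struct {
    uint8_t bytes[256];
} neon_regfile;

/* Read 64-bit register Dn (n = 0..31). */
static uint64_t read_d(const neon_regfile *rf, unsigned n) {
    uint64_t v;
    assert(n < 32);
    memcpy(&v, &rf->bytes[8 * n], sizeof v);
    return v;
}

/* Write 64-bit register Dn (n = 0..31). */
static void write_d(neon_regfile *rf, unsigned n, uint64_t v) {
    assert(n < 32);
    memcpy(&rf->bytes[8 * n], &v, sizeof v);
}

/* Qn (n = 0..15) is the same storage as D(2n) (low half) and D(2n+1) (high half). */
static uint64_t read_q_lo(const neon_regfile *rf, unsigned n) {
    return read_d(rf, 2 * n);
}

static uint64_t read_q_hi(const neon_regfile *rf, unsigned n) {
    return read_d(rf, 2 * n + 1);
}
```

In this model, writing D31 changes the high half of Q15, which is the trade-off the slide points at: using a Q register consumes two D registers.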
  • 6. How to use NEON
         OpenMAX DL library
           Library of common codec components and signal processing routines
           Status: Released on http://www.arm.com/products/esd/openmax_home.html
         Vectorizing Compilers
           Exploits NEON SIMD automatically with existing source code
           Status: Released (in RVDS 3.1 Professional and later)
           Status: Codesourcery 2007q3 gcc and later
         C Intrinsics
           C function call interface to NEON operations
           Supports all data types and operations supported by NEON
           Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
         Assembler
           For those who really want to optimize at the lowest level
           Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)
       For NEON instruction reference
         Official NEON instruction set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
         Available to partners & www.arm.com request system
       ARM RVDS & gcc vectorising compiler
         Source (test.c):
           int a[256], b[256], c[256];
           void foo(void) {
               int i;
               for (i = 0; i < 256; i++) {
                   a[i] = b[i] + c[i];
               }
           }
         armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c
           |L1.16|
               VLD1.32  {d0,d1},[r0]!
               SUBS     r3,r3,#1
               VLD1.32  {d2,d3},[r1]!
               VADD.I32 q0,q0,q1
               VST1.32  {d0,d1},[r2]!
               BNE      |L1.16|
         gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c
           .L2:
               add      r1, r0, ip
               add      r3, r0, lr
               add      r2, r0, r4
               add      r0, r0, #8
               cmp      r0, #1024
               fldd     d7, [r3, #0]
               fldd     d6, [r2, #0]
               vadd.i32 d7, d7, d6
               fstd     d7, [r1, #0]
               bne      .L2
         armcc generates better NEON code (gcc can use Q-regs with ‘-mvectorize-with-neon-quad’)
       Intrinsics
         Include intrinsics header file
           #include <arm_neon.h>
         Use special NEON data types which correspond to D and Q registers, e.g.
           int8x8_t   D-register containing 8x 8-bit elements
           int16x4_t  D-register containing 4x 16-bit elements
           int32x4_t  Q-register containing 4x 32-bit elements
         Use special intrinsics versions of NEON instructions
           vin1 = vld1q_s32(ptr);
           vout = vaddq_s32(vin1, vin2);
           vst1q_s32(ptr, vout);
         Strongly typed!
           Use vreinterpret_s16_s32( ) to change the type
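As a sketch of how the intrinsics above fit together, the a[i] = b[i] + c[i] loop from the vectorising-compiler slide can be written by hand. The function name `add_i32` and the multiple-of-4 restriction are assumptions made for this example; the scalar branch is included only so the same file also builds on machines without NEON. Note that the store intrinsic takes the destination pointer as its first argument.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* a[i] = b[i] + c[i] for n elements; n is assumed to be a multiple of 4. */
static void add_i32(int32_t *a, const int32_t *b, const int32_t *c, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        int32x4_t vb = vld1q_s32(&b[i]);      /* load 4 lanes into a Q register */
        int32x4_t vc = vld1q_s32(&c[i]);
        vst1q_s32(&a[i], vaddq_s32(vb, vc));  /* add all lanes, store 4 results */
#else
        /* Scalar fallback: one iteration per lane, wrapping like VADD.I32. */
        for (int lane = 0; lane < 4; lane++)
            a[i + lane] = (int32_t)((uint32_t)b[i + lane] + (uint32_t)c[i + lane]);
#endif
    }
}
```

When targeting ARM, the NEON path is typically enabled with the same `-mfpu=neon` style flags shown on the compiler slide.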
  • 7. NEON in opensource
         Bluez – official Linux Bluetooth protocol stack
           NEON sbc audio encoder
         Pixman (part of cairo 2D graphics library)
           Compositing/alpha blending
           X.Org, Mozilla Firefox, fennec, & Webkit browsers
           e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
         ffmpeg – libavcodec
           LGPL media player used in many Linux distros
           NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
           NEON Audio: AAC, Vorbis, WMA
         x264 – Google Summer Of Code 2009
           GPL H.264 encoder – e.g. for video conferencing
         Android – NEON optimizations
           Skia library, S32A_D565_Opaque 5x faster using NEON
           Available in Google Skia tree from 03-Aug-2009
         Eigen2 linear algebra library
         Ubuntu 09.04 – supports NEON
           NEON versions of critical shared-libraries
       Many different levels of parallelism
         Multi-issue parallelism
         NEON SIMD parallelism
         Multi-core parallelism
       ffmpeg (libavcodec) performance
         git.ffmpeg.org snapshot 21-Sep-09
         YouTube HQ video decode
           480x270, 30fps
           Including AAC audio
         Real silicon measurements
           OMAP3 Beagleboard
           ARM A9TC
         NEON ~2x overall performance
       Scalability with SMP on Cortex-A9
         (chart only)
  • 8. NEON optimization example
       Skia library S32A_D565_Opaque
         Size   Reference C   Google v6 asm   NEON asm   RVDS
         60     100%          128%            24%        64%
         64     100%          128%            22%        68%
         68     100%          127%            23%        63%
         980    100%          73%             23%        58%
         986    100%          73%             23%        58%
       Processing code
           vmovn.u16   d4, q12
           vshr.u16    q11, q12, #5
           vshr.u16    q10, q12, #6+5
           vmovn.u16   d5, q11
           vmovn.u16   d6, q10
           vshl.u8     d4, d4, #3
           vshl.u8     d5, d5, #2
           vshl.u8     d6, d6, #3
           vmovl.u8    q14, d31
           vmovl.u8    q13, d31
           vmovl.u8    q12, d31
           vmvn.8      d30, d3
           vmlal.u8    q14, d30, d6
           vmlal.u8    q13, d30, d5
           vmlal.u8    q12, d30, d4
           vshr.u16    q8, q14, #5
           vshr.u16    q9, q13, #6
           vaddhn.u16  d6, q14, q8
           vshr.u16    q8, q12, #5
           vaddhn.u16  d5, q13, q9
           vqadd.u8    d6, d6, d0
           vaddhn.u16  d4, q12, q8
           vqadd.u8    d6, d6, d0
           vqadd.u8    d5, d5, d1
           vqadd.u8    d4, d4, d2
           vshll.u8    q10, d6, #8
           vshll.u8    q3, d5, #8
           vshll.u8    q2, d4, #8
           vsri.u16    q10, q3, #5
           vsri.u16    q10, q2, #11
       Cortex-A8 TRM
         (figure only)
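The first and last few instructions of the processing code unpack a 16-bit 565 pixel into 8-bit channels and pack the blended result back. The scalar model below sketches only those two steps for a single lane; it does not reproduce the blend in the middle. The names `rgb8`, `unpack565` and `pack565` are hypothetical, and the channel naming assumes the usual RGB565 layout with red in the top five bits.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b; } rgb8;  /* hypothetical per-lane view */

/* One lane of: vmovn/vshr.u16 to isolate a field, then vshl.u8 to left-align it. */
static rgb8 unpack565(uint16_t p) {
    rgb8 c;
    c.b = (uint8_t)((uint8_t)p << 3);          /* vmovn.u16, vshl.u8 #3          */
    c.g = (uint8_t)((uint8_t)(p >> 5) << 2);   /* vshr.u16 #5, vmovn, vshl.u8 #2 */
    c.r = (uint8_t)((uint8_t)(p >> 11) << 3);  /* vshr.u16 #6+5, vmovn, vshl #3  */
    return c;
}

/* One lane of: vshll.u8 #8 to widen, then vsri.u16 to shift-right-and-insert. */
static uint16_t pack565(rgb8 c) {
    uint16_t out = (uint16_t)(c.r << 8);             /* vshll.u8 #8           */
    uint16_t g16 = (uint16_t)(c.g << 8);
    uint16_t b16 = (uint16_t)(c.b << 8);
    out = (uint16_t)((out & 0xF800) | (g16 >> 5));   /* vsri.u16 #5: keep top 5 bits   */
    out = (uint16_t)((out & 0xFFE0) | (b16 >> 11));  /* vsri.u16 #11: keep top 11 bits */
    return out;
}

/* Packing an unpacked pixel should give back the original value. */
static void check_roundtrip(void) {
    for (uint32_t p = 0; p <= 0xFFFF; p++)
        assert(pack565(unpack565((uint16_t)p)) == p);
}
```

The shift-and-insert form is why no explicit masking instructions appear in the NEON listing: VSRI preserves the destination's upper bits while inserting the shifted source.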
  • 9. Quick review of NEON instructions
       Multiple 1-Element Structure Access
         VLD1, VST1 provide standard array access
           An array of structures containing a single component is a basic array
           List can contain 1, 2, 3 or 4 consecutive registers
           Transfer multiple consecutive 8, 16, 32 or 64-bit elements
         VLD1.16 {D7}, [R4], R3
           Memory: [R4] x0, +2 x1, +4 x2, +6 x3; R4 is then incremented by R3
           D7 = x3 x2 x1 x0
         VST1.16 {D3,D4}, [R1]
           Memory: [R1] x0, +2 x1, +4 x2, +6 x3, +8 x4, +10 x5, +12 x6, +14 x7
           D3 = x3 x2 x1 x0, D4 = x7 x6 x5 x4
       Addition: Basic
         NEON supports various useful forms of basic addition
         Normal Addition - VADD, VSUB
           Floating-point
           Integer (8-bit to 64-bit elements)
           64-bit and 128-bit registers
             VADD.I16 D0, D1, D2
             VSUB.F32 Q7, Q1, Q4
             VADD.I8 Q15, Q14, Q15
             VSUB.I64 D0, D30, D5
         Long Addition - VADDL, VSUBL
           Promotes both inputs before operation
           Signed/unsigned (8-bit to 32-bit source elements)
             VADDL.U16 Q1, D7, D8
             VSUBL.S32 Q8, D1, D5
         Wide Addition - VADDW, VSUBW
           Promotes one input before operation
           Signed/unsigned (8-bit to 32-bit source elements)
             VADDW.U8 Q1, Q7, D8
             VSUBW.S16 Q8, Q1, D5
       Example – adding all lanes
         Input in Q0 (D0 and D1): u16 input values
         VPADDL.U16 Q0, Q0
           Now Q0 contains 4x u32 values (with 15 headroom bits)
         VPADD.U32 D0, D0, D1
           Reducing/folding operation needs 1 bit of headroom
         VPADDL.U32 D0, D0
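The three-instruction lane sum can be followed step by step with a portable scalar model. This is a sketch of the arithmetic only; the helper names mirror the mnemonics but are hypothetical C functions, not intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Model of VPADDL.U16 Qd, Qm: add adjacent u16 pairs, widening each sum to u32. */
static void vpaddl_u16(uint32_t out[4], const uint16_t in[8]) {
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
}

/* Model of VPADD.U32 Dd, Dn, Dm: pairwise add within Dn, then within Dm. */
static void vpadd_u32(uint32_t out[2], const uint32_t n[2], const uint32_t m[2]) {
    out[0] = n[0] + n[1];
    out[1] = m[0] + m[1];
}

/* Model of VPADDL.U32 Dd, Dm: add the two u32 lanes into a single u64. */
static uint64_t vpaddl_u32(const uint32_t in[2]) {
    return (uint64_t)in[0] + in[1];
}

/* Sum all eight u16 lanes of Q0 using the sequence from the slide. */
static uint64_t sum_lanes_u16(const uint16_t q0[8]) {
    uint32_t wide[4], folded[2];
    vpaddl_u16(wide, q0);
    for (int i = 0; i < 4; i++)
        assert(wide[i] <= 0x1FFFEu);         /* 17 bits used, 15 bits of headroom */
    vpadd_u32(folded, &wide[0], &wide[2]);   /* D0 = low half, D1 = high half     */
    return vpaddl_u32(folded);
}
```

The widening first step is what makes the non-widening VPADD safe: each u32 lane starts with 15 spare bits and the fold consumes only one of them.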
  • 10. Exercise 2 - summing a vector
         (diagram only: a tree of additions folding the lanes of D0 and D1 into a single result in D0)
       Some NEON clever features
       Data Movement: Table Lookup
         Uses byte indexes to control byte look up in a table
           Table is a list of 1, 2, 3 or 4 adjacent registers
         VTBL.8 D0, {D1, D2}, D3
           D3 (indexes)     = 11 4 8 13 26 8 0 3
           {D1,D2} (table)  = p o n m l k j i h g f e d c b a
           D0 (result)      = l e i n 0 i a d
         VTBL: out of range indexes generate 0 result
         VTBX: out of range indexes leave destination unchanged
       Element Load Store Instructions
         All treat memory as an array of structures (AoS)
         SIMD registers are treated as structure of arrays (SoA)
           Enables interleaving/de-interleaving for efficient SIMD processing
           Transfer up to 256-bits in a single instruction
           Diagram: memory holding 3-element structures x0 y0 z0, x1 y1 z1, x2 y2 z2, x3 ...
         Three forms of Element Load Store instructions are provided
           Forms distinguished by type of register list provided
           Multiple Structure Access, e.g. {D0, D1}
           Single Structure Access, e.g. {D0[2], D1[2]}
           Single Structure Load to all lanes, e.g. {D0[], D1[]}
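The difference between VTBL and VTBX for out-of-range indexes can be sketched with a scalar model. The function names `vtbl8` and `vtbx8` are hypothetical; lane 0 is the least significant byte, so the slide's D3 value 11 4 8 13 26 8 0 3 becomes the array {3, 0, 8, 26, 13, 8, 4, 11}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of VTBL.8 Dd, {list}, Dm. The table is n_regs adjacent D registers,
   i.e. 8 * n_regs bytes. Out-of-range indexes produce 0. */
static void vtbl8(uint8_t dst[8], const uint8_t *table, size_t n_regs,
                  const uint8_t idx[8]) {
    assert(n_regs >= 1 && n_regs <= 4);
    for (int i = 0; i < 8; i++)
        dst[i] = (idx[i] < 8 * n_regs) ? table[idx[i]] : 0;
}

/* Model of VTBX.8 Dd, {list}, Dm. Out-of-range indexes leave dst[i] unchanged. */
static void vtbx8(uint8_t dst[8], const uint8_t *table, size_t n_regs,
                  const uint8_t idx[8]) {
    assert(n_regs >= 1 && n_regs <= 4);
    for (int i = 0; i < 8; i++)
        if (idx[i] < 8 * n_regs)
            dst[i] = table[idx[i]];
}
```

With the 16-byte table 'a'..'p', index 26 is out of range, so VTBL writes 0 into that lane while VTBX keeps whatever the destination already held.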
  • 11. Multiple 2-Element Structure Access
         VLD2, VST2 provide access to multiple 2-element structures
           List can contain 2 or 4 registers
           Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
         VLD2.16 {D2,D3}, [R1]
           Memory: [R1] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, +14 y3
           D2 = x3 x2 x1 x0, D3 = y3 y2 y1 y0
         VLD2.16 {D0,D1,D2,D3}, [R3]!
           Memory: [R3] x0, +2 y0, +4 x1, +6 y1, +8 x2, +10 y2, +12 x3, ..., +28 x7, +30 y7
           D0 = x3 x2 x1 x0, D1 = x7 x6 x5 x4, D2 = y3 y2 y1 y0, D3 = y7 y6 y5 y4
       Multiple 3/4-Element Structure Access
         VLD3/4, VST3/4 provide access to 3 or 4-element structures
           Lists contain 3/4 registers; optional space for building 128-bit vectors
           Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures
         VST3.16 {D3,D4,D5}, [R1]
           Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ..., +20 y3, +22 z3
           D3 = x3 x2 x1 x0, D4 = y3 y2 y1 y0, D5 = z3 z2 z1 z0
         VLD3.16 {D0,D2,D4}, [R1]!
           Memory: [R1] x0, +2 y0, +4 z0, +6 x1, +8 y1, +10 z1, +12 x2, ..., +20 y3, +22 z3
           D0 = x3 x2 x1 x0, D2 = y3 y2 y1 y0, D4 = z3 z2 z1 z0 (D1 and D3 are skipped)
       Logical
         NEON supports bitwise logical operations
         VAND, VBIC, VEOR, VORN, VORR
           Bitwise logical operation
           Independent of data type
           64-bit and 128-bit registers
             VAND D0, D0, D1
             VORR Q0, Q1, Q15
             VEOR Q7, Q1, Q15
             VORN D15, D14, D1
             VBIC D0, D30, D2
         VBIT, VBIF, VBSL
           Bitwise multiplex operations
           Insert True, Insert False, Select
           3 versions overwrite different registers
           64-bit and 128-bit registers
           Used with masks to provide selection
             VBIT D1, D0, D2 (D0 is the source, D2 the bit mask, D1 the destination)
       Alignment hints on NEON load/store
         NEON data load/store: VLDn/VSTn
           Full unaligned support for NEON data access
           Instruction contains ‘alignment hint’ which permits implementations to be faster when address is aligned and hint is specified
           Usage: base address specified as [<Rn>:<align>]
           Note it is a programming error to specify hint, but use incorrectly aligned address
         Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs
             VLD1.8 {D0}, [R1:64]
             VLD1.8 {D0,D1}, [R4:128]!
             VLD1.8 {D0,D1,D2,D3}, [R7:256], R2
         ARM ARM uses “@” but this is not recommended in source code
         GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
         Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
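The remark that the three multiplex instructions "overwrite different registers" is easier to see in a scalar model. The sketch below treats each D register as a `uint64_t`; the function names are hypothetical and each returns the new value of the destination register Dd.

```c
#include <assert.h>
#include <stdint.h>

/* VBSL Dd, Dn, Dm: Dd holds the mask on entry and is overwritten by the result.
   Takes bits from Dn where the mask is 1, from Dm where it is 0. */
static uint64_t vbsl(uint64_t d_mask, uint64_t n, uint64_t m) {
    return (n & d_mask) | (m & ~d_mask);
}

/* VBIT Dd, Dn, Dm (Insert if True): copy bits of Dn into Dd where mask Dm is 1. */
static uint64_t vbit(uint64_t d, uint64_t n, uint64_t m_mask) {
    return (n & m_mask) | (d & ~m_mask);
}

/* VBIF Dd, Dn, Dm (Insert if False): copy bits of Dn into Dd where mask Dm is 0. */
static uint64_t vbif(uint64_t d, uint64_t n, uint64_t m_mask) {
    return (d & m_mask) | (n & ~m_mask);
}

/* All three compute the same two-way select; they differ in which operand
   (mask, "true" input or "false" input) lives in the register that is overwritten. */
static void check_equivalence(uint64_t d, uint64_t n, uint64_t mask) {
    assert(vbit(d, n, mask) == vbsl(mask, n, d));
    assert(vbif(d, n, mask) == vbsl(mask, d, n));
}
```

Choosing among the three is therefore a register-allocation decision: pick the form whose overwritten operand is the value no longer needed afterwards.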
  • 12. Dual issue [Cortex-A8 only]
         NEON can dual issue NEON in the following circumstances
           No register operand/result dependencies
           NEON data processing (ALU) instruction
           NEON load/store or NEON byte permute instruction or MRC/MCR
           VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX
             VLD1.8 {D0}, [R1]!
             VMLAL.S8 Q2, D3, D2
             VEXT.8 D0, D1, D2, #1
             SUBS r12, r12, #1
         Also can dual-issue NEON with ARM instructions
             VLD1.8 {D0}, [R1]!
             SUBS r12, r12, #1
       Thank you!
         ARM Architecture has evolved with a balance of pure RISC and customer driven input
         NEON offers a clean architecture targeted at compiler code generation, offering
           Unaligned access
           Structure load/store operations
           Dual D-register/Q-register view to optimize register bank
           Balance of performance vs. gatecount
         Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications