Antwan Hätälä
Umbra 3 Lead programmer
Boosting your ARM
mobile 3D rendering
performance with Umbra 3
INDEX
• Who are we?
• Games
• What is Umbra 3 and occlusion culling
• Bringing our system to the PlayStation 4
• Experiences and benefits
• Lessons learned
UMBRA
SOFTWARE
Occlusion culling middleware
for 3D games
Founded in 2007
14 employees
Based in Helsinki, Finland
Support office in Seattle, WA
Same problem – Different solutions
Mo Money – Mo Problems
“Level artists are there to fill the
world with content. Integrating Umbra saved us
not only artist time but the time to create and
maintain an efficient visibility culling solution.
Umbra’s support provides us with the solutions
and features that we need.”
“Umbra’s technology is playing an important role
in the creation of our next universe, by freeing our artists
from the burden of manual markups typically associated
with polygon soup.”
Occlusion
culling basics
Occlusion Culling: Why bother?
• Process and render only what's visible
• Improved frame rate and rendering performance
• Allows you to put more detail into levels and create larger levels
What is Umbra?
Umbra 3 occluder rasterizer
• Determines visible objects fast to save further work on both the CPU and the GPU
• Rasterizes automatically generated proprietary occluder models on the CPU
• Operates in low resolution, generates conservative (dilated) results
• Rasterization is embarrassingly parallel in nature
  • Parallelize across CPU cores
  • Process multiple pixels/elements in SIMD
• Optimized for SSE, AltiVec, Cell and ARM NEON
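To make "low resolution, conservative results" concrete, here is a minimal sketch of a coarse depth buffer test. It is not Umbra's implementation: the `CoarseDepth` type, the 8x8 tile grid and all function names are assumptions made up for illustration. Each tile keeps the depth of the nearest occluder that fully covers it, so an object is reported hidden only when every tile it touches has a closer occluder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical coarse occlusion buffer: one depth value per screen tile.
 * Larger depth means farther from the camera. */
enum { TILES_X = 8, TILES_Y = 8 };

typedef struct {
    float depth[TILES_Y][TILES_X];
} CoarseDepth;

/* Reset every tile to "no occluder yet" (the far plane). */
static void coarse_clear(CoarseDepth *buf, float far_depth)
{
    for (int y = 0; y < TILES_Y; ++y)
        for (int x = 0; x < TILES_X; ++x)
            buf->depth[y][x] = far_depth;
}

/* Record an occluder that FULLY covers tiles [x0..x1] x [y0..y1].
 * Pass the occluder's farthest depth so the stored value never
 * overestimates how much it hides. */
static void coarse_add_occluder(CoarseDepth *buf, int x0, int y0,
                                int x1, int y1, float occluder_far_depth)
{
    assert(0 <= x0 && x0 <= x1 && x1 < TILES_X);
    assert(0 <= y0 && y0 <= y1 && y1 < TILES_Y);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (occluder_far_depth < buf->depth[y][x])
                buf->depth[y][x] = occluder_far_depth;
}

/* Test an object by the tiles its bounds touch and its nearest depth.
 * Returns true (visible) as soon as any tile fails to hide it. */
static bool coarse_is_visible(const CoarseDepth *buf, int x0, int y0,
                              int x1, int y1, float object_near_depth)
{
    assert(0 <= x0 && x0 <= x1 && x1 < TILES_X);
    assert(0 <= y0 && y0 <= y1 && y1 < TILES_Y);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (object_near_depth <= buf->depth[y][x])
                return true;
    return false;
}
```

The test can report a hidden object as visible (wasted work) but not the reverse, which is the direction of error a culling system can tolerate.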
NEON overview
• Processes multiple data elements (2 to 16) in a single instruction
• Separate execution pipeline: can execute in parallel with ARM instructions
• Separate register file: 16 128-bit registers (or 32 64-bit), holding single-precision floats or 8- to 64-bit integers
• Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9
  • For mobile 3D title purposes, it will be there
• Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies
  • For multi-platform, target A9 and enjoy free benefits on more advanced platforms
• Used in one of three ways
  • Inline assembly
  • Compiler intrinsics
  • Compiler auto-vectorization
• Similar to SSE and AltiVec, but for best performance you need to know your platform
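As an illustration of the intrinsics route, the sketch below adds two float arrays four lanes at a time with `vld1q_f32`, `vaddq_f32` and `vst1q_f32` from `<arm_neon.h>`. The function name `add_f32`, the `HAVE_NEON` macro and the scalar fallback are assumptions added so the same file also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* out[i] = a[i] + b[i] for i in [0, count); count must be a multiple of 4. */
static void add_f32(float *out, const float *a, const float *b, size_t count)
{
    assert(count % 4 == 0);
#if HAVE_NEON
    for (size_t i = 0; i < count; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));  /* 4 adds in one instruction */
    }
#else
    /* Scalar fallback with identical results. */
    for (size_t i = 0; i < count; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Intrinsics leave register allocation and scheduling to the compiler, so it is still worth checking the generated assembly.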
NEON common best practices
• Collaborate with the compiler, but keep an eye on the output
• Align your data when possible
• Inline functions that operate on SIMD values
• Use __restrict to let the compiler reorder memory accesses
• Watch for register spilling
• Schedule enough NEON work, even when it might be redundant
• Loading data from ARM registers into NEON is relatively cheap, storing it back to ARM registers is expensive
• Hide load/store latencies by interleaving with computation (unroll your loops)
• Never interleave VFP instructions with NEON
  • Means a pipeline flush, tens of cycles of penalty
  • Watch for "s" register use in compiler output
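Two of these points, the no-alias promise and unrolling to hide load latency, can be sketched in portable C. The function `scale_add` is a hypothetical example, and it uses the C99 `restrict` keyword, the standard spelling of the `__restrict` extension named on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] += src[i] * scale, unrolled by four.
 * restrict promises that dst and src never overlap, so the compiler may
 * hoist all loads above the stores instead of reloading after each write.
 * count must be a multiple of 4. */
static void scale_add(float *restrict dst, const float *restrict src,
                      float scale, size_t count)
{
    assert(count % 4 == 0);
    for (size_t i = 0; i < count; i += 4) {
        /* Issue the independent loads first ... */
        float s0 = src[i + 0], s1 = src[i + 1];
        float s2 = src[i + 2], s3 = src[i + 3];
        float d0 = dst[i + 0], d1 = dst[i + 1];
        float d2 = dst[i + 2], d3 = dst[i + 3];
        /* ... then compute and store, giving the loads time to complete. */
        dst[i + 0] = d0 + s0 * scale;
        dst[i + 1] = d1 + s1 * scale;
        dst[i + 2] = d2 + s2 * scale;
        dst[i + 3] = d3 + s3 * scale;
    }
}
```

Whether the compiler actually vectorizes or reorders this loop depends on the toolchain and flags, which is why the slide advises keeping an eye on the output.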
NEON optimization tricks
• No penalty from interleaving 2-wide ops with 4-wide ops
  • Cortex-A8/A9 process 64 bits of float operations per cycle
  • vget_high_xxx, vget_low_xxx to address quadword halves
• Narrow to 64 bits early
  • 16x4 and 8x8 are also 64 bits; for many operations 32 bits per channel are not needed
  • Even if the CPU can churn out 128 bits per cycle, there are savings to be had in result latency etc.
  • Use VMOVN, or an operation coupled with a narrow
• Careful with your constants
  • VMOV and VMVN can encode lots of useful constants
  • Compilers do a good job of constant encoding, but can't choose the constants for you
• Killer instructions
  • Shift-and-insert: VSRI, VSLI
  • Byte permute by table lookup: VTBL, VTBX
  • Gather load and scatter store: VLD2-4, VST2-4
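Shift-and-insert is the least familiar of these, so here is a scalar model of what VSRI and VSLI do to a single 32-bit lane. The helper names `vsri32_lane` and `vsli32_lane` are made up for this sketch; the real instructions apply the same operation to every lane of a vector at once.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of VSRI.32 dst, src, #n (1 <= n <= 32):
 * shift src right by n and insert it into dst, leaving the top n bits
 * of dst untouched. */
static uint32_t vsri32_lane(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n >= 1 && n <= 32);
    if (n == 32)
        return dst;                          /* nothing is inserted */
    uint32_t inserted = 0xFFFFFFFFu >> n;    /* bits that src overwrites */
    return (dst & ~inserted) | (src >> n);
}

/* One lane of VSLI.32 dst, src, #n (0 <= n <= 31):
 * shift src left by n and insert it into dst, leaving the low n bits
 * of dst untouched. */
static uint32_t vsli32_lane(uint32_t dst, uint32_t src, unsigned n)
{
    assert(n <= 31);
    uint32_t inserted = 0xFFFFFFFFu << n;    /* bits that src overwrites */
    return (dst & ~inserted) | (src << n);
}
```

A single instruction therefore replaces a shift, a mask and an OR, which is what makes it useful for packing bit fields from several registers into one.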
NEON optimization example
• Example routine: gather the sign bits of a large array of float values
function gather_signbits(flt_array):
    let output_bitmap = bitmap of size len(flt_array)
    foreach elem in flt_array at index idx:
        if (elem < 0)
            set_bit(output_bitmap, idx)
        else
            clear_bit(output_bitmap, idx)
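A direct C rendering of the pseudocode gives a scalar reference to check optimized versions against. The byte-array bitmap and the least-significant-bit-first order within each byte are assumptions of this sketch; the slide does not specify a layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: bit idx of the bitmap is set when values[idx] < 0.
 * bitmap must hold at least (count + 7) / 8 bytes.
 * Assumed layout: bit idx lives in byte idx / 8 at position idx % 8. */
static void gather_signbits(const float *values, size_t count,
                            uint8_t *bitmap)
{
    assert(values != NULL && bitmap != NULL);
    memset(bitmap, 0, (count + 7) / 8);      /* start with every bit clear */
    for (size_t idx = 0; idx < count; ++idx) {
        if (values[idx] < 0.0f)
            bitmap[idx / 8] |= (uint8_t)(1u << (idx % 8));
    }
}
```

One branch and one read-modify-write per element is what the NEON versions aim to eliminate.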
NEON optimization example: first attempt
• Sufficient unrolling: handle 16 elements in one iteration
• Compare 4 values per instruction
• Bitwise AND for correct bit offsets
• Combine with vertical OR, then collapse with pairwise add
20: add.w r2, r0, #32
24: vld1.64 {d28-d29}, [r0 :128]
28: vld1.64 {d24-d25}, [r2 :128]
2c: add.w r2, r0, #16
30: vclt.f32 q14, q14, #0
34: vld1.64 {d26-d27}, [r2 :128]
38: add.w r2, r0, #48 ; 0x30
3c: vclt.f32 q12, q12, #0
40: vand q14, q8, q14
44: vld1.64 {d30-d31}, [r2 :128]
48: vclt.f32 q13, q13, #0
4c: vand q13, q11, q13
50: vclt.f32 q15, q15, #0
54: vand q12, q10, q12
58: vand q15, q9, q15
5c: vorr q13, q14, q13
60: vorr q12, q12, q15
64: vorr q12, q13, q12
68: vpadd.i32 d24, d24, d25
6c: vpadd.i32 d24, d24, d24
70: vst1.32 {d24[0]}, [r0 :32], r1
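The listing can be modeled lane by lane in portable C to see what one 16-element iteration computes. The bit constants held in q8 to q11 are not shown in the listing, so this sketch assumes that register k, lane j carries the single bit `1 << (4*k + j)`; the function name `signbits16_compare_and` is also made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates one iteration of the first-attempt loop over 16 floats.
 * Assumption: the AND constant for register k, lane j is 1 << (4*k + j). */
static uint32_t signbits16_compare_and(const float v[16])
{
    uint32_t lanes[4] = {0, 0, 0, 0};
    for (int k = 0; k < 4; ++k) {            /* four quadword loads */
        for (int j = 0; j < 4; ++j) {        /* four 32-bit lanes each */
            /* VCLT.F32 #0: all ones when the value is negative. */
            uint32_t cmp = (v[4 * k + j] < 0.0f) ? 0xFFFFFFFFu : 0u;
            uint32_t bit = 1u << (4 * k + j);
            /* VAND picks the lane's bit, VORR merges the registers. */
            lanes[j] |= cmp & bit;
        }
    }
    /* Two VPADD.I32 steps sum the four lanes. The bits are disjoint,
     * so the sum equals the bitwise OR. */
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Under that assumption, element i lands in bit i of the result, which matches the natural bitmap order at the cost of four 128-bit constants.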
NEON optimization example: shift-and-insert, narrow early
• Compare with zero = shift the sign bit
• Can shift and combine simultaneously with the VSRI instruction
• Narrow to 16 bits (VMOVN) before proceeding further
  • Half the number of constants
18: vld1.64 {d18-d19}, [r0 :128]
1c: add.w r3, r0, #16
20: adds r1, #4
22: vshr.u32 q9, q9, #19
26: vld1.64 {d20-d21}, [r3 :128]
2a: add.w r3, r0, #32
2e: vsri.32 q9, q10, #23
32: vld1.64 {d20-d21}, [r3 :128]
36: add.w r3, r0, #48 ; 0x30
3a: vsri.32 q9, q10, #27
3e: vld1.64 {d20-d21}, [r3 :128]
42: vsri.32 q9, q10, #31
46: vmovn.i32 d18, q9
4a: vand d18, d18, d16
4e: vshl.u16 d18, d18, d17
52: vpaddl.u16 d18, d18
56: vpadd.i32 d18, d18, d18
5a: vst1.32 {d18[0]}, [r0 :32], r2
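The same kind of lane-by-lane model helps in following the shift amounts in this listing. The contents of d16 and d17 are not shown, so the sketch assumes d16 holds `0x1111` in every 16-bit lane and d17 holds the per-lane shift counts 0, 1, 2, 3; the helper names are hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw IEEE-754 bits of a float; the sign is bit 31. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* One lane of VSRI.32 for 1 <= n <= 31: keep the top n bits of dst and
 * insert src >> n below them. */
static uint32_t vsri32(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t inserted = 0xFFFFFFFFu >> n;
    return (dst & ~inserted) | (src >> n);
}

/* Emulates one iteration of the shift-and-insert loop over 16 floats.
 * Assumptions: d16 = 0x1111 per lane, d17 = {0, 1, 2, 3}. */
static uint32_t signbits16_shift_insert(const float v[16])
{
    uint32_t sum = 0;
    for (int j = 0; j < 4; ++j) {                        /* 32-bit lane j */
        uint32_t acc = float_bits(v[j]) >> 19;           /* VSHR: sign -> bit 12 */
        acc = vsri32(acc, float_bits(v[4 + j]), 23);     /* sign -> bit 8 */
        acc = vsri32(acc, float_bits(v[8 + j]), 27);     /* sign -> bit 4 */
        acc = vsri32(acc, float_bits(v[12 + j]), 31);    /* sign -> bit 0 */
        uint16_t narrow = (uint16_t)acc;                 /* VMOVN.I32 */
        narrow &= 0x1111u;                               /* VAND: drop exponent bits */
        narrow = (uint16_t)(narrow << j);                /* VSHL by lane index */
        sum += narrow;                                   /* VPADDL + VPADD */
    }
    return sum;
}
```

Read this way, element 4*k + j ends up in bit (12 - 4*k) + j, so the four groups of four appear in reverse nibble order compared with a plain index-to-bit mapping. Because the sign bit is copied directly, -0.0f also sets its bit, which a `< 0` comparison would not.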
Thank you.
For more on Umbra 3, go to:
umbra3.com
antti@umbrasoftware.com
Follow us on Twitter @umbrasoftware

React Native vs Ionic - The Best Mobile App Framework
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
4. Cobus Valentine- Cybersecurity Threats and Solutions for the Public Sector
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 

GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra

  • 1. Antwan Hätälä Umbra 3 Lead programmer Boosting your ARM mobile 3D rendering performance with Umbra 3
  • 2. INDEX • Who are we? • Games • What is Umbra 3 and occlusion culling • bringing our system to the PlayStation 4 • experiences and benefits • lessons learned
  • 3. UMBRA SOFTWARE Occlusion culling middleware for 3D games Founded in 2007 14 employees Based in Helsinki, Finland Support office in Seattle, WA Same problem – Different solutions Mo Money – Mo Problems “Level artists are there to fill the world with content. Integrating Umbra saved us not only artist time but the time to create and maintain an efficient visibility culling solution. Umbra’s support provides us with the solutions and features that we need.” “Umbra’s technology is playing an important role in the creation of our next universe, by freeing our artists from the burden of manual markups typically associated with polygon soup.”
  • 5. Occlusion Culling: Why bother? • Process and render only what's visible • Improved frame rate and rendering performance • Allows you to put more detail into levels and create larger levels
  • 7. Umbra 3 occluder rasterizer • Determines visible objects fast to save further work both on CPU and GPU • Rasterizes automatically generated proprietary occluder models on CPU • Operates in low resolution, generates conservative (dilated) results • Rasterization is embarrassingly parallel in nature • Parallelize across CPU cores • Process multiple pixels/elements in SIMD • Optimized for SSE, Altivec, Cell and ARM NEON
  • 8. NEON overview • Processing of multiple data elements (2 to 16) in a single instruction • Separate execution pipeline: can execute in parallel with ARM • Separate register file: 16 128-bit regs (or 32 64-bit), SP floats or 8-64 bit integers • Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9 • For mobile 3D title purposes, it will be there • Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies • For multi-platform, target A9 and enjoy free benefits on more advanced platforms • Used in one of three ways: inline assembly, compiler intrinsics, compiler auto-vectorization • Similar to SSE, Altivec but for best performance you need to know your platform
  • 9. NEON common best practices • Collaborate with the compiler, but keep an eye on the output • Align your data when possible • Inline functions that operate on SIMD values • Use __restrict to let the compiler reorder • Watch for register spilling • Schedule enough NEON work, even when it might be redundant • Loading data from ARM registers is relatively cheap, storing back is expensive • Hide load/store latencies by interleaving with computation (unroll your loops) • Never interleave VFP instructions with NEON • Means pipeline flush, tens of cycles of penalty • Watch for "s" register use in compiler output
  • 10. NEON optimization tricks • No penalty from interleaving 2-wide ops with 4-wide ops • Cortex-A8/A9 does 64-bit float operations per cycle • vget_high_xxx, vget_low_xxx to address quadword halves • Narrow to 64 bits early • 16x4 and 8x8 are also 64 bits, for many operations 32 bits per channel not needed • Even if CPU can churn out 128 bits per cycle, savings to be had in result latency etc. • Use VMOVN or coupled operation and narrow • Careful with your constants • VMOV and VMVN can encode lots of useful constants • Compilers do a good job of constant encoding, but can’t choose the constants for you • Killer instructions • Shift-and-insert: VSRI, VSLI • Byte permute by table lookup: VTBL, VTBX • Gather load and scatter store: VLD2-4, VST2-4
  • 11. NEON optimization example • Example routine: gather sign bits of a large array of float values
        function gather_signbits(flt_array):
            let output_bitmap = bitmap of size len(flt_array)
            foreach elem in flt_array at index idx:
                if (elem < 0)
                    set_bit(output_bitmap, idx)
                else
                    clear_bit(output_bitmap, idx)
  • 12. NEON optimization example: first attempt • Sufficient unrolling: handle 16 elements in one iteration • Compare 4 values per instruction • Bitwise AND for correct bit offsets • Collapse with vertical OR (pairwise add)
        add.w      r2, r0, #32
        vld1.64    {d28-d29}, [r0 :128]
        vld1.64    {d24-d25}, [r2 :128]
        add.w      r2, r0, #16
        vclt.f32   q14, q14, #0
        vld1.64    {d26-d27}, [r2 :128]
        add.w      r2, r0, #48 ; 0x30
        vclt.f32   q12, q12, #0
        vand       q14, q8, q14
        vld1.64    {d30-d31}, [r2 :128]
        vclt.f32   q13, q13, #0
        vand       q13, q11, q13
        vclt.f32   q15, q15, #0
        vand       q12, q10, q12
        vand       q15, q9, q15
        vorr       q13, q14, q13
        vorr       q12, q12, q15
        vorr       q12, q13, q12
        vpadd.i32  d24, d24, d25
        vpadd.i32  d24, d24, d24
        vst1.32    {d24[0]}, [r0 :32], r1
  • 13. NEON optimization example: shift-and-insert, narrow early • Compare with zero = shift sign bit • Can shift and combine simultaneously with VSRI instruction • Narrow to 16 bits (VMOVN) before proceeding further • Half the amount of constants
        vld1.64    {d18-d19}, [r0 :128]
        add.w      r3, r0, #16
        adds       r1, #4
        vshr.u32   q9, q9, #19
        vld1.64    {d20-d21}, [r3 :128]
        add.w      r3, r0, #32
        vsri.32    q9, q10, #23
        vld1.64    {d20-d21}, [r3 :128]
        add.w      r3, r0, #48 ; 0x30
        vsri.32    q9, q10, #27
        vld1.64    {d20-d21}, [r3 :128]
        vsri.32    q9, q10, #31
        vmovn.i32  d18, q9
        vand       d18, d18, d16
        vshl.u16   d18, d18, d17
        vpaddl.u16 d18, d18
        vpadd.i32  d18, d18, d18
        vst1.32    {d18[0]}, [r0 :32], r2
  • 14. Thank you. For more on Umbra 3, go to: umbra3.com antti@umbrasoftware.com Follow us on Twitter @umbrasoftware
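
The sign-bit routine from slides 11 to 13 can be sketched in portable C. This is a minimal model, not Umbra's code: `sri32` emulates a single lane of VSRI.32, the 0x1111 mask and the per-lane shifts 0..3 are assumed values for the constants the listing keeps in d16 and d17, and the four loads are taken in reverse order so that bit k of each 16-bit output word corresponds to element k. All function names are hypothetical. The sketch also assumes IEEE-754 single-precision floats and an element count that is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's storage as a 32-bit integer (assumes IEEE-754 binary32). */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Scalar reference following the slide's pseudocode: bit idx is set iff elem < 0.
 * One 16-bit output word is written per 16 input floats. */
static void gather_signbits_ref(const float *in, size_t n, uint16_t *out) {
    assert(n % 16 == 0);
    for (size_t base = 0; base < n; base += 16) {
        uint16_t bits = 0;
        for (size_t k = 0; k < 16; ++k)
            if (in[base + k] < 0.0f)
                bits |= (uint16_t)(1u << k);
        out[base / 16] = bits;
    }
}

/* Model of VSRI.32 on one lane: shift src right by n and insert it into dst,
 * keeping the top n bits of dst. Valid here for 1 <= n <= 31. */
static uint32_t sri32(uint32_t dst, uint32_t src, unsigned n) {
    uint32_t inserted = 0xFFFFFFFFu >> n; /* bits overwritten by the insert */
    return (dst & ~inserted) | (src >> n);
}

/* Lane-by-lane model of the shift-and-insert, narrow-early variant. */
static void gather_signbits_sri(const float *in, size_t n, uint16_t *out) {
    assert(n % 16 == 0);
    for (size_t base = 0; base < n; base += 16) {
        const float *p = in + base;
        uint32_t sum = 0;
        for (unsigned lane = 0; lane < 4; ++lane) {
            /* VSHR #19 moves the sign bit (bit 31) down to bit 12. */
            uint32_t acc = float_bits(p[12 + lane]) >> 19;
            /* Each VSRI drops another sign bit 4 positions lower. */
            acc = sri32(acc, float_bits(p[8 + lane]), 23); /* sign -> bit 8 */
            acc = sri32(acc, float_bits(p[4 + lane]), 27); /* sign -> bit 4 */
            acc = sri32(acc, float_bits(p[lane]), 31);     /* sign -> bit 0 */
            /* VMOVN: keep the low 16 bits; VAND: drop the junk between signs. */
            uint16_t narrow = (uint16_t)acc;
            uint32_t masked = narrow & 0x1111u;
            /* VSHL by the lane index, then the pairwise adds collapse lanes. */
            sum += masked << lane;
        }
        out[base / 16] = (uint16_t)sum;
    }
}
```

Both functions take the input array, its length, and an output array of `n / 16` words. They agree on ordinary values, but the sign-bit version differs from the `elem < 0` reference on -0.0f and on NaNs with the sign bit set, so "compare with zero = shift sign bit" holds only when such inputs do not occur or the difference does not matter.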

Editor's Notes

  1. Hello everybody! My name is Antti Hätälä, I am the tech lead at Umbra Software. Thomas Puha, developer relations manager. Thank you all for coming. I am here to talk about what we have accomplished with the Umbra 3 visibility system, how the technology is being used to power some very exciting titles, and how it can help your title as well.
  2. A little bit of background. (soap story) Umbra Software is an independent team of computer graphics geeks based in Helsinki, Finland. We have kept going at it since 2007, and individually even before that. Throughout the years we’ve been attacking the problem from various angles. Permanent presence in the US.
  3. The same goes for images: only process what you see! Doing this allows you to add more detail to the visible part of the world in the same frame budget! In short, better looking games that run faster. Not all 3D games or environments will significantly benefit from occlusion culling. Examples include games with top-down views, mostly transparent elements, or stationary cameras.
  4. NEON is available in Apple processors from iPhone 4 onwards. The Android armv7 target requires NEON.