5. Established Common Sense
“90% of execution time is spent in 10% of the code”
“Programmers are really bad at knowing what really needs
optimizing, so they should be guided by profiling”
“Why should I bother optimizing X? It only runs N times per
frame”
“Processors are really fast these days so it doesn’t matter”
“The compiler can optimize that better than I can”
6. Alternative View
The compiler isn’t always as smart as you think it is.
Really bad things can happen in the most innocent looking code
because of huge penalties inherent in the architecture.
A generally sub-optimal code base will nickel-and-dime you for a
big chunk of your frame rate.
It’s easier to write more efficient code up front than to go and find
it all later with a profiler.
7. PPU Hardware Threads
We have two of them running on alternate cycles
If one stalls, other thread runs
Not multi-core:
Shared execution units
Shared access to memory and cache
Most registers duplicated
Ideal usage:
Threads filling in stalls for each other without thrashing cache
8. PS3 Cache
Level 1 Instruction Cache
32 KB 2-way set associative, 128 byte cache line
Level 1 Data Cache
32 KB 4-way set associative, 128 byte cache line
Write-through to L2
Level 2 Data and Instruction Cache
512 KB 8-way set associative
Write-back
128 byte cache line
9. Cache Miss Penalties
L1 cache miss = 40 cycles
L2 cache miss = ~1000 cycles!
In other words, random reads from memory are excruciatingly
expensive!
Reading data with large strides – very bad
Consider smaller data structures, or group data that will be read
together
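A minimal sketch of the “group data that will be read together” advice, with hypothetical entity fields: split the per-frame (“hot”) fields from the rarely touched (“cold”) ones so a linear update walks small, densely packed structures instead of striding over cold data.

```cpp
#include <cassert>

// Hypothetical game entity, split so the fields touched every frame
// (position, velocity) sit in one small hot struct. At 24 bytes each,
// five hot entities fit in one 128-byte cache line, so a linear update
// loop misses far less often than it would striding over a big struct.
struct EntityHot {
    float posX, posY, posZ;
    float velX, velY, velZ;
};

// Rarely read data lives elsewhere and never pollutes the hot lines.
struct EntityCold {
    char  name[64];
    int   spawnId;
    float health;
};

// With the hot data contiguous, this loop streams through memory
// with a small, constant stride.
void UpdatePositions(EntityHot* ents, int count, float dt) {
    for (int i = 0; i < count; ++i) {
        ents[i].posX += ents[i].velX * dt;
        ents[i].posY += ents[i].velY * dt;
        ents[i].posZ += ents[i].velZ * dt;
    }
}
```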
10. Virtual Functions
What happens when you call a virtual function?
What does this code do?
virtual void Update() {}
May touch cache line at vtable address unnecessarily
Consider batching by type for better iCache pattern
If you know the type, maybe you don’t need the virtual – save
touching memory to read the function address
Even better – maybe the data you actually want to manipulate
can be kept close together in memory?
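A sketch of the batching idea, with a made-up Particle type: when you know every object in a batch is the same concrete type, a plain non-virtual call skips the vtable load entirely, and one tight loop per type keeps both the iCache and dCache access patterns predictable.

```cpp
#include <cassert>
#include <vector>

// Hypothetical type: because Update is non-virtual, the call address
// is known at compile time -- no vtable pointer is fetched, and no
// cache line at the vtable address is touched.
struct Particle {
    float y, vy;
    void Update(float dt) { y += vy * dt; }
};

// Batched by type: one homogeneous array, one tight loop. The same
// small piece of code runs over contiguous data, rather than a
// heterogeneous list dispatching through different vtables per element.
void UpdateParticles(std::vector<Particle>& batch, float dt) {
    for (Particle& p : batch) p.Update(dt);
}
```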
12. Spot the Stall
int SlowFunction(int & a, int & b)
{
a = 1;
b = 2;
return a + b;
}
13. Method 1
Q: When will this work and when won’t it?
inline int FastFunction(int & a, int & b)
{
a = 1;
b = 2;
return a + b;
}
14. Method 2
int FastFunction(int * __restrict a,
int * __restrict b)
{
*a = 1;
*b = 2;
return *a + *b; // we promise that a != b
}
15. __restrict Keyword
__restrict only works with pointers, not references (which
sucks).
Aliasing only applies to identical types.
Can be applied to implicit this pointer in member functions.
Put it after the closing parenthesis of the parameter list.
Stops compiler worrying that you passed a class data member
to the function.
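A small sketch of `__restrict` on the implicit this pointer, using the GCC-style placement described above (the struct and member names are invented). The qualifier promises the compiler that `dst` never aliases this object's data members, so it can keep `m_Value` in a register across the store.

```cpp
#include <cassert>

struct Blender {
    int m_Value;
    // The trailing __restrict qualifies the implicit 'this' pointer:
    // we promise 'dst' never points into this object, so the compiler
    // need not reload m_Value after the first store through dst.
    void AddTo(int* __restrict dst) __restrict {
        *dst += m_Value;
        *dst += m_Value;
    }
};
```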
16. Load-Hit-Store
What is it?
Write followed by read
PPU store queue
Average latency 40 cycles
Snoop bits 52 through 61 (implications?)
17. True LHS
Type casting between register files:
float floatValue = (float)intValue;
float posY = vPosition.X();
Data member as loop counter
while( m_Counter-- ) {}
Aliasing:
void swap( int & a, int & b ) { int t = a; a = b; b = t; }
18. Workaround: Data member as loop counter
int counter = m_Counter; // load into register
while( counter-- ) // register will decrement
{
doSomething();
}
m_Counter = 0; // Store to memory just once
19. Workarounds
Keep data in the same domain as much as possible
Reorder your writes and reads to allow space for latency
Consider using word flags instead of many packed bools
Load flags word into register
Perform logical bitwise operations on flags in register
Store new flags value
e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) |
newCanSeeFlag | kAngry;
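The steps above can be sketched as a compilable fragment (the flag names are invented for illustration): one aligned 32-bit load, bitwise ops entirely in a register, then a single store, instead of several sub-word packed-bool reads and writes that invite load-hit-store stalls.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits packed into one word.
enum : uint32_t {
    kCanSeePlayer = 1u << 0,
    kAngry        = 1u << 1,
    kReloading    = 1u << 2,
};

// Load the flags word once, do all the logic in a register,
// and let the caller store the result back exactly once.
uint32_t UpdateFlags(uint32_t initialFlags, bool canSeePlayer) {
    uint32_t newCanSeeFlag = canSeePlayer ? kCanSeePlayer : 0u;
    return (initialFlags & ~kCanSeePlayer) | newCanSeeFlag | kAngry;
}
```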
20. False LHS Case
Store queue snooping only compares bits 52 through 61
So false LHS occurs if you read from address while different item
in queue matches addr & 0xFFC.
Writing to and reading from memory 4KB apart
Writing to and reading from memory on sub-word boundary, e.g.
packed short, byte, bool
Write a bool then read a nearby one -> ~40 cycle stall
26. Branch Hints
Branch mis-prediction can hurt your performance
24-cycle penalty to flush the instruction queue
If you know a certain branch is rarely taken, you can use a static
branch hint, e.g.
if ( __builtin_expect( bResult, 0 ) )
Far better to eliminate the branch!
Use logical bitwise operations to mask results
This is far easier and more applicable in SIMD code
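A scalar sketch of the masking idea (the function name is made up): build an all-ones or all-zeros mask from the comparison, then blend the two candidates with bitwise ops. There is no conditional jump anywhere, so there is nothing to mispredict; the same pattern maps directly onto SIMD compare-and-select instructions.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free select: returns a if x < y, else b.
// -(int32_t)(x < y) is all ones when the comparison is true and
// all zeros when it is false, so the blend picks one operand whole.
int32_t SelectIfLess(int32_t x, int32_t y, int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(x < y);
    return (a & mask) | (b & ~mask);
}
```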
27. Floating Point Branch Elimination
__fsel, __fsels – Floating point select
float min( float a, float b )
{
return ( a < b ) ? a : b;
}
float min( float a, float b )
{
return __fsels( a - b, b, a );
}
28. Microcoded Instructions
Single instruction -> several, fetched from ROM, pipeline bubble
Common example of one to avoid: shift by a variable amount.
int a = b << c;
Minimum 11 cycle latency.
If you know range of values, can be better to switch to a fixed shift!
switch( c )
{
case 1: a = b << 1; break;
case 2: a = b << 2; break;
case 3: a = b << 3; break;
default: break;
}
29. Loop Unrolling
Why unroll loops?
Fewer branches
Better concurrency, instruction pipelining, hide latency
More opportunities for compiler to optimise
Only works if code can actually be interleaved, e.g. inline functions, no
inter-iteration dependencies
How many times is enough?
On average about 4 – 6 times works well
Best for loops where num iterations is known up front
Need to think about the spare: iterationCount % unrollCount
30. Picking up the Spare
If you can, artificially pad your data with safe values to keep as multiple of
unroll count. In this example, you might process up to 3 dummy items in
the worst case.
for ( int i = 0; i < numElements; i += 4 )
{
InlineTransformation( pElements[ i+0 ] );
InlineTransformation( pElements[ i+1 ] );
InlineTransformation( pElements[ i+2 ] );
InlineTransformation( pElements[ i+3 ] );
}
31. Picking up the Spare
If you can’t pad, run for numElements & ~3 instead and run to
completion in a second loop.
for ( ; i < numElements; ++i )
{
InlineTransformation( pElements[ i ] );
}
32. Picking up the Spare
Alternative method (trade-off: only one branch, but longer code generated).
switch( numElements & 3 )
{
case 3: InlineTransformation( pElements[ i+2 ] );
case 2: InlineTransformation( pElements[ i+1 ] );
case 1: InlineTransformation( pElements[ i+0 ] );
case 0: break;
}
33. Loop unrolled… now what?
If you unrolled your loop 4 times, you might be able to use SIMD
Use AltiVec intrinsics – align your data to 16 bytes
128 bit registers - operate on 4 32-bit values in parallel
Most SIMD instructions have 1 cycle throughput
Consider using SOA data instead of AOS
AOS: Arrays of interleaved posX, posY, posZ structures
SOA: A structure holding one array per field, each array sized for all elements
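The two layouts side by side, as a minimal sketch (names and the element count are illustrative). The scalar loop stands in for a SIMD one: with SOA, `x[i..i+3]` are contiguous, so a real AltiVec version could load four of them into one 128-bit register with a single aligned load.

```cpp
#include <cassert>

// AOS: each element interleaves x, y, z. A SIMD add of four Xs
// would need shuffles to gather them out of three cache lines' worth
// of interleaved data.
struct PosAOS { float x, y, z; };

constexpr int kCount = 8;  // illustrative capacity

// SOA: one contiguous array per field. Four consecutive Xs sit
// next to each other, ready for a vector load.
struct PositionsSOA {
    float x[kCount];
    float y[kCount];
    float z[kCount];
};

// Scalar stand-in for the vectorized loop: note it touches only the
// x array and never pulls y or z into cache.
void TranslateX(PositionsSOA& p, float dx) {
    for (int i = 0; i < kCount; ++i) p.x[i] += dx;
}
```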