Engineer Learning Session #1
Optimisation Tips for PowerPC

Ben Hanke
April 16th 2009
Playstation 3 PPU
Optimized Playstation 3 PPU
Consoles with PowerPC
Established Common Sense
   - “90% of execution time is spent in 10% of the code”
   - “Programmers are really bad at knowing what really needs optimizing so should be guided by profiling”
   - “Why should I bother optimizing X? It only runs N times per frame”
   - “Processors are really fast these days so it doesn’t matter”
   - “The compiler can optimize that better than I can”
Alternative View
   - The compiler isn’t always as smart as you think it is.
   - Really bad things can happen in the most innocent-looking code because of huge penalties inherent in the architecture.
   - A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.
   - It’s easier to write more efficient code up front than to go and find it all later with a profiler.
PPU Hardware Threads
   - We have two of them, running on alternate cycles
   - If one stalls, the other thread runs
   - Not multi-core:
      - Shared execution units
      - Shared access to memory and cache
      - Most registers duplicated
   - Ideal usage:
      - Threads filling in stalls for each other without thrashing the cache
PS3 Cache
   - Level 1 Instruction Cache
      - 32 KB, 2-way set associative, 128-byte cache line
   - Level 1 Data Cache
      - 32 KB, 4-way set associative, 128-byte cache line
      - Write-through to L2
   - Level 2 Data and Instruction Cache
      - 512 KB, 8-way set associative
      - Write-back
      - 128-byte cache line
Cache Miss Penalties
   - L1 cache miss = 40 cycles
   - L2 cache miss = ~1000 cycles!
   - In other words, random reads from memory are excruciatingly expensive!
   - Reading data with large strides is very bad: each read can land on a different cache line
   - Consider smaller data structures, or group data that will be read together
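To make the stride point concrete, the sketch below counts how many 128-byte cache lines a loop touches when it reads one small field from each element. The helper name and the element sizes are illustrative assumptions, and the model ignores prefetching and assumes the array starts on a cache-line boundary.

```cpp
#include <cassert>
#include <cstddef>

// Cache line size taken from the PS3 Cache slide.
constexpr std::size_t kCacheLineBytes = 128;

// Counts the distinct cache lines touched when reading one small field at the
// start of each of `count` elements laid out `strideBytes` apart.
// Hypothetical helper: assumes a line-aligned array and a field that does not
// straddle two lines.
std::size_t CacheLinesTouched(std::size_t count, std::size_t strideBytes)
{
    std::size_t lines = 0;
    std::size_t lastLine = static_cast<std::size_t>(-1); // "no line yet"
    for (std::size_t i = 0; i < count; ++i)
    {
        const std::size_t line = (i * strideBytes) / kCacheLineBytes;
        if (line != lastLine)
        {
            ++lines;
            lastLine = line;
        }
    }
    return lines;
}

void CheckStrides()
{
    assert(CacheLinesTouched(1024, 4) == 32);     // densely packed 4-byte values
    assert(CacheLinesTouched(1024, 128) == 1024); // one new line per element
}
```

For 1024 elements, a dense array of 4-byte values touches 32 lines, while a 128-byte element stride touches 1024. At the miss costs quoted above, that difference can dominate the loop.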
Virtual Functions
   - What happens when you call a virtual function?
   - What does this code do?
         virtual void Update() {}
   - May touch the cache line at the vtable address unnecessarily
   - Consider batching by type for a better iCache pattern
   - If you know the type, maybe you don’t need the virtual: save touching memory to read the function address
   - Even better: maybe the data you actually want to manipulate can be kept close together in memory?
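One way to read the batching advice is to keep each concrete type in its own contiguous array and call a non-virtual Update on it. The sketch below is an assumption about how that might look; Bullet, Spark and World are hypothetical names, not types from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical object types with non-virtual Update functions.
struct Bullet { float x; float vx; void Update(float dt) { x += vx * dt; } };
struct Spark  { float y; float vy; void Update(float dt) { y += vy * dt; } };

struct World
{
    // One contiguous array per type: the data for each batch sits together,
    // and each loop runs the same code path for every element.
    std::vector<Bullet> bullets;
    std::vector<Spark>  sparks;

    void UpdateAll(float dt)
    {
        for (std::size_t i = 0; i < bullets.size(); ++i) bullets[i].Update(dt);
        for (std::size_t i = 0; i < sparks.size(); ++i)  sparks[i].Update(dt);
    }
};

void CheckWorld()
{
    World world;
    const Bullet bullet = { 0.0f, 2.0f };
    const Spark spark = { 1.0f, -2.0f };
    world.bullets.push_back(bullet);
    world.sparks.push_back(spark);
    world.UpdateAll(0.5f);
    assert(world.bullets[0].x == 1.0f);
    assert(world.sparks[0].y == 0.0f);
}
```

Because the type is known statically in each loop, no vtable lookup is needed, and the compiler is free to inline Update.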
Data Hazards Ahead
Spot the Stall


int SlowFunction(int & a, int & b)
{
   a = 1;
   b = 2;
   return a + b;
}
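A short sketch of why the compiler cannot simply return 3 here: a and b may refer to the same int, so after storing to b it has to read a back from memory. The check function is an illustrative addition, not part of the slides.

```cpp
#include <cassert>

int SlowFunction(int& a, int& b)
{
    a = 1;
    b = 2;
    return a + b; // a must be re-read: the store to b may have changed it
}

void CheckAliasing()
{
    int x = 0;
    int y = 0;
    assert(SlowFunction(x, y) == 3); // distinct objects: 1 + 2

    int z = 0;
    assert(SlowFunction(z, z) == 4); // a and b alias: both read back as 2
}
```

The read that immediately follows a store to the same address is the pattern the later Load-Hit-Store slides describe.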
Method 1


Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)
{
   a = 1;
   b = 2;
   return a + b;
}
Method 2


int FastFunction(int * __restrict a,
                 int * __restrict b)
{
   *a = 1;
   *b = 2;
   return *a + *b; // we promise that a != b
}
__restrict Keyword

   - __restrict only works with pointers, not references (which sucks).
   - Under strict aliasing rules, the compiler only assumes aliasing between identical types.
   - Can be applied to the implicit this pointer in member functions.
      - Put it after the closing parenthesis of the parameter list.
      - Stops the compiler worrying that you passed a class data member to the function.
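As a minimal sketch of the member-function case, assuming a GCC-compatible compiler where the qualifier follows the member function's parameter list: Accumulator and AddAll are hypothetical names used only for illustration.

```cpp
#include <cassert>

struct Accumulator
{
    int m_Total;

    // Compiler extension (GCC-style syntax, assumed here): the qualifier
    // applies to `this`, promising that `values` never points at a member
    // of this object.
    void AddAll(const int* values, int count) __restrict
    {
        for (int i = 0; i < count; ++i)
        {
            m_Total += values[i]; // compiler may keep m_Total in a register
        }
    }
};

void CheckAccumulator()
{
    Accumulator acc = { 0 };
    const int values[3] = { 1, 2, 3 };
    acc.AddAll(values, 3);
    assert(acc.m_Total == 6);
}
```

Without the qualifier, values could legally point at m_Total (both are int), so the compiler would have to store and reload the member on every iteration.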
Load-Hit-Store

   What is it?

   Write followed by read
   PPU store queue

   Average latency 40 cycles

   Snoop bits 52 through 61 (implications?)
True LHS
   - Type casting between register files (the value is stored to memory and reloaded):
         float floatValue = (float)intValue;
         float posY = vPosition.X();

   - Data member as loop counter:
         while( m_Counter-- ) {}

   - Aliasing:
         void swap( int & a, int & b ) { int t = a; a = b; b = t; }
Workaround: Data member as loop counter

int counter = m_Counter; // load into register
while( counter-- ) // register will decrement
{
    doSomething();
}
m_Counter = 0; // Store to memory just once
Workarounds
   - Keep data in the same domain (integer, float or vector registers) as much as possible
   - Reorder your writes and reads to allow space for latency
   - Consider using word flags instead of many packed bools
      - Load flags word into register
      - Perform logical bitwise operations on flags in register
      - Store new flags value

    e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) | newcanSeeFlag | kAngry;
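The flags-word expression above can be fleshed out as follows. The bit assignments and the UpdateFlags helper are assumptions made for illustration; only the names kCanSeePlayer and kAngry come from the slide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit layout for the flags word.
constexpr std::uint32_t kAlive        = 1u << 0;
constexpr std::uint32_t kCanSeePlayer = 1u << 1;
constexpr std::uint32_t kAngry        = 1u << 2;
constexpr std::uint32_t kWearingHat   = 1u << 3;

// Works entirely on values held in registers: the caller loads the flags
// word once, calls this, and stores the result once.
inline std::uint32_t UpdateFlags(std::uint32_t initialFlags, bool canSeePlayer)
{
    const std::uint32_t newCanSeeFlag = canSeePlayer ? kCanSeePlayer : 0u;
    // Clear the old "can see" bit, set the new one, and mark as angry,
    // mirroring the expression on the slide.
    return (initialFlags & ~kCanSeePlayer) | newCanSeeFlag | kAngry;
}

void CheckUpdateFlags()
{
    assert(UpdateFlags(kAlive | kCanSeePlayer, false) == (kAlive | kAngry));
    assert(UpdateFlags(kAlive, true) == (kAlive | kCanSeePlayer | kAngry));
}
```

A caller would typically write m_Flags = UpdateFlags( m_Flags, canSee ); so the member is read once and written once.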
False LHS Case
   - Store queue snooping only compares bits 52 through 61 of the address
   - So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC.
   - Writing to and reading from memory 4 KB (or a multiple of 4 KB) apart
   - Writing to and reading from memory on a sub-word boundary, e.g. packed short, byte, bool
   - Write a bool then read a nearby one -> ~40 cycle stall
Example
struct BaddyState
{
    bool m_bAlive;
    bool m_bCanSeePlayer;
    bool m_bAngry;
    bool m_bWearingHat;
}; // 4 bytes
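The store-queue comparison from the False LHS slide can be modelled as a mask test on the two addresses. This is a simplified sketch of that description, with a hypothetical function name and made-up addresses; it is not a model of the real hardware beyond the 0xFFC mask quoted in the slides.

```cpp
#include <cassert>
#include <cstdint>

// True when a load address would match a queued store under the comparison
// described in the slides: only the address bits selected by 0xFFC count.
inline bool MayHitStoreQueue(std::uintptr_t storeAddr, std::uintptr_t loadAddr)
{
    return ((storeAddr ^ loadAddr) & 0xFFCu) == 0;
}

void CheckFalseLhsModel()
{
    assert(MayHitStoreQueue(0x10000, 0x10002));  // two bools in the same word
    assert(MayHitStoreQueue(0x10000, 0x11000));  // exactly 4 KB apart
    assert(!MayHitStoreQueue(0x10000, 0x10004)); // neighbouring words differ
}
```

All four bools of a 4-byte struct such as BaddyState share one word, so under this model a store to any of them matches a load of any other.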
Where might we stall?
if ( m_bAlive )
{
    m_bCanSeePlayer = LineOfSightCheck( this, player );
    if ( m_bCanSeePlayer && !m_bAngry )
    {
        m_bAngry = true;
        StartShootingPlayer();
    }
}
Workaround
if ( m_bAlive ) // load, compare
{
    const bool bAngry = m_bAngry; // load
    const bool bCanSeePlayer = LineOfSightCheck( this, player );
    m_bCanSeePlayer = bCanSeePlayer; // store
    if (bCanSeePlayer && !bAngry ) // compare registers
    {
        m_bAngry = true; // store
        StartShootingPlayer();
    }
}
Loop + Singleton Gotcha
What happens here?

 for( int i = 0; i < enemyCount; ++i )
 {
     EnemyManager::Get().DispatchEnemy();
 }
Workaround

EnemyManager & enemyManager = EnemyManager::Get();
for( int i = 0; i < enemyCount; ++i )
{
    enemyManager.DispatchEnemy();
}
Branch Hints
   - Branch misprediction can hurt your performance
   - 24-cycle penalty to flush the instruction queue
   - If you know a certain branch is rarely taken, you can use a static branch hint, e.g.
         if ( __builtin_expect( bResult, 0 ) )

   - Far better to eliminate the branch!
      - Use logical bitwise operations to mask results
      - This is far easier and more applicable in SIMD code
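As an illustration of masking results with bitwise operations, here is a small integer select that avoids a branch. SelectIfNegative is a hypothetical helper; whether it beats a well-predicted branch on a given target is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstdint>

// Returns ifNegative when test < 0, otherwise returns otherwise, using a
// mask built from the sign bit instead of a conditional branch.
inline std::int32_t SelectIfNegative(std::int32_t test,
                                     std::int32_t ifNegative,
                                     std::int32_t otherwise)
{
    // Sign bit as 0 or 1, then negate to get 0x00000000 or 0xFFFFFFFF.
    const std::uint32_t signBit = static_cast<std::uint32_t>(test) >> 31;
    const std::uint32_t mask = 0u - signBit;
    const std::uint32_t result =
        (static_cast<std::uint32_t>(ifNegative) & mask) |
        (static_cast<std::uint32_t>(otherwise) & ~mask);
    return static_cast<std::int32_t>(result);
}

void CheckSelect()
{
    assert(SelectIfNegative(-5, 10, 20) == 10);
    assert(SelectIfNegative(5, 10, 20) == 20);
    assert(SelectIfNegative(0, 10, 20) == 20);
}
```

The same idea carries over to SIMD code, where a comparison produces a per-lane mask and a select picks between two vectors.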
Floating Point Branch Elimination
   - __fsel, __fsels: floating-point select

    // Branching version
    float min( float a, float b )
    {
        return ( a < b ) ? a : b;
    }
    // Branch-free version: __fsels( x, y, z ) yields y if x >= 0, otherwise z
    float min( float a, float b )
    {
        return __fsels( a - b, b, a );
    }
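Since __fsels is a platform intrinsic, a portable reference model helps when checking the argument order. The sketch below only mirrors the select semantics (second argument when the first is >= 0, otherwise the third). SelectReference and MinViaSelect are hypothetical names, and the model itself may compile to a branch.

```cpp
#include <cassert>

// Reference model of floating-point select: y if x >= 0, otherwise z.
inline float SelectReference(float x, float y, float z)
{
    return (x >= 0.0f) ? y : z;
}

// Same argument order as the slide: a - b >= 0 means b is the smaller value.
inline float MinViaSelect(float a, float b)
{
    return SelectReference(a - b, b, a);
}

void CheckMinViaSelect()
{
    assert(MinViaSelect(1.0f, 2.0f) == 1.0f);
    assert(MinViaSelect(2.0f, 1.0f) == 1.0f);
    assert(MinViaSelect(3.0f, 3.0f) == 3.0f);
}
```

On the target hardware the SelectReference call would be replaced by the intrinsic; the model is only useful for unit-testing the logic elsewhere.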
Microcoded Instructions
   - Single instruction -> several, fetched from ROM, pipeline bubble
   - Common example of one to avoid: shift by a variable amount (a shift by an immediate is fine).
        int a = b << c;
   - Minimum 11 cycle latency.
   - If you know the range of values, it can be better to switch to a fixed shift!
       switch( c )
       {
            case 1: a = b << 1; break;
            case 2: a = b << 2; break;
            case 3: a = b << 3; break;
            default: break; // c outside the known range: a is left unchanged
       }
Loop Unrolling
   - Why unroll loops?
      - Fewer branches
      - Better concurrency, instruction pipelining, hide latency
      - More opportunities for compiler to optimise
      - Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies
   - How many times is enough?
      - On average about 4 to 6 times works well
      - Best for loops where the number of iterations is known up front
   - Need to think about the spare iterations: iterationCount % unrollCount
Picking up the Spare
   - If you can, artificially pad your data with safe values to keep it a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.

     for ( int i = 0; i < numElements; i += 4 )
     {
          InlineTransformation( pElements[ i+0 ] );
          InlineTransformation( pElements[ i+1 ] );
          InlineTransformation( pElements[ i+2 ] );
          InlineTransformation( pElements[ i+3 ] );
     }
Picking up the Spare
   - If you can’t pad, run the unrolled loop for numElements & ~3 iterations instead and run to completion in a second loop.
    for ( ; i < numElements; ++i )
    {
          InlineTransformation( pElements[ i ] );
    }
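Putting the two loops together, a complete version might look like the sketch below. InlineTransformation is a hypothetical stand-in (it just doubles a float), and TransformAll is an illustrative wrapper name.

```cpp
#include <cassert>

// Hypothetical stand-in for the slide's InlineTransformation.
inline void InlineTransformation(float& value) { value *= 2.0f; }

void TransformAll(float* pElements, int numElements)
{
    // Largest multiple of 4 that does not exceed numElements.
    const int unrolledEnd = numElements & ~3;

    int i = 0;
    for (; i < unrolledEnd; i += 4) // main loop, unrolled four times
    {
        InlineTransformation(pElements[i + 0]);
        InlineTransformation(pElements[i + 1]);
        InlineTransformation(pElements[i + 2]);
        InlineTransformation(pElements[i + 3]);
    }
    for (; i < numElements; ++i) // picks up the 0 to 3 spare elements
    {
        InlineTransformation(pElements[i]);
    }
}

void CheckTransformAll()
{
    float values[7] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f };
    TransformAll(values, 7);
    for (int k = 0; k < 7; ++k)
    {
        assert(values[k] == 2.0f * static_cast<float>(k + 1));
    }
}
```

With seven elements the unrolled loop handles the first four and the second loop handles the remaining three.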
Picking up the Spare
Alternative method (pros and cons: one branch, but longer code generated). Here i is the index where the unrolled loop stopped, and the cases fall through deliberately.


  switch( numElements & 3 )
 {
 case 3: InlineTransformation( pElements[ i+2 ] );
 case 2: InlineTransformation( pElements[ i+1 ] );
 case 1: InlineTransformation( pElements[ i+0 ] );
 case 0: break;
 }
Loop unrolled… now what?
   - If you unrolled your loop 4 times, you might be able to use SIMD
   - Use AltiVec intrinsics; align your data to 16 bytes
   - 128-bit registers: operate on four 32-bit values in parallel
   - Most SIMD instructions have 1 cycle throughput
   - Consider using SOA data instead of AOS
      - AOS: an array of structures with interleaved posX, posY, posZ
      - SOA: a structure of arrays, one per field, each dimensioned for all elements
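The two layouts can be sketched as plain C++ types. The element count, type names and TranslateX function are assumptions for illustration; the 16-byte alignment follows the AltiVec advice above, and the loop is left scalar rather than written with intrinsics.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumElements = 16; // hypothetical element count

// AOS: one structure per element, fields interleaved in memory.
struct PositionAOS
{
    float posX, posY, posZ;
};

// SOA: one array per field, each aligned for 128-bit loads.
struct PositionsSOA
{
    alignas(16) float posX[kNumElements];
    alignas(16) float posY[kNumElements];
    alignas(16) float posZ[kNumElements];
};

// Touches only the posX array: four consecutive floats map onto one
// 128-bit register, so the loop is a natural candidate for SIMD.
void TranslateX(PositionsSOA& positions, float dx)
{
    for (std::size_t i = 0; i < kNumElements; ++i)
    {
        positions.posX[i] += dx;
    }
}

void CheckTranslateX()
{
    PositionsSOA positions = {};
    positions.posX[3] = 1.0f;
    TranslateX(positions, 2.0f);
    assert(positions.posX[0] == 2.0f);
    assert(positions.posX[3] == 3.0f);
    assert(positions.posY[3] == 0.0f); // other fields are never read or written
}
```

With the AOS layout the same operation would step through memory 12 bytes at a time and pull posY and posZ into the cache even though they are not used.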
Example: FloatToHalf
Example: FloatToHalf4
Questions?

PPU Optimisation Lesson

  • 1. Engineer Learning Session #1 Optimisation Tips for PowerPC Ben Hanke April 16th 2009
  • 5. Established Common Sense  “90% of execution time is spent in 10% of the code”  “Programmers are really bad at knowing what really needs optimizing so should be guided by profiling”  “Why should I bother optimizing X? It only runs N times per frame”  “Processors are really fast these days so it doesn’t matter”  “The compiler can optimize that better than I can”
  • 6. Alternative View
      • The compiler isn’t always as smart as you think it is.
      • Really bad things can happen in the most innocent-looking code because of huge penalties inherent in the architecture.
      • A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.
      • It’s easier to write more efficient code up front than to go and find it all later with a profiler.
  • 7. PPU Hardware Threads
      • We have two of them, running on alternate cycles
      • If one stalls, the other thread runs
      • Not multi-core:
          • Shared execution units
          • Shared access to memory and cache
          • Most registers duplicated
      • Ideal usage:
          • Threads filling in stalls for each other without thrashing the cache
  • 8. PS3 Cache
      • Level 1 Instruction Cache
          • 32 KB, 2-way set associative, 128-byte cache line
      • Level 1 Data Cache
          • 32 KB, 4-way set associative, 128-byte cache line
          • Write-through to L2
      • Level 2 Data and Instruction Cache
          • 512 KB, 8-way set associative
          • Write-back
          • 128-byte cache line
  • 9. Cache Miss Penalties
      • L1 cache miss = 40 cycles
      • L2 cache miss = ~1000 cycles!
      • In other words, random reads from memory are excruciatingly expensive!
      • Reading data with large strides: very bad
      • Consider smaller data structures, or group data that will be read together
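One way to read "group data that will be read together" is a hot/cold split. The sketch below is an illustration only: `FatEntity`, `HotEntity` and `IntegrateRange` are hypothetical names, and the sizes assume 4-byte floats with no extra padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "fat" layout: the update loop only needs posY and velY,
// but each element occupies a whole 128-byte cache line.
struct FatEntity {
    float posY;
    float velY;
    char  coldData[120]; // rarely-read state (illustrative)
};

// Hot fields kept on their own: 16 of these fit in one 128-byte line.
struct HotEntity {
    float posY;
    float velY;
};

static_assert(sizeof(HotEntity) == 8, "sketch assumes 4-byte floats");

// Walks the hot array with a small, constant stride.
inline void IntegrateRange(HotEntity* hot, std::size_t count, float dt) {
    assert(hot != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        hot[i].posY += hot[i].velY * dt;
    }
}
```

With the split layout, one cache line fetch feeds many loop iterations instead of one.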
  • 10. Virtual Functions
      • What happens when you call a virtual function?
      • What does this code do?
            virtual void Update() {}
      • May touch the cache line at the vtable address unnecessarily
      • Consider batching by type for a better iCache pattern
      • If you know the type, maybe you don’t need the virtual: save touching memory to read the function address
      • Even better: maybe the data you actually want to manipulate can be kept close together in memory?
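A minimal sketch of batching by type, assuming the concrete types are known up front. `Bullet`, `Particle` and `World` are made-up names for illustration, not taken from the talk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical concrete types with non-virtual Update functions.
struct Bullet {
    float x;
    float vx;
    void Update(float dt) { x += vx * dt; }
};

struct Particle {
    float life;
    void Update(float dt) { life -= dt; }
};

// One contiguous array per type: no vtable read per object, and the
// same Update code stays hot in the instruction cache for a whole batch.
struct World {
    std::vector<Bullet>   bullets;
    std::vector<Particle> particles;

    void Update(float dt) {
        assert(dt >= 0.0f); // sketch assumes a non-negative time step
        for (Bullet& b : bullets)     { b.Update(dt); }
        for (Particle& p : particles) { p.Update(dt); }
    }
};
```

The trade-off is that adding a new type means adding a new array and loop, rather than just deriving from a base class.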
  • 12. Spot the Stall
        int SlowFunction(int & a, int & b)
        {
            a = 1;
            b = 2;
            return a + b; // a and b might alias, so both are re-read from memory right after the stores
        }
  • 13. Method 1
      • Q: When will this work and when won’t it?
        inline int FastFunction(int & a, int & b)
        {
            a = 1;
            b = 2;
            return a + b;
        }
  • 14. Method 2
        int FastFunction(int * __restrict a, int * __restrict b)
        {
            *a = 1;
            *b = 2;
            return *a + *b; // we promise that a != b
        }
  • 15. __restrict Keyword
      • __restrict only works with pointers, not references (which sucks).
      • Aliasing only applies to identical types.
      • Can be applied to the implicit this pointer in member functions.
          • Put it after the closing parenthesis of the parameter list (where const would go).
          • Stops the compiler worrying that you passed a class data member to the function.
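A small sketch of `__restrict` on the implicit this pointer. It assumes a compiler that accepts the qualifier on member functions (GCC documents this extension). `Accumulator` and `AddTwice` are hypothetical names.

```cpp
#include <cassert>

struct Accumulator {
    int m_Total;

    // The trailing __restrict qualifies the implicit this pointer; the one
    // on the parameter qualifies value. Together they promise the compiler
    // that *value is not m_Total, so *value need not be reloaded after
    // the first store to m_Total.
    void AddTwice(const int* __restrict value) __restrict {
        assert(value != &m_Total); // the promise we are making
        m_Total += *value;
        m_Total += *value;
    }
};
```

Breaking the promise (passing `&acc.m_Total`) is undefined behaviour, so the assert only documents the contract in debug builds.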
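  • 16. Load-Hit-Store
      • What is it?
      • A write followed by a read of the same address
      • The write is still in the PPU store queue when the read is issued
      • Average latency 40 cycles
      • Store queue snooping compares address bits 52 through 61 (implications?)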
  • 17. True LHS
      • Type casting between register files:
            float floatValue = (float)intValue;
            float posY = vPosition.X();
      • Data member as loop counter:
            while( m_Counter-- ) {}
      • Aliasing:
            void swap( int & a, int & b ) { int t = a; a = b; b = t; }
  • 18. Workaround: Data member as loop counter
        int counter = m_Counter; // load into register
        while( counter-- )       // register will decrement
        {
            doSomething();
        }
        m_Counter = 0;           // store to memory just once
  • 19. Workarounds
      • Keep data in the same domain as much as possible
      • Reorder your writes and reads to allow space for latency
      • Consider using word flags instead of many packed bools
          • Load the flags word into a register
          • Perform logical bitwise operations on the flags in the register
          • Store the new flags value, e.g.
                m_Flags = ( initialFlags & ~kCanSeePlayer ) | newCanSeeFlag | kAngry;
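The flags-word idea can be sketched as follows. The bit assignments, the `Baddy` struct and the `MarkAngryWithSight` helper are assumptions for illustration; only the final expression mirrors the slide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignments for the flags word.
constexpr std::uint32_t kAlive        = 1u << 0;
constexpr std::uint32_t kCanSeePlayer = 1u << 1;
constexpr std::uint32_t kAngry        = 1u << 2;
constexpr std::uint32_t kWearingHat   = 1u << 3;

struct Baddy {
    std::uint32_t m_Flags;
};

// One load, register-only bit operations, one store.
// newCanSeeFlag must be either 0 or kCanSeePlayer.
inline void MarkAngryWithSight(Baddy& baddy, std::uint32_t newCanSeeFlag) {
    assert((newCanSeeFlag & ~kCanSeePlayer) == 0u);
    const std::uint32_t initialFlags = baddy.m_Flags;                          // load
    baddy.m_Flags = (initialFlags & ~kCanSeePlayer) | newCanSeeFlag | kAngry;  // store
}
```

Because the whole word is read once and written once, there is no read of a freshly written sub-word neighbour in between.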
  • 20. False LHS Case
      • Store queue snooping only compares address bits 52 through 61
      • So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC
      • Writing to and reading from memory 4 KB (or a multiple of 4 KB) apart
      • Writing to and reading from memory on a sub-word boundary, e.g. packed short, byte, bool
      • Write a bool then read a nearby one -> ~40 cycle stall
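The matching rule above can be modelled with a tiny helper. This is a software model of the rule as stated on the slide (compare `addr & 0xFFC`), not a description of the hardware beyond that. `kSnoopMask` and `MayFalseLHS` are hypothetical names.

```cpp
#include <cstdint>

// Bits 52 through 61 in big-endian bit numbering of a 64-bit address
// correspond to the mask 0xFFC quoted on the slide.
constexpr std::uintptr_t kSnoopMask = 0xFFC;

// True when a load from loadAddr would be treated as hitting a queued
// store to storeAddr under the slide's comparison rule.
constexpr bool MayFalseLHS(std::uintptr_t storeAddr, std::uintptr_t loadAddr) {
    return (storeAddr & kSnoopMask) == (loadAddr & kSnoopMask);
}

static_assert(MayFalseLHS(0x10000, 0x11000), "4 KB apart: same masked bits");
static_assert(MayFalseLHS(0x1000, 0x1001), "neighbouring bytes in one word");
static_assert(!MayFalseLHS(0x1000, 0x1004), "adjacent words do not match");
```

The second case is the packed-bool situation: both bytes live in the same 4-byte word, so the comparison cannot tell them apart.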
  • 21. Example
        struct BaddyState
        {
            bool m_bAlive;
            bool m_bCanSeePlayer;
            bool m_bAngry;
            bool m_bWearingHat;
        }; // 4 bytes
  • 22. Where might we stall?
        if ( m_bAlive )
        {
            m_bCanSeePlayer = LineOfSightCheck( this, player );
            if ( m_bCanSeePlayer && !m_bAngry )
            {
                m_bAngry = true;
                StartShootingPlayer();
            }
        }
  • 23. Workaround
        if ( m_bAlive ) // load, compare
        {
            const bool bAngry = m_bAngry; // load
            const bool bCanSeePlayer = LineOfSightCheck( this, player );
            m_bCanSeePlayer = bCanSeePlayer; // store
            if ( bCanSeePlayer && !bAngry ) // compare registers
            {
                m_bAngry = true; // store
                StartShootingPlayer();
            }
        }
  • 24. Loop + Singleton Gotcha
      • What happens here?
        for( int i = 0; i < enemyCount; ++i )
        {
            EnemyManager::Get().DispatchEnemy();
        }
  • 25. Workaround
        EnemyManager & enemyManager = EnemyManager::Get();
        for( int i = 0; i < enemyCount; ++i )
        {
            enemyManager.DispatchEnemy();
        }
  • 26. Branch Hints
      • Branch misprediction can hurt your performance
      • 24-cycle penalty to flush the instruction queue
      • If you know a certain branch is rarely taken, you can use a static branch hint, e.g.
            if ( __builtin_expect( bResult, 0 ) )
      • Far better to eliminate the branch!
          • Use logical bitwise operations to mask results
          • This is far easier and more applicable in SIMD code
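A scalar sketch of "use logical bitwise operations to mask results". It assumes two's-complement integers, and whether the comparison itself compiles without a branch depends on the compiler and target. `MaskFromBool`, `Select` and `BranchlessMax` are hypothetical helper names.

```cpp
#include <cstdint>

// Turns a bool into 0x00000000 (false) or 0xFFFFFFFF (true).
constexpr std::int32_t MaskFromBool(bool cond) {
    return -static_cast<std::int32_t>(cond);
}

// Equivalent to (cond ? a : b), written as mask-and-merge.
constexpr std::int32_t Select(bool cond, std::int32_t a, std::int32_t b) {
    return (a & MaskFromBool(cond)) | (b & ~MaskFromBool(cond));
}

// Example use: maximum of two values with no if statement.
constexpr std::int32_t BranchlessMax(std::int32_t a, std::int32_t b) {
    return Select(a > b, a, b);
}

static_assert(Select(true, 3, 9) == 3, "mask of all ones keeps a");
static_assert(Select(false, 3, 9) == 9, "mask of zero keeps b");
```

The same mask-and-merge pattern is what SIMD compare and select instructions do per lane.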
  • 27. Floating Point Branch Elimination
      • __fsel, __fsels: floating point select
        // Branching version
        float min( float a, float b )
        {
            return ( a < b ) ? a : b;
        }
        // Branch-free version
        float min( float a, float b )
        {
            return __fsels( a - b, b, a );
        }
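For readers without a PPU compiler, here is a portable stand-in that models the select behaviour implied by the slide's `min`: the result is the second operand when the first is greater than or equal to zero, otherwise the third. `FselReference` and `MinSelect` are hypothetical names, and this reference version may itself compile to a branch on other targets.

```cpp
// Reference model of a floating point select: x >= 0 ? y : z.
constexpr float FselReference(float x, float y, float z) {
    return (x >= 0.0f) ? y : z;
}

// Same shape as the slide's branch-free min: a - b >= 0 means b is smaller
// (or they are equal), so pick b; otherwise pick a.
constexpr float MinSelect(float a, float b) {
    return FselReference(a - b, b, a);
}

static_assert(MinSelect(1.0f, 2.0f) == 1.0f, "a smaller");
static_assert(MinSelect(2.0f, 1.0f) == 1.0f, "b smaller");
```

Note that the subtraction form differs from `a < b ? a : b` for inputs where `a - b` overflows or is NaN, which is worth checking before swapping it into existing code.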
  • 28. Microcoded Instructions
      • A single instruction expands into several, fetched from ROM, causing a pipeline bubble
      • Common example of one to avoid: a shift by a variable amount (shift count held in a register rather than an immediate).
            int a = b << c;
      • Minimum 11 cycle latency.
      • If you know the range of values, it can be better to switch to a fixed shift!
            switch( c )
            {
                case 1: a = b << 1; break;
                case 2: a = b << 2; break;
                case 3: a = b << 3; break;
                default: break;
            }
  • 29. Loop Unrolling
      • Why unroll loops?
          • Fewer branches
          • Better concurrency, instruction pipelining, hide latency
          • More opportunities for the compiler to optimise
          • Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies
      • How many times is enough?
          • On average about 4 to 6 times works well
          • Best for loops where the number of iterations is known up front
      • Need to think about the spare: iterationCount % unrollCount
  • 30. Picking up the Spare
      • If you can, artificially pad your data with safe values to keep it a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.
        for ( int i = 0; i < numElements; i += 4 )
        {
            InlineTransformation( pElements[ i+0 ] );
            InlineTransformation( pElements[ i+1 ] );
            InlineTransformation( pElements[ i+2 ] );
            InlineTransformation( pElements[ i+3 ] );
        }
  • 31. Picking up the Spare
      • If you can’t pad, run the unrolled loop for numElements & ~3 instead and run to completion in a second loop.
        // i carries on from where the unrolled loop stopped
        for ( ; i < numElements; ++i )
        {
            InlineTransformation( pElements[ i ] );
        }
  • 32. Picking up the Spare
      • Alternative method (pros and cons: one branch but longer code generated).
        switch( numElements & 3 )
        {
            case 3: InlineTransformation( pElements[ i+2 ] ); // fall through
            case 2: InlineTransformation( pElements[ i+1 ] ); // fall through
            case 1: InlineTransformation( pElements[ i+0 ] ); // fall through
            case 0: break;
        }
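Putting the unrolled loop and the second "spare" loop together gives something like the sketch below. `InlineTransformation` is given a made-up body (doubling a float) purely so the example runs. `TransformAll` is a hypothetical wrapper name.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in body for illustration only; the real transformation is whatever
// the loop is meant to do, as long as iterations are independent.
inline void InlineTransformation(float& value) {
    value *= 2.0f;
}

inline void TransformAll(float* pElements, std::size_t numElements) {
    assert(pElements != nullptr || numElements == 0);

    // Largest multiple of 4 that does not exceed numElements.
    const std::size_t unrolledEnd = numElements & ~static_cast<std::size_t>(3);

    std::size_t i = 0;
    for (; i < unrolledEnd; i += 4) {
        InlineTransformation(pElements[i + 0]);
        InlineTransformation(pElements[i + 1]);
        InlineTransformation(pElements[i + 2]);
        InlineTransformation(pElements[i + 3]);
    }

    // Pick up the 0 to 3 spare elements.
    for (; i < numElements; ++i) {
        InlineTransformation(pElements[i]);
    }
}
```

With 7 elements, the unrolled loop handles indices 0 to 3 and the second loop handles 4 to 6.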
  • 33. Loop unrolled… now what?
      • If you unrolled your loop 4 times, you might be able to use SIMD
      • Use AltiVec intrinsics, and align your data to 16 bytes
      • 128-bit registers: operate on four 32-bit values in parallel
      • Most SIMD instructions have 1 cycle throughput
      • Consider using SOA data instead of AOS
          • AOS: arrays of interleaved posX, posY, posZ structures
          • SOA: a structure of arrays, one per field, each dimensioned for all elements
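The two layouts can be sketched in plain C++ (no AltiVec intrinsics, so this only shows the data layout, not the SIMD code itself). `PositionAOS`, `PositionsSOA` and `RaiseAll` are hypothetical names, and `std::vector` does not by itself guarantee the 16-byte alignment mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS: one structure per element, fields interleaved in memory.
struct PositionAOS {
    float posX;
    float posY;
    float posZ;
};

// SOA: one array per field, each sized for all elements.
struct PositionsSOA {
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> posZ;
};

// Touches only the posY array: four consecutive floats are four
// consecutive elements, which is the shape a 4-wide SIMD add wants.
inline void RaiseAll(PositionsSOA& positions, float dy) {
    assert(positions.posX.size() == positions.posY.size());
    assert(positions.posY.size() == positions.posZ.size());
    float* y = positions.posY.data();
    const std::size_t count = positions.posY.size();
    for (std::size_t i = 0; i < count; ++i) {
        y[i] += dy;
    }
}
```

In the AOS layout the same operation would stride over posX and posZ as well, pulling unused data through the cache.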