2. WHAT’S SO SPECIAL ABOUT CONSOLE GAME DEV?
NOW THAT CONSOLES MOSTLY RUN PC HARDWARE
Extreme performance optimizations
‒ Until gamers opt for shorter upgrade cycles (the phone/tablet business model)?
‒ Can’t run sub-optimal audio code when competing for cycles on crowded compute queues
Custom hardware, OS, drivers and compilers
‒ To extract max perf from fixed hardware
‒ Helps lengthen platform lifetime
‒ “But but… where’s my OpenCL runtime?”
Low latency
‒ Music games on consoles need it as much as professional music prod software on desktop
‒ But is much harder to achieve reliably when a system is constantly overloaded
2 | PRESENTATION TITLE | NOVEMBER 19, 2013 | CONFIDENTIAL
3. GAME AUDIO DSP ON THE ACP
WHY?
Heavy specialized DSP workloads
‒ Stuff games need badly but don’t really want to deal with
‒ Best fit for dedicated and/or fixed function hardware
‒ Codecs
  ‒ CELP codecs -> party chat
  ‒ 100s of MP3/AT9/AAC decode instances
  ‒ Huge impact on game assets footprint, down/load times
  ‒ Optional output bitstream encoding (AC3/DTS)
‒ Voice recognition
‒ Echo cancelation
Platform wide IP licensing levels the playing field
‒ Good for indie developers
‒ And good for the platform!
Available via asynchronous secure system APIs
4. GAME AUDIO DSP ON THE ACP
WHY NOT?
Exotic hardware and dev environment
‒ Closed to games
‒ Closed to middleware
‒ Platform specific
Asynchronous interface
‒ Can’t have sequential interleaving of DSP back and forth between CPU and ACP w/o latency buildup
‒ But ultimately, we want the DSP pipeline to be data driven (by artists who know nothing about this)
‒ Modularity
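The latency buildup from bouncing DSP between CPU and ACP can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that every asynchronous round trip costs one full audio block of delay. The block size (256 samples), the sample rate (48 kHz), and the function name are assumptions, not figures from the slides.

```python
def pipeline_latency_ms(round_trips, block_size=256, sample_rate=48000):
    """Estimate added latency when each async round trip costs one block.

    Assumption (not from the slides): work submitted to an asynchronous
    processor is only available to the caller on the next audio block.
    """
    block_ms = 1000.0 * block_size / sample_rate
    return round_trips * block_ms


# A chain that visits the ACP once versus one that interleaves
# CPU and ACP stages four times.
single_trip = pipeline_latency_ms(1)
interleaved = pipeline_latency_ms(4)
print(f"1 trip: {single_trip:.2f} ms, 4 trips: {interleaved:.2f} ms")
```

Under these assumptions the added delay grows linearly with the number of hops, which is why a free-form, data-driven chain that crosses the CPU/ACP boundary at arbitrary points is hard to keep low latency.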
Slow clock rate @ 800MHz, very limited SIMD and no FP support
‒ Tough sell against Jaguar for many DSP algorithms
‒ Very tight local memory shared by multiple DSP cores
Already pretty busy with codec loads and system tasks
5. GAME AUDIO DSP ON THE GPU
WHY?
Much more demand for real-time effects today and will keep growing
CPU FLOPS likely to stagnate and could even decline in HSA as CUs take over SIMD workloads
Flexibility: some games are CPU bound, others are GPU bound…
hUMA is a game changer (removes NUMA’s main bottleneck: GPU write back)
Compute queues with prioritized scheduling and even some form of preemption
Many real-time audio DSP algorithms work well on wide SIMD units
‒ FFT convolution (spectral processing in general)
‒ Mixing, resampling, wave shaping, etc…
Mostly coalesced mem accesses
Low/med bandwidth (< 1GB/s)
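A back-of-the-envelope check of the bandwidth claim, as a minimal sketch. The voice count, sample rate, and sample format are illustrative assumptions, not numbers from the slides.

```python
def stream_bandwidth_mb_s(voices, channels=1, sample_rate=48000,
                          bytes_per_sample=4):
    """Bandwidth of touching `voices` PCM streams once, in MB/s (1 MB = 1e6 bytes)."""
    return voices * channels * sample_rate * bytes_per_sample / 1e6


# Hypothetical load: 1000 mono float32 voices, each read once and written once.
read_mb_s = stream_bandwidth_mb_s(1000)
total_mb_s = 2 * read_mb_s
print(f"read: {read_mb_s:.0f} MB/s, read+write: {total_mb_s:.0f} MB/s")
```

Even this fairly heavy hypothetical load stays well below 1 GB/s, which is small next to what graphics workloads move.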
6. GAME AUDIO DSP ON THE GPU
WHY NOT?
Some algorithms do not work (as) well on wide SIMD units
‒ IIR filters, ADPCM decodes, dynamics: data recursion causes thread interdependencies within wavefronts
‒ Typical AAA game runs 1000s of biquads at various stages in the filtergraph
Workloads may require batch voice processing to achieve high CU efficiency
‒ Build 2D grids (channels x samples) or 3D grids (channels x subbands x samples)
‒ Swizzling is key but watch out for runtime cost as SIMD widens (static vs dynamic)
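The recursion problem and the batching workaround can be sketched in a few lines. This is an illustrative scalar model, not GPU code: the function names are hypothetical, and the inner loop over channels stands in for the lanes of a wavefront. Per-voice buffers are swizzled into a 2D grid (samples x channels) so that one time step of every channel is contiguous, then the same biquad runs on all channels in lockstep.

```python
def swizzle(voices):
    """Turn per-voice buffers [channel][sample] into a grid [sample][channel]."""
    return [list(frame) for frame in zip(*voices)]


def unswizzle(grid):
    """Inverse of swizzle: [sample][channel] back to [channel][sample]."""
    return [list(channel) for channel in zip(*grid)]


def biquad_batch(grid, b0, b1, b2, a1, a2):
    """Run the same Direct Form I biquad on every channel of a swizzled grid.

    The outer loop over time must stay sequential because y[n] depends on
    y[n-1] and y[n-2]. The inner loop over channels has no dependencies,
    so each channel could map to one SIMD lane.
    """
    lanes = len(grid[0]) if grid else 0
    x1 = [0.0] * lanes
    x2 = [0.0] * lanes
    y1 = [0.0] * lanes
    y2 = [0.0] * lanes
    out = []
    for frame in grid:                  # sequential: data recursion
        y = [0.0] * lanes
        for c in range(lanes):          # parallel: independent lanes
            y[c] = (b0 * frame[c] + b1 * x1[c] + b2 * x2[c]
                    - a1 * y1[c] - a2 * y2[c])
        x2, x1 = x1, list(frame)
        y2, y1 = y1, y
        out.append(y)
    return out


# Two voices, each fed an impulse, through y[n] = x[n] + 0.5 * y[n-1].
voices = [[1.0, 0.0, 0.0, 0.0],
          [2.0, 0.0, 0.0, 0.0]]
result = unswizzle(biquad_batch(swizzle(voices), 1.0, 0.0, 0.0, -0.5, 0.0))
print(result)
```

The swizzle and unswizzle passes are the runtime cost the slide warns about: they touch every sample without doing any filtering, and the grid only pays off when enough channels share the same filter topology.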
Batch processing goes against free form MaxMSP model artists are pushing for
‒ Unique DSP chain for each sound “just because we can!”
‒ Data driven filtergraph and DSP pipeline
Complex prioritized scheduling & dispatching compute queues
‒ Do not prevent intermittent CU saturation caused by large graphics workloads
‒ Risky for low latency direct path audio DSP
Proprietary hardware, drivers and shader compilers (PSSL)
‒ Audio middleware will need some incentive to move up there
‒ Most will probably stay on the CPU
7. GAME AUDIO DSP ON JAGUAR
WHY?
Well known and open x64 dev environment
‒ Middleware friendly
‒ CLANG/LLVM solid & stable
Full FP unit with SSE4 support
Early PA is surprisingly good for compiled intrinsics code
‒ ~10% slower than core i7 @ same clock rate
‒ GDDR5 latency is not an issue
‒ < ~50% of 1 core @ 1.6GHz running the entire KZSF filtergraph
Only reliable solution for ultra low latency
‒ Music and rhythm games
‒ Run 100% on CPU (including decoding)
8. GAME AUDIO DSP ON JAGUAR
WHY NOT?
“Weak laptop CPU” compared to top of the line on desktop
‒ No FMA4
‒ Slow clock @ 1.6GHz (compared to typical desktop)
256bit AVX mostly useless
Possible bottleneck down the line
9. GAME ENGINE CODE
THIN COMPUTE
3D audio
‒ Sound emitters (distance, directionality and size modeling)
‒ Sound listeners (mic and ear modeling)
‒ Sound geometry (collision meshes)
‒ Deeper physical modeling of sound propagation
‒ Simple ray casting (occlusion, obstruction, indirect audio)
‒ Advanced ray casting (diffraction, real-time individual early reflection tracking)
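As a rough illustration of the simple ray casting case, the sketch below tests whether the straight path from an emitter to a listener is blocked by any triangle of a collision mesh. It uses the standard Möller-Trumbore intersection test restricted to a segment. The function names and the tuple-based vector type are hypothetical, and a real engine would add a spatial acceleration structure rather than loop over every triangle.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def segment_hits_triangle(start, end, tri, eps=1e-9):
    """Möller-Trumbore test restricted to the segment start -> end."""
    v0, v1, v2 = tri
    direction = sub(end, start)
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    p = cross(direction, edge2)
    det = dot(edge1, p)
    if abs(det) < eps:              # segment parallel to the triangle plane
        return False
    inv_det = 1.0 / det
    t_vec = sub(start, v0)
    u = dot(t_vec, p) * inv_det     # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return False
    q = cross(t_vec, edge1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return False
    t = dot(edge2, q) * inv_det     # position of the hit along the segment
    return 0.0 < t < 1.0            # hit lies strictly between the endpoints


def is_occluded(emitter, listener, triangles):
    """True if any triangle blocks the direct emitter -> listener path."""
    return any(segment_hits_triangle(emitter, listener, tri)
               for tri in triangles)


# A single wall triangle in the plane x = 0.
wall = [((0.0, -1.0, -1.0), (0.0, 1.0, -1.0), (0.0, 0.0, 1.0))]
emitter = (-1.0, 0.0, 0.0)
print(is_occluded(emitter, (1.0, 0.0, 0.0), wall))    # listener behind the wall
print(is_occluded(emitter, (-0.5, 0.0, 0.0), wall))   # listener on the same side
```

Each emitter/listener pair is independent of the others, which is what makes this kind of query a candidate for the thin compute model: many short, identical tests over shared geometry.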
Physics
‒ Rigid body dynamics (collisions, friction, destruction)
‒ Fluid dynamics (turbulence)
Animation, special FX
‒ Inline audio sequencing and modulation
‒ Foley, coarse granular synthesis
10. CONCLUSIONS
HSA + hUMA is a great combo for high perf game audio!
‒ Maximized perf per W from specialized hardware (CPU + GPU + ACP)
‒ Our challenge is to figure out what to run where and when
ACP is a great fit for codecs and OS services
‒ But not for modular synthesis and highly customized DSP pipelines
GPU is a great fit for mid/high latency DSP and high level 3D thin compute
‒ Indirect (reflected) audio
‒ Convolution reverb
‒ 3D ray casting for occlusion/obstruction/diffraction
CPU is still the best fit for everything else:
‒ Open modular synthesis frameworks and middleware
‒ Low latency audio