/* Further Reading */
- Steven Tovey & Stephen McAuley, “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine”, GPU Pro
- Stephen McAuley & Steven Tovey, “A Bizarre Way to do Real-Time Lighting”, Develop in Liverpool 2009
Damage data consists of a position offset, a normal offset, and scratch and dent levels.
We stream the damage data in 16KB chunks, because 16KB is the maximum size of a single MFC DMA transfer.
Don’t want lumpiness if parallel read/write.
Rim lighting here.
Tyres use low-power specular.
Brake lights.
Used on alloys for low-power specular.
Used for the scratch lighting, again for low-power specular.
We need to look at the pipeline of the graphics card to work out how we can move more of our GPU work onto the SPUs. Two main areas we can insert data – either through vertices at the top, or textures at the fragment stage. Sadly, we can’t hook into the rasteriser, which would be ace.
Of course, these look-up textures end up being screen-space look-up textures, which means some sort of deferred rendering…
I have a problem with forward rendering. I think most people traditionally design their engine this way, especially on 360 and PC. But all the work is done in the fragment shader, so when you port to the PS3, with its slower fragment shader unit, your whole game runs slower. Although you can use EDGE to speed up your vertex processing and your post-processing, both only step around the core of the issue: you’re fragment-shader bound, and there’s no easy way of solving that.
We found a light pre-pass renderer suited our goals pretty well. It’s a halfway house between traditional and deferred rendering.
We render a rear-view mirror, cube map reflections for the cars and planar reflections for the road and water in addition to the pre-pass and main views. Multi-threaded rendering helps a lot!
Deferring by a frame isn’t ideal. Either you just use the previous frame’s lighting buffer for the next frame, with obvious artefacts (especially if you’re doing a racing game like us), or you have to add a frame of latency.

I don’t think adding frames of latency is ideal, especially for cross-platform games. If you add a frame of latency on the PS3, are you going to do the same on the 360? If you’re not, then gameplay could be different between the two platforms.

I’m not saying this is something I’d never do; in lots of circumstances you’ll have to. But avoid it where you can, and this is one instance.
If we wanted to take this further for future projects, we could add shadow maps in at the start of our pipeline, then do an exponential blur on the SPUs whilst we’re rendering the pre-pass geometry…
This is real multi-threaded graphics processing, with multiple processors doing different jobs at the same time. Therefore, architect your engine accordingly!

Having small graphics jobs allows you to spread the workload. Obviously, not everything can be done like this. Some things will most likely have to be deferred a frame, adding a frame of latency, such as post-processing or MLAA. But there are lots of smaller tasks that don’t have to be, from SSAO to blurring exponential shadow maps. You have to find things to parallelise with!

Think about the data again! Rendering has lots of stages, each with its own inputs and outputs. What could sync with what?
We combine the normals and depth into one 32-bit buffer. This is an optimisation as it halves the inputs into the SPU program, but also allows us to keep the depth buffer in local memory which is good for performance.
The first step, but the biggest stumbling block!
No blocking! Our jobs are optionally dependent on a label.
To be accurate, we have a jump-to-self per SPU.
When we load in a tile, we quickly iterate over every pixel and calculate minimum and maximum depth.

No need to use a stencil buffer to cull out the sky, as depth min and max will do it for us. (Remember, we don’t have the stencil buffer as we’re not using the depth buffer!)

This technique is really useful for a variety of things, including depth of field (check out Matt Swoboda’s optimisation in PhyreEngine).
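A scalar sketch of that per-tile pass (the real version runs on the SPU, vectorised; the assumption here is a 16-bit depth in the low half of each packed pixel, and the function name is ours):

```c
#include <stdint.h>

/* Scan a tile of packed normal/depth pixels and return the depth bounds.
   A tile whose min depth equals the far plane is all sky and can be
   skipped entirely -- no stencil buffer needed. */
static void tile_depth_bounds(const uint32_t *tile, int pixel_count,
                              uint16_t *out_min, uint16_t *out_max)
{
    uint16_t dmin = 0xffff, dmax = 0;
    for (int i = 0; i < pixel_count; ++i) {
        uint16_t d = (uint16_t)(tile[i] & 0xffff); /* depth in low 16 bits */
        if (d < dmin) dmin = d;
        if (d > dmax) dmax = d;
    }
    *out_min = dmin;
    *out_max = dmax;
}
```

The same bounds can then reject lights whose depth range doesn’t intersect the tile at all.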
This is actually the easiest bit: just write the lighting equations in intrinsics! However, they really have to be fast, otherwise performance just won’t be good enough. Next up are some helpful tips for optimisation.
So we triple buffer. It turns out we have plenty of local store left, as it’s a simple job and our job size was relatively small. Another reason to write in si intrinsics, though, as it keeps the code size down!
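The triple-buffering pattern, sketched in plain C with memcpy standing in for the MFC DMA calls (all names here are ours; on the SPU the gets and puts would be asynchronous mfc_get/mfc_put with tag waits, so the transfers genuinely overlap the processing):

```c
#include <stdint.h>
#include <string.h>

#define TILE_WORDS 16  /* arbitrary tile size for the sketch */

/* memcpy stands in for DMA; on the SPU these would be async and we'd
   wait on DMA tag groups instead of completing immediately. */
static void dma_get(uint32_t *ls, const uint32_t *mem, int n) { memcpy(ls, mem, n * 4); }
static void dma_put(uint32_t *mem, const uint32_t *ls, int n) { memcpy(mem, ls, n * 4); }

static void light_tile(uint32_t *tile, int n)
{
    for (int i = 0; i < n; ++i) tile[i] += 1; /* placeholder for the lighting */
}

void run_tiles(uint32_t *tiles, int tile_count)
{
    /* Three local-store buffers: incoming, processing, outgoing. */
    uint32_t buf[3][TILE_WORDS];
    for (int i = 0; i < tile_count; ++i) {
        uint32_t *cur = buf[i % 3];  /* cycle through the three buffers */
        /* With real async DMA, the get for tile i+1 and the put for
           tile i-1 would be in flight while we process tile i. */
        dma_get(cur, tiles + i * TILE_WORDS, TILE_WORDS);
        light_tile(cur, TILE_WORDS);
        dma_put(tiles + i * TILE_WORDS, cur, TILE_WORDS);
    }
}
```

Cycling the buffer index modulo 3 is the whole trick: a tile is never read into the buffer that’s still being written out.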
Just like Ste said earlier, this is a big win. Probably a good rule of thumb for most SPU jobs!
When kicking SPU jobs off on the RSX, you have to be careful as you can interfere with jobs the PPU is running. This is where sync-free systems are a win! We’re lucky as we just avoided the physics, but also, running only on 3 SPUs was a good idea so we had 3 free for other tasks. See how quick the rendering is even though we’re rendering so many views!