Asynchronous Compute: Let the debate continue
With the GTX 900-series of graphics cards and the Maxwell architecture, NVIDIA received a lot of flak for not having full and proper support for asynchronous shaders, an idea implemented in DirectX 12 to enable better use of GPU processing power and to improve performance and efficiency. Even though there are only a handful of games out that utilize asynchronous shading capability, including Ashes of the Singularity and the new Hitman, NVIDIA cards were under constant fire.
Asynchronous compute and workloads are useful when a GPU is doing multiple tasks at the same time, in addition to pixel rendering. These workloads could include physics processing, audio processing, post-processing of already rendered frames and more specific jobs like late time warp for VR experience enhancements.
Pascal improves the story dramatically for NVIDIA, though there will still be debate as to how its implementation of asynchronous compute compares to AMD's GCN designs. NVIDIA sees asynchronous computing as creating two distinct scenarios: overlapping workloads and time critical workloads.
Overlapping workloads come into play when a GPU cannot fill its processing capability with a single workload alone, leaving gaps or bubbles in the compute pipeline that degrade efficiency and slow down the combined performance of the system. This could be PhysX processing for GeForce GPUs or it might be a post-processing step that a game engine uses to filter the image as a final step. In Maxwell, this load balancing had to work with a fixed partitioning model. Essentially, the software had to say upfront how the GPU's time should be divided between the workloads in contention. If the balance between the workloads remains steady, this can be an efficient model, but any shift in the workloads would mean either unwanted idle time or jobs not completing in the desired time frame. Pascal addresses this by enabling dynamic load balancing that monitors the GPU as work is added, allowing the secondary workload to fill the bubbles in the pipeline with compute.
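The difference between the two scheduling models can be sketched with a toy simulation. Everything here is hypothetical — invented capacity numbers and a deliberately simplified tick-based scheduler, not NVIDIA's actual hardware behavior — but it shows where the bubbles come from under a static split:

```python
# Toy model contrasting Maxwell-style static partitioning with
# Pascal-style dynamic load balancing. All numbers are hypothetical.

def static_partition(gfx_work, compute_work, gfx_share, capacity=10):
    """Each workload is locked to its fixed slice of GPU capacity."""
    gfx_cap = capacity * gfx_share
    comp_cap = capacity - gfx_cap
    ticks = 0
    while gfx_work > 0 or compute_work > 0:
        # Unused capacity in one partition cannot help the other:
        # that is the "bubble" that degrades efficiency.
        gfx_work = max(0, gfx_work - gfx_cap)
        compute_work = max(0, compute_work - comp_cap)
        ticks += 1
    return ticks

def dynamic_balance(gfx_work, compute_work, capacity=10):
    """Idle capacity left by the graphics workload is handed to compute."""
    ticks = 0
    while gfx_work > 0 or compute_work > 0:
        used = min(gfx_work, capacity)
        gfx_work -= used
        spare = capacity - used          # bubbles get reassigned
        compute_work = max(0, compute_work - spare)
        ticks += 1
    return ticks

# Graphics finishes early, so the static 70/30 split leaves bubbles:
print(static_partition(30, 60, gfx_share=0.7))  # 20 ticks
print(dynamic_balance(30, 60))                  # 9 ticks
```

The static split is only optimal when the upfront guess matches reality; once the graphics workload shrinks, its reserved capacity sits idle while compute starves.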
Time critical workloads create a different problem – they need prioritization and need to be inserted ASAP. An example of this is the late time warp used by the Oculus Rift to morph the image at the last possible instant with the most recent motion input data. With Maxwell, there was no way to have granular preemption; the system had to set a fixed time at which to ask for the asynchronous time warp (ATW) to start, meaning that it would often leave GPU compute performance on the table, under-utilizing the hardware.
Pascal is the first GPU architecture to implement a pixel level preemption capability for graphics. The graphics units will keep track of their intermediate progress on the current rendering workload so that they can stop, save their state and move off the hardware to allow the preempted workload to be addressed quickly. NVIDIA tells us the entire process of context switching can occur in less than 100 microseconds after the last pixel shading work is finished. Similarly for compute tasks, Pascal integrates thread level preemption. If you happen to be running CUDA code, Pascal can support preemption down to the instruction level!
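To see why preemption granularity matters for something like ATW, a back-of-the-envelope model helps. The numbers below are illustrative, not measured hardware values — the point is simply that the worst-case wait for a priority job scales with how often the renderer can save its state and yield:

```python
# Toy model of preemption granularity: a high-priority job (e.g. VR
# time warp) can only jump in at the renderer's next save point.
# Granularities and timings here are illustrative, not hardware data.

def worst_case_latency(frame_time_us, preemption_points):
    """Worst-case wait before a priority job can start, given how many
    places per frame the renderer can save state and yield."""
    # A request can arrive just after a save point has passed, so it
    # waits one full interval between points.
    return frame_time_us / preemption_points

frame = 11_000  # ~11 ms frame at 90 Hz, in microseconds

# Coarse, draw-call-level preemption: few chances per frame to yield.
print(worst_case_latency(frame, preemption_points=10))    # 1100.0 us

# Pixel-level preemption: far more save points per frame, so the
# switch can begin almost immediately.
print(worst_case_latency(frame, preemption_points=1000))  # 11.0 us
```

With millisecond-scale worst cases, the driver has to schedule ATW conservatively; with pixel-level granularity the sub-100-microsecond context switch NVIDIA quotes becomes the dominant cost.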
The combination of dynamic scheduling and pixel/thread level preemption in hardware improves NVIDIA's performance on asynchronous compute and workloads pretty dramatically. In the asynchronous time warp example above, Pascal will be able to give the GPU more time for rendering tasks than Maxwell could, waiting until the last moment to request the time warp via preemption. This capability is already built in and supported by Oculus.
Asynchronous compute changes are a big part of NVIDIA's GTX 1080 performance improvement claims, pointed squarely at VR. Compared to the GTX 980, based on NVIDIA's claims, the GTX 1080 offers 1.7x the performance in standard gaming but is 2.7x faster in VR, in large part due to the changes listed above.
Does this mean that NVIDIA is on-par or ahead of AMD in terms of asynchronous compute? It's hard to say, as the implementations are very different between the two architectures. AMD GCN still has the Asynchronous Compute Engines that use asynchronous shaders, allowing multiple kernels to execute on the GPU concurrently without preemption. AMD also recently introduced Quick Response Queues in second generation GCN products, which allow developers to flag higher priority async shaders that are time sensitive, like the Rift ATW.
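That distinction can be sketched as well: in a quick-response-queue style model nothing gets evicted from the GPU; the time-sensitive job simply receives a larger share of each scheduling cycle. A toy Python model, with entirely hypothetical work amounts and weights:

```python
# Toy sketch of priority-weighted concurrent execution (the idea behind
# a "quick response queue"): jobs share capacity in proportion to their
# weight, and nothing is ever preempted. All numbers are hypothetical.

def concurrent_with_priority(jobs, capacity=10):
    """jobs: name -> [remaining_work, weight]. Returns finish tick per job."""
    finished = {}
    tick = 0
    while len(finished) < len(jobs):
        tick += 1
        active = [n for n in jobs if n not in finished]
        total_w = sum(jobs[n][1] for n in active)
        for n in active:
            # Each active job gets capacity proportional to its weight.
            jobs[n][0] -= capacity * jobs[n][1] / total_w
            if jobs[n][0] <= 0:
                finished[n] = tick
    return finished

# Equal weights: the time warp waits its turn alongside rendering.
print(concurrent_with_priority({"render": [200, 1], "timewarp": [40, 1]}))
# -> {'timewarp': 8, 'render': 24}

# Higher weight on the time-sensitive job: it completes much sooner,
# while the render job still finishes at the same tick.
print(concurrent_with_priority({"render": [200, 1], "timewarp": [40, 4]}))
# -> {'timewarp': 5, 'render': 24}
```

The contrast with Pascal's approach: preemption stops the render job outright to run the priority task, while a weighted concurrent queue lets both keep running and just skews the split.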
At the end of the day it comes down to the resulting performance of each product. We are working on a couple of interesting ways to test the asynchronous compute capability of GPUs directly to see how things stack up from a scientific viewpoint, but when the rubber hits the road, which GPU gets you the highest frame rate and lowest latency? That we can test today.