The Polaris Architecture – 4th Generation GCN
Polaris marks the 4th generation of AMD’s GCN architecture, an evolution of a design put in place with the HD 7000 series of graphics cards in 2011. This iteration adds some crucial new pieces to the puzzle, though it does so without drastic fundamental changes that might have caused problems on a new process node. Additions and changes include improvements in geometry processing, variable resolution rendering, memory controller and compression changes, asynchronous compute modifications and display output support.
From a high level, the Radeon RX 480 looks very similar to previous GCN-based chips. The block diagram above details what we already know: 36 Compute Units, 2304 stream processors, 1 GCP (Graphics Command Processor), 4 ACEs (Asynchronous Compute Engines) and 2 HWS (hardware schedulers). There are four geometry processors, one for each shader engine, along with 144 texture units and 32 ROPs / raster operators. If you have seen a diagram of the Radeon R9 290X Hawaii GPU, this is similar. The inclusion of a second hardware scheduler should help asynchronous compute capability and is made even more interesting by the fact that it apparently existed in secret on Hawaii as well.
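For readers who like to sanity check a block diagram, the unit counts above can be broken down per Compute Unit and per shader engine. This is only a quick arithmetic sketch: the totals come from the paragraph above, while the even split across units and all variable names are my own assumptions for illustration.

```python
# Totals quoted in the text above.
COMPUTE_UNITS = 36
STREAM_PROCESSORS = 2304
TEXTURE_UNITS = 144
SHADER_ENGINES = 4  # one geometry processor per shader engine

# Assumption: resources are divided evenly across units.
assert STREAM_PROCESSORS % COMPUTE_UNITS == 0
assert TEXTURE_UNITS % COMPUTE_UNITS == 0
assert COMPUTE_UNITS % SHADER_ENGINES == 0

sp_per_cu = STREAM_PROCESSORS // COMPUTE_UNITS   # stream processors per CU
tmu_per_cu = TEXTURE_UNITS // COMPUTE_UNITS      # texture units per CU
cu_per_engine = COMPUTE_UNITS // SHADER_ENGINES  # CUs per shader engine

print(f"{sp_per_cu} stream processors per CU")
print(f"{tmu_per_cu} texture units per CU")
print(f"{cu_per_engine} CUs per shader engine")
```

The per-CU figures match the familiar GCN layout of 64 stream processors and 4 texture units per Compute Unit.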
One of the areas where GCN has continually fallen behind NVIDIA’s GPU architectures is geometry processing. AMD is hoping to improve that situation somewhat with Polaris by increasing throughput. The architecture adds primitive discard acceleration that removes triangles that are hidden or are zero pixels in size from the pipe before processing occurs. The performance benefit of this discard acceleration scales up as you increase multi-sample antialiasing levels. A new index cache for small, instanced geometry lowers the overhead of data movement on the chip as well, helping to improve total geometry throughput in workloads that use instancing to draw large numbers of objects.
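To make the discard idea concrete, here is a minimal software sketch of the kind of test involved. It is not AMD's hardware logic, which the company did not detail. It assumes screen-space triangles, one sample per pixel located at the pixel center, and a conservative bounding-box check; every function name is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def signed_area(tri: Triangle) -> float:
    """Signed screen-space area; zero means the triangle is degenerate."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def _span_misses_centers(lo: float, hi: float) -> bool:
    """True if no pixel center (n + 0.5) lies in the closed range [lo, hi]."""
    return math.ceil(lo - 0.5) > math.floor(hi - 0.5)


def bbox_misses_all_pixel_centers(tri: Triangle) -> bool:
    """Conservative check: the bounding box contains no pixel center."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return (_span_misses_centers(min(xs), max(xs))
            or _span_misses_centers(min(ys), max(ys)))


def should_discard(tri: Triangle) -> bool:
    """Drop triangles that cannot produce any pixels (assumed criteria)."""
    return signed_area(tri) == 0.0 or bbox_misses_all_pixel_centers(tri)


def discard_filter(tris: List[Triangle]) -> List[Triangle]:
    """Return only the triangles worth sending down the pipeline."""
    return [t for t in tris if not should_discard(t)]


tiny = ((0.1, 0.1), (0.3, 0.1), (0.1, 0.3))        # falls between pixel centers
degenerate = ((0.0, 0.0), (2.0, 2.0), (4.0, 4.0))  # collinear, zero area
big = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))         # covers several pixels

surviving = discard_filter([tiny, degenerate, big])
print(len(surviving), "of 3 triangles survive")
```

With MSAA there are more sample points per pixel, so a real implementation would test against the sample pattern rather than pixel centers alone. The sketch skips that for brevity.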
Though the operations per clock of Polaris are identical to those of Hawaii, AMD did put in work to improve the efficiency of the shaders to decrease power consumption. Things like instruction prefetch changes, larger buffer sizes per wave of instructions, a slightly tuned L2 cache and support for native FP16 and Int16 all work in favor of lower power. Though they do not sound significant, these changes should result in a net shader efficiency improvement of 15% for the RX 480 / Polaris when compared to the R9 290 / Hawaii.
Shader intrinsic functions are a feature that is common in the console space but is only now being brought to the PC. The idea is that content developers can insert assembly directly into their code without having to write the entire application in assembly. This allows for specific cases where a coder knows the most efficient way to do something to actually implement it without adversely affecting the rest of the application. With shader intrinsic functions, developers will be able to access low level shader functions that wouldn’t otherwise be available and loop operations without overhead penalties.
The memory controller has been improved with Polaris to support GDDR5 memory at up to 8 Gbps, resulting in 256 GB/s of bandwidth. I did ask specifically: the Polaris 10 GPU does NOT support GDDR5X. Other changes in the memory system result in higher effective and relative memory throughput, including updated DCC (delta color compression) and a doubling of the size of the L2 cache when compared to the R9 290.
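The 256 GB/s figure follows directly from the per-pin data rate and the width of the memory bus. The short sketch below shows the arithmetic. The 256-bit bus width is the RX 480's memory interface width, which the paragraph above does not state, and the function name is just for illustration.

```python
def peak_bandwidth_gb_per_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin rate times bus width, bits to bytes."""
    return data_rate_gbps * bus_width_bits / 8


GDDR5_DATA_RATE_GBPS = 8.0   # per-pin rate quoted in the text
RX480_BUS_WIDTH_BITS = 256   # assumption: not stated in the text above

rx480_bandwidth = peak_bandwidth_gb_per_s(GDDR5_DATA_RATE_GBPS, RX480_BUS_WIDTH_BITS)
print(f"{rx480_bandwidth:.0f} GB/s")
```

This is a theoretical peak. Features like delta color compression raise the effective throughput by reducing how much data has to cross the bus, not by changing this number.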
As you should expect, AMD continues to push its technological advantage in asynchronous compute and was open to discussing the differences between preemptive, concurrent and prioritized compute models. While NVIDIA’s Pascal definitely improved on the company’s implementation of asynchronous capability, AMD still has the advantage in some key areas, including the addition of the Quick Response Queue, which allows dynamically shifting priority to be assigned to in-flight workloads, adjusting compute performance as the application demands. This is an area that can be utilized by late warp functions in Oculus VR.
Part of what makes this granularity possible is the inclusion of dedicated hardware schedulers in the silicon that offload scheduling and are able to administer real-time prioritized queues. With Polaris, AMD has now enabled two HWS units. According to AMD, two are already present on Fiji-based products, but they were only used internally to validate in testing that they could be used in tandem. The result is a dual HWS implementation in Polaris.