Tessellation, Texture Filtering, Compute Architecture and Media Processing
Along with these compute and architecture changes come quite a few additional performance improvements, starting with tessellation. When NVIDIA introduced Fermi, it far exceeded the tessellation performance that AMD offered in its Radeon 5000 series of cards, and while both the Barts (6800) and Cayman (6900) architectures improved on it, NVIDIA still holds a pretty dramatic lead.
Southern Islands looks to improve tessellation performance by up to 4x while not "wasting die space," as AMD claims NVIDIA has done.
The 9th generation tessellation engine from AMD, built inside the dual geometry engines, improves performance with off-chip buffers and slightly larger caches. The changes AMD has made will improve tessellation at each factor (used to determine tessellation levels) and in many cases, the performance difference is notable.
AMD has provided a graph that demonstrates the scaling ability of the new tessellation engine on Southern Islands when compared to the Radeon HD 6970. According to AMD, raw tessellation throughput is between 1.6x and 4x higher than on the HD 6970, with the majority of the improvement found beyond a tessellation factor of 9 or 10. Games and applications that use tessellation heavily, like Unigine Heaven, Lost Planet 2 and Crysis 2, see gains ranging from 55% to 139% in relative performance.
Whether or not this is enough to catch the current generation of GTX 580 designs remains to be seen, though with a clear improvement in pixel shading power as well (2048 SPs), the Radeon HD 7970 will obviously become a performance leader.
Being as upfront and honest as they tend to be, AMD discussed their texture filtering quality quite a bit at the technology day in Austin last month. Improvements in the algorithm have greatly reduced the shimmering artifacts seen on certain textures without having to blur them. AMD claims that this is all done without any performance penalty and is the result of simple adjustments on their end; no major hardware changes were needed.
The Southern Islands as a Compute Architecture
Even though AMD was focused almost entirely on the SI architecture as a gaming platform, there is no denying that many of the architectural changes are a first step in moving AMD toward a more heterogeneous computing environment. That is, after all, the primary goal of the Fusion System Architecture (FSA) detailed by Demers at AFDS this June.
SI has a pair of Asynchronous Compute Engines (ACE) that allow for independent scheduling and work dispatch, which will improve multi-tasking efficiency and context switching as well. These engines can operate in parallel with the graphics command processor, essentially allowing SI-based parts to work on two different workloads at the same time. The matching pair of DMA engines can completely saturate a PCI Express 3.0 x16 connection with 16 GB/s of bidirectional bandwidth.
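For a sense of where a number like 16 GB/s comes from, here is a rough back-of-the-envelope sketch. It assumes the PCI Express 3.0 signalling rate of 8 GT/s per lane and 128b/130b line encoding (figures from the PCIe 3.0 specification, not from AMD's presentation) and ignores packet and protocol overhead, so real transfers will land somewhat lower.

```python
# Rough theoretical bandwidth of a PCI Express 3.0 x16 link, one direction.
# Assumptions (not from the article): 8 GT/s per lane, 128b/130b encoding,
# no allowance for packet headers or other protocol overhead.
GT_PER_S_PER_LANE = 8.0           # gigatransfers per second, per lane
ENCODING_EFFICIENCY = 128 / 130   # payload bits per transferred bit
LANES = 16


def pcie3_bandwidth_gb_s(lanes=LANES):
    """Return the usable bandwidth in GB/s for one direction of the link."""
    gbit_per_s = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gbit_per_s / 8  # 8 bits per byte


if __name__ == "__main__":
    print(f"{pcie3_bandwidth_gb_s():.2f} GB/s per direction")
```

The result is just under 16 GB/s in each direction, which is in line with the rounded figure quoted for the DMA engines.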
As Josh discussed in his original analysis, AMD has the ability to adjust the ratio of single precision to double precision performance on Southern Islands, and for their consumer level graphics cards they have chosen a 4:1 mix. While the 7970 is capable of 3.79 TFLOPS of single precision compute, it can handle only 947 GFLOPS of double precision. Because of the move to a vector+scalar architecture, though, reaching higher utilization of that processing power should be easier for software developers.
The GPU has full ECC protection for DRAM and SRAM and is the first to offer support for OpenCL 1.2, DirectCompute 11.1 and C++ AMP.
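The two throughput figures follow from simple arithmetic, sketched below. The 2048 SPs and the 4:1 ratio come from the text; the 925 MHz core clock and the convention of counting one fused multiply-add as two floating point operations per SP per clock are assumptions added here for illustration.

```python
# Peak theoretical throughput of the Radeon HD 7970.
STREAM_PROCESSORS = 2048      # from the article
SP_TO_DP_RATIO = 4            # 4:1 single to double precision, from the article
CORE_CLOCK_GHZ = 0.925        # assumption: 925 MHz reference core clock
FLOPS_PER_SP_PER_CLOCK = 2    # assumption: one multiply-add counted as 2 ops


def peak_gflops(sps, clock_ghz, flops_per_clock=FLOPS_PER_SP_PER_CLOCK):
    """Peak single precision GFLOPS for a given SP count and clock."""
    return sps * clock_ghz * flops_per_clock


single_precision = peak_gflops(STREAM_PROCESSORS, CORE_CLOCK_GHZ)
double_precision = single_precision / SP_TO_DP_RATIO

if __name__ == "__main__":
    print(f"Single precision: {single_precision / 1000:.2f} TFLOPS")
    print(f"Double precision: {double_precision:.0f} GFLOPS")
```

Under those assumptions the numbers work out to roughly 3.79 TFLOPS and 947 GFLOPS, matching the figures above. These are peak rates; sustained throughput depends on how well a workload keeps the SPs busy.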
Finally, rounding out the new architecture changes is an improved media processing feature that speeds up SAD (sum of absolute differences) operations. This operation is important for video and image processing algorithms like the one used in AMD’s own SteadyVideo technology. With the Radeon HD 7970, the new QSAD option allows the GPU to handle more than 513 billion calculations per second to keep up with real time playback and adjustment of 1080p 60 FPS content. AMD claims this allows the HD 7970 to deliver a 10x performance improvement over the company’s own Phenom II X4 980 processor.
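To make the operation concrete, here is a minimal sketch of a SAD calculation in plain Python. It is only an illustration of the math, not AMD's QSAD instruction or the SteadyVideo algorithm; the `best_match` helper and the tiny 2x2 blocks are hypothetical, chosen to show the typical use of SAD in block matching, where the candidate block with the lowest score is the closest match to a reference block.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks.

    Each block is a list of rows, and each row is a list of pixel values.
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of rows")
    total = 0
    for row_a, row_b in zip(block_a, block_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        total += sum(abs(a - b) for a, b in zip(row_a, row_b))
    return total


def best_match(reference, candidates):
    """Return (index, score) of the candidate with the lowest SAD."""
    scores = [sad(reference, candidate) for candidate in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    reference = [[10, 20], [30, 40]]
    candidates = [
        [[0, 0], [0, 0]],
        [[11, 19], [30, 42]],
        [[10, 20], [30, 41]],
    ]
    print(best_match(reference, candidates))
```

A video stabilizer or encoder repeats this comparison for a very large number of blocks in every frame, which is why dedicated hardware support for the operation matters for real time 1080p content.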