AMD Radeon HD 6970
Ryan covered the initial release of the HD 6970, but I thought I would quickly go over the basics. AMD and NVIDIA had some of their long-term plans dashed by significant changes in the foundry business. TSMC and GLOBALFOUNDRIES expected to release their next-generation process nodes far sooner than they were able to deliver. TSMC was originally going to offer a 32 nm bulk process that the graphics folks would use. It then changed plans and decided to go another half-node shrink beyond that, offering a 28 nm HKMG/bulk solution instead. This forced both AMD and NVIDIA to drastically change their roadmaps to accommodate what is essentially a two-year delay in getting a next-generation process node to follow the original 40 nm products.
The bundle is nothing special; it doesn’t even have a monitor cleaner or a koosh-ball!
To make up for this gap, both companies did a refresh of their products on the same 40 nm node that the previous HD 5000 and GTX 400 series were based on. In NVIDIA’s case, they were able to optimize their designs for TSMC’s 40 nm process much more effectively, which gave them greater yields and more favorable power and heat characteristics. These chips make up the GTX 500 series of parts. AMD took it a step further and produced a very radical redesign at their top end with the Cayman chip.
The original 5000 series of products was based on the VLIW (Very Long Instruction Word) 5 design that had been introduced with the HD 2900 XT. Each AMD stream processor in this design comprises four ALUs and one ALU/SFU (special function unit). This concept was applied successfully to the 3000, 4000, and finally 5000 series. It also made an appearance with the “Barts” chip that powers the HD 6800 boards, as well as the lower end HD 6000 parts that were released a few months back. Cayman is unique in that it sports a new architecture based on a VLIW 4 design, which gives up some of the overall stream unit count in order to fit more SIMD units. The HD 5870 has 20 SIMD units, while the HD 6970 features 24 of them. This design change was implemented to put the rubber more effectively to the road, so to speak. In designs like the HD 5870, it was found that a lot of the ALUs sat idle even under heavy load in DirectX 10 and 11 applications. By increasing the number of SIMDs while keeping the overall ALU count in the same region, AMD was able to get better overall thread throughput and make the data flow much more efficient. A larger portion of the ALUs is kept busy, so overall efficiency and performance increase.
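As a rough illustration of the utilization argument above, the toy model below treats each stream processor as a bundle of `width` slots and assumes a shader supplies a fixed average number of independent operations per bundle. The 3.4 figure, the function names, and the model itself are assumptions made for illustration, not measurements from this review; the ALU totals are the 1600 and 1536 quoted in this article for the two chips.

```python
# Toy model of VLIW slot utilization. All names here are hypothetical.

def slot_utilization(avg_independent_ops: float, width: int) -> float:
    """Fraction of VLIW slots doing useful work per issued bundle."""
    # A bundle can never use more slots than it has, so clamp at `width`.
    return min(avg_independent_ops, width) / width


def busy_alus(total_alus: int, avg_independent_ops: float, width: int) -> float:
    """Average number of ALUs kept busy under the toy model."""
    return total_alus * slot_utilization(avg_independent_ops, width)


# ASSUMPTION: illustrative average of independent ops per bundle.
AVG_OPS = 3.4

cypress = busy_alus(1600, AVG_OPS, width=5)  # HD 5870, VLIW 5
cayman = busy_alus(1536, AVG_OPS, width=4)   # HD 6970, VLIW 4

print(f"VLIW 5 utilization: {slot_utilization(AVG_OPS, 5):.0%}")
print(f"VLIW 4 utilization: {slot_utilization(AVG_OPS, 4):.0%}")
print(f"Busy ALUs, Cayman vs. Cypress: {cayman / cypress:.2f}x")
```

Under these assumptions, slot utilization rises from about 68% to about 85%, so the narrower design keeps roughly 20% more ALUs busy despite the lower total count. The model ignores clocks, memory, and scheduling, so it only shows the direction of the effect.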
With 24 SIMD units, the texture unit count has increased as well. The HD 5870, with its 20 SIMDs, had a total of 80 texture sampling units; the HD 6970, with 24 SIMDs, features a grand total of 96. Combined with the faster core clock speed, this is a nice little texturing boost for the architecture. The ROPs stay at 32, but these units are upgraded from the previous generation and are now two to four times faster, depending on the operation: 16-bit integer ops are 2x faster, 32-bit floating point ops are 2 to 4 times faster, and the ROPs can now coalesce write operations.
We finally get to see some new AA technology with the 6900, and I could not be happier. Some years back NVIDIA introduced their Coverage Sample AA method, which improved overall quality without severely impacting performance. AMD now has their own version of this with their EQAA settings. Both essentially do the same thing: they increase the number of coverage samples while allowing the chip to keep the z/color/stencil samples at a lower level. While not as good as increasing the z/color/stencil samples, it is an inexpensive way to improve AA quality without a large additional performance hit.
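The ALU and texture unit totals follow directly from the SIMD counts. The sketch below assumes 16 lanes and 4 texture units per SIMD, values back-calculated from the totals quoted in this article (80 and 96 texture units, 1600 and 1536 ALUs) rather than taken from AMD documentation; the `Gpu` class and the constant names are hypothetical.

```python
# Derive per-chip unit totals from the SIMD count.
from dataclasses import dataclass

# ASSUMPTION: back-calculated from 1600 / (20 * 5) and 1536 / (24 * 4).
LANES_PER_SIMD = 16
# ASSUMPTION: back-calculated from 80 / 20 and 96 / 24.
TEXTURE_UNITS_PER_SIMD = 4


@dataclass(frozen=True)
class Gpu:
    name: str
    simds: int
    vliw_width: int  # ALUs per stream processor (5 for VLIW 5, 4 for VLIW 4)

    @property
    def alus(self) -> int:
        return self.simds * LANES_PER_SIMD * self.vliw_width

    @property
    def texture_units(self) -> int:
        return self.simds * TEXTURE_UNITS_PER_SIMD


hd5870 = Gpu("HD 5870", simds=20, vliw_width=5)
hd6970 = Gpu("HD 6970", simds=24, vliw_width=4)

for gpu in (hd5870, hd6970):
    print(f"{gpu.name}: {gpu.alus} ALUs, {gpu.texture_units} texture units")
```

This makes the trade explicit: four extra SIMDs add 16 texture units, while the narrower VLIW width is what brings the ALU total down slightly.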
The basic specifications of the R6970 from MSI.
AMD also increased the primitive setup and geometry throughput. While AMD doubled it over the HD 5000 series, it is still half that of NVIDIA’s Fermi architecture. AMD has moved from the seemingly timeless 1 triangle or primitive per clock to 2 per clock, while NVIDIA sits at 4 per clock. Tessellation performance has also been doubled from AMD’s previous generation of products, but it again falls well short of the tessellation power that NVIDIA has with their latest generation. So far this has not been a big issue for AMD, except in some synthetic benchmarks and a few handpicked TWIMTBP titles (such as H.A.W.X. 2). In most other applications where tessellation is supported, AMD competes very well.
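To put the per-clock figures in context, peak setup rate is simply primitives per clock multiplied by core clock. The sketch below uses the per-clock values from the paragraph above; the 880 MHz clock is an assumption that is not stated in this article, and the function and dictionary names are hypothetical.

```python
# Peak primitive setup rate = primitives per clock * core clock.

# Per-clock figures as quoted in the text.
PRIMS_PER_CLOCK = {
    "HD 5000 series": 1,
    "HD 6970": 2,
    "Fermi": 4,
}


def peak_prims_per_second(prims_per_clock: int, core_clock_mhz: float) -> float:
    """Theoretical peak primitives set up per second."""
    return prims_per_clock * core_clock_mhz * 1e6


def per_clock_ratio(a: str, b: str) -> float:
    """How many times more primitives per clock `a` handles than `b`."""
    return PRIMS_PER_CLOCK[a] / PRIMS_PER_CLOCK[b]


# ASSUMPTION: core clock in MHz, not taken from this article.
ASSUMED_CLOCK_MHZ = 880.0

print(per_clock_ratio("HD 6970", "HD 5000 series"))
print(per_clock_ratio("HD 6970", "Fermi"))
rate = peak_prims_per_second(PRIMS_PER_CLOCK["HD 6970"], ASSUMED_CLOCK_MHZ)
print(f"{rate / 1e9:.2f} billion primitives per second")
```

Because the two vendors run their cores at different clocks, the per-clock ratio is only part of the picture; the function takes the clock as a parameter so other values can be tried.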
A lot of tweaking went on throughout the GPU to improve its per-clock performance over the older HD 5870. Even though it has slightly fewer overall ALUs (1536 vs. 1600), it has four more SIMD units, a much improved front end, and the improved ROP units. Throw in a slightly faster clock speed and faster memory, and the HD 6970 is around 20% to 30% faster overall than the older HD 5870. Not a bad increase considering AMD was forced to utilize the same process node as the older part. The downside is that it is 500 million transistors bigger than the older Cypress chip, and it is AMD’s largest GPU in terms of die size since the original R600, which powered the HD 2900 XT.