A fury unlike any other…
It’s finally time to talk specifics, people – can the new AMD Fury X rival the performance of the GeForce GTX 980 Ti?
Officially unveiled by AMD during E3 last week, the brand new Radeon R9 Fury X graphics card is finally ready for our full review. Very few times has a product launch meant more to a company, and to its industry, than the Fury X does this summer. AMD has been lagging behind in the highest tiers of the graphics card market for a full generation, depending on the 2-year-old Hawaii GPU to hold its own against a continuous barrage of products from NVIDIA. The R9 290X, despite using more power, was able to keep up through the GTX 700-series days, but the release of NVIDIA's Maxwell architecture forced AMD to move the R9 200-series parts into the sub-$350 field, well below the selling prices of NVIDIA's top cards.
The AMD Fury X hopes to change that with a price tag of $650 and a host of new features and performance capabilities. It aims to once again put AMD's Radeon line in the same discussion with enthusiasts as the GeForce series.
The Fury X is built on the new AMD Fiji GPU, an evolutionary part based on AMD's GCN (Graphics Core Next) architecture. The design adds a lot of compute horsepower (4,096 stream processors), and it is also the first consumer product to integrate HBM (High Bandwidth Memory), complete with a 4096-bit memory bus!
Of course the question is: what does this mean for you, the gamer? Is it time to start making a place in your PC for the Fury X? Let's find out.
Recapping the Fiji GPU and High Bandwidth Memory
Because of AMD's trickle of pre-release disclosures leading up to the Fury X launch, we already know much about the HBM design and the Fiji GPU. HBM is a fundamental shift in how memory is produced and utilized by a GPU. From our original editorial on HBM:
The first step in understanding HBM is to understand why it’s needed in the first place. Current GPUs, including the AMD Radeon R9 290X and the NVIDIA GeForce GTX 980, utilize a memory technology known as GDDR5. This architecture has scaled well over the past several GPU generations but we are starting to enter the world of diminishing returns. Balancing memory performance and power consumption is always a tough battle; just ask ARM about it. On the desktop component side we have much larger power envelopes to work inside, but the power curve that GDDR5 is on will soon hit a wall if you plot it far enough into the future. The result will be either drastically higher-power graphics cards or stalling performance improvements in the graphics market – something we have not really seen in its history.
Historically, when technology comes to an inflection point like this, we have seen the integration of technologies onto the same piece of silicon. In 1989 we saw Intel move cache and floating point units onto the processor die; in 2003 AMD was the first to merge the north bridge memory controller into the CPU design; then graphics, the south bridge, and even voltage regulation all followed suit.
The answer for HBM is an interposer. The interposer is a piece of silicon that both the memory and processor reside on, allowing the DRAM to be in very close proximity to the GPU/CPU/APU without being on the same physical die. This close proximity allows for several very important characteristics that give HBM the advantages it has over GDDR5. First, this proximity allows for extremely wide communication bus widths. Rather than 32 bits per DRAM we are looking at 1024 bits for a stacked array of DRAM (more on that in a minute). Being closer to the GPU also means the clocks that regulate data transfer between the memory and processor can be simplified, and slowed, to save power and reduce design complexity. As a result, the proximity of the memory means that the overall memory design and architecture can improve performance per watt to an impressive degree.
So now that we know what an interposer is and how it allows the HBM solution to exist today, what does the high bandwidth memory itself bring to the table? HBM is DRAM-based but was built with low power consumption and ultra-wide bus widths in mind. The idea was to target a “wide and slow” architecture, one that scales up with high amounts of bandwidth and where latency wasn’t as big of a concern. (Interestingly, latency was improved in the design without intent.) The DRAM chips are stacked vertically, four high, with a logic die at the base. The DRAM die and logic die are connected to each other with through-silicon vias (TSVs), small holes etched in the silicon that permit die-to-die communication at incredible speeds. Allyn taught us all about TSVs back in September of 2014 after a talk at IDF, and if you are curious about how this magic happens, that story is worth reading.
The first iteration of HBM on the flagship AMD Radeon GPU will include four stacks of HBM, a total of 4GB of GPU memory. That should give us in the area of 500 GB/s of total bandwidth for the new AMD Fiji GPU; compare that to the R9 290X today at 320 GB/s and you’ll see a raw increase of around 56%. Memory power efficiency improves at an even greater rate: AMD claims that HBM will result in more than 35 GB/s of bandwidth per watt of power consumed by the memory system, while GDDR5 manages just over 10 GB/s per watt.
AMD has sold me on HBM for high-end GPUs, and I think that comes across in this story. I am excited to see what AMD has built around it and how this improves their competitive stance with NVIDIA. Don’t expect to see dramatic decreases in total power consumption with Fiji simply due to the move away from GDDR5, though every bit helps when you are trying to offer improved graphics performance per watt. How a 4GB limit on the memory system of a flagship card will pan out in 2015-2016 is still a question to be answered, but the additional bandwidth it provides offers never-before-seen flexibility to the GPU and to software developers.
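The bandwidth figures in that editorial are straightforward arithmetic: peak memory bandwidth is bus width times per-pin data rate. A quick sketch below checks the numbers — note that the editorial's 500 GB/s figure was a pre-launch estimate; the shipping 4096-bit bus at a 500 MHz DDR clock (1 Gbps per pin) lands at 512 GB/s, a 60% gain over the 290X rather than ~56%:

```python
def bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

# Fiji HBM: four 1024-bit stacks, 500 MHz DDR clock -> 1 Gbps per pin
fiji_hbm = bandwidth_gbs(4 * 1024, 1.0)
# R9 290X GDDR5: 512-bit bus, 5 Gbps effective per pin
hawaii_gddr5 = bandwidth_gbs(512, 5.0)

print(f"Fiji HBM:    {fiji_hbm:.0f} GB/s")     # 512 GB/s
print(f"290X GDDR5:  {hawaii_gddr5:.0f} GB/s")  # 320 GB/s
print(f"Improvement: {(fiji_hbm / hawaii_gddr5 - 1) * 100:.0f}%")  # 60%
```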
And from Josh's recent Fiji GPU architectural overview:
AMD leveraged HBM to feed their latest monster GPU, but there is much more to it than memory bandwidth and more stream units.
HBM does require a new memory controller compared to what was utilized with GDDR5. There are 8 new memory controllers on Fiji that interface directly with the HBM modules. These are supposedly simpler than their GDDR5 counterparts since they do not have to run at high frequencies. There is also a logic die at the base of each stacked module, and the interface needed to address those units is less exotic than GDDR5's. The changes have resulted in higher bandwidth, lower latency, and lower power consumption as compared to previous units. It also likely means a smaller amount of die space is needed for these units.
Fiji also improves upon what we first saw in Tonga. It can process as many theoretical primitives per clock (4) as Tonga, but AMD has improved the geometry engine so that the end result will be faster than what we have seen previously. It will have a per-clock advantage over Tonga, but we have yet to see how much. It shares Tonga's eight ACEs (Asynchronous Compute Engines), which are very important in DX12 applications that can leverage them. The ACE units can dispatch a large number of instructions of multiple types, further exploiting the parallel nature of the GPU in that software environment.
The chip features 4 shader engines, each with its own geometry processor (each improved from Tonga's). Each shader engine contains 16 compute units, and each CU again holds four 16-wide vector units plus a single scalar unit. AMD categorizes this as a 4,096 stream processor design. The chip has the xDMA engine for bridgeless CrossFire, the TrueAudio engine for DSP-accelerated 3D audio, and the latest VCE and UVD accelerators for video. Currently the video decode engine supports up to H.265, but does not handle VP9… yet.
In terms of stream units it has around 1.5X the count of Hawaii, so the expectation off the bat would be that the Fiji GPU will consume 1.5X the power of Hawaii. This, happily for consumers, is not the case. Tonga improved the power efficiency of the GCN architecture to a small degree, but it did not come close to matching what NVIDIA did with their Maxwell architecture. With Fiji, it seems AMD is very close to approaching Maxwell.
Fiji includes improved clock gating capabilities as compared to Tonga. This allows areas not in use to drop to a near-zero energy state. AMD also did some cross-pollination from their APU group with power flow. Voltage adaptive operation applies only the voltage that is needed to complete the work for a specific unit. My guess is that there are hundreds, if not thousands, of individual sensors throughout the die that provide data to a central controller that handles voltage operations across the chip. It also profiles workloads so that it doesn’t overvolt a particular unit more than needed to complete the work.
The chip can dispatch 64 pixels per clock. This becomes important at 4K resolutions, because all of those pixels need to be painted somehow. The chip includes 2 MB of L2 cache, double that of the previous Hawaii design. This ties back to the memory subsystem and its 4 GB of memory: a larger L2 cache is extremely important for consistently accessed data in the compute units, and it also helps tremendously in GPGPU applications.
Fiji is certainly an iteration of the previous GCN architecture. It does not add a tremendous number of features to the line, but what it does add is quite important. HBM is the big story, along with the increased power efficiency of the chip. Combined, these allow a nearly 600 sq mm chip with 4GB of HBM memory to exist at a 275 watt TDP, one that exceeds that of the NVIDIA Titan X by only around 25 watts.
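The 4,096 stream processor figure Josh quotes falls directly out of the unit hierarchy he describes; a quick sanity check (pure arithmetic, nothing vendor-specific):

```python
# Fiji's stream processor count from the hierarchy described above:
# 4 shader engines x 16 CUs each x 4 SIMD units per CU x 16 lanes per SIMD.
shader_engines = 4
cus_per_engine = 16
simds_per_cu = 4
lanes_per_simd = 16

stream_processors = shader_engines * cus_per_engine * simds_per_cu * lanes_per_simd
print(stream_processors)  # 4096
```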
Now that you are educated on the primary changes brought forth by the Fiji architecture itself, let's look at the Fury X implementation.
AMD Radeon R9 Fury X Specifications
AMD has already announced that the flagship Radeon R9 Fury X is going to have some siblings in the not-too-distant future. That includes the R9 Fury (non-X) that partners will sell with air cooling as well as a dual-GPU variant that will surely be called the AMD Fury X2. But for today, the Fury X stands alone and has a very specific target market.
| Specification | R9 Fury X | GTX 980 Ti | TITAN X | GTX 980 | TITAN Black | R9 290X |
|---|---|---|---|---|---|---|
| Rated Clock | 1050 MHz | 1000 MHz | 1000 MHz | 1126 MHz | 889 MHz | 1000 MHz |
| Memory Clock | 500 MHz | 7000 MHz | 7000 MHz | 7000 MHz | 7000 MHz | 5000 MHz |
| Memory Interface | 4096-bit (HBM) | 384-bit | 384-bit | 256-bit | 384-bit | 512-bit |
| Memory Bandwidth | 512 GB/s | 336 GB/s | 336 GB/s | 224 GB/s | 336 GB/s | 320 GB/s |
| TDP | 275 watts | 250 watts | 250 watts | 165 watts | 250 watts | 290 watts |
| Peak Compute | 8.60 TFLOPS | 5.63 TFLOPS | 6.14 TFLOPS | 4.61 TFLOPS | 5.1 TFLOPS | 5.63 TFLOPS |
The most impressive specification that comes our way is the stream processor count, sitting at 4,096 for the Fury X, an increase of 45% when compared to the Hawaii GPU used in the R9 290X. Clock speeds didn't decrease to reach this implementation either, which means that gaming performance has the chance to be substantially improved with Fiji. Peak compute capability jumps from 5.63 TFLOPS on Hawaii to an amazing 8.6 TFLOPS with Fiji, easily outpacing even the NVIDIA GeForce GTX Titan X, rated at 6.14 TFLOPS.
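Those peak compute numbers come from a standard formula: stream processors × 2 FLOPs per clock (a fused multiply-add) × clock speed. A quick sketch reproduces the table's figures — the Hawaii (2,816) and GM200 (3,072) core counts are not in the table above but are public specs:

```python
def peak_tflops(stream_processors, clock_mhz):
    """Peak FP32 throughput: 2 FLOPs (fused multiply-add) per lane per clock."""
    return stream_processors * 2 * clock_mhz * 1e6 / 1e12

print(f"Fury X:  {peak_tflops(4096, 1050):.2f} TFLOPS")  # 8.60
print(f"R9 290X: {peak_tflops(2816, 1000):.2f} TFLOPS")  # 5.63
print(f"Titan X: {peak_tflops(3072, 1000):.2f} TFLOPS")  # 6.14
# Stream processor increase of Fiji over Hawaii:
print(f"{(4096 / 2816 - 1) * 100:.0f}%")  # 45%
```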
Texture units also increased by the same 45%, but there is a question around the ROP count. With only 64 render back ends present on Fiji, the same count as the Hawaii XT GPU used on the R9 290X, the GPU's capability for final blending might be a concern. It's possible that AMD felt the ROP throughput of Hawaii was overkill for the pixel processing capability it was paired with, and thus that the proper balance was found by keeping the 64 ROP count on Fiji. I think we'll find some answers in our benchmarking and testing going forward.
With 4GB on board, a limitation of the current generation of HBM, the AMD Fury X stands against the GTX 980 Ti with 6GB and the Titan X with 12GB. Heck, even the new Radeon R9 390X and 390 ship with 8GB of memory. That presents another potential problem for AMD's Fiji GPU: will the memory bandwidth and driver improvements be enough to counter the smaller frame buffer of the Fury X compared to its competitors? AMD is well aware of this but believes that a combination of the faster memory interface and "tuning every game" will ensure that the 4GB memory limit does not become a bottleneck. AMD noted that the GPU driver is responsible for memory allocation, and that technologies like memory compression and caching can drastically reduce memory footprints.
While I agree that the HBM implementation should help things, I don't think it's automatic; extra bandwidth and lower latency don't by themselves make up for a smaller frame buffer. And while tuning for each game will definitely be important, that puts a lot of pressure on AMD's driver and developer relations teams to get things right on day one of every game's release.
At 512 GB/s, the AMD Fury X exceeds the available memory bandwidth of the GTX 980 Ti by 52%, even with a rated memory clock of just 500 MHz. That added memory performance should allow AMD to be more flexible with memory allocation, but drivers will definitely have to be Fiji-aware to change how data is brought into the system.
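The 52% figure is the "wide and slow" versus "narrow and fast" trade-off in action — both buses are rated from width times per-pin data rate, and HBM's enormous width more than covers its low clock:

```python
# Both buses rated from width (bits) x per-pin data rate (Gbps) / 8 bits per byte.
fury_x_gbs   = 4096 * 1.0 / 8  # 4096-bit HBM at 500 MHz DDR (1 Gbps/pin) -> 512 GB/s
gtx980ti_gbs = 384 * 7.0 / 8   # 384-bit GDDR5 at 7 Gbps/pin -> 336 GB/s

print(f"{(fury_x_gbs / gtx980ti_gbs - 1) * 100:.0f}% advantage for Fury X")  # 52%
```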
Fury X's TDP of just 275 watts, 15 watts lower than the Radeon R9 290X, says a lot for the improvement in efficiency that Fiji offers over Hawaii. However, the GTX 980 Ti still runs at a lower 250 watts; I'll be curious to see how this is reflected in our power testing later.
Just as we have seen with NVIDIA's Maxwell design, the 28nm process is being stretched to its limits with Fiji. A chip with 8.9 billion transistors is no small feat, running past the GM200 by nearly a billion (and even that was astonishing when it launched).