The Fiji GPU

AMD leveraged HBM to feed their latest monster GPU, but there is much more to it than memory bandwidth and more stream units.

HBM does require a new memory controller compared to what was used with GDDR5.  There are 8 new memory controllers on Fiji that interface directly with the HBM modules.  These are supposedly simpler than what we have seen with GDDR5 because they do not have to work at high frequencies.  There are also the logic chips at the base of the stacked modules, and the interface needed to address those units is less exotic than with GDDR5.  The changes have resulted in higher bandwidth, lower latency, and lower power consumption compared to previous units.  They also likely mean that less die space is needed for these units.
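To see why a wide, slow interface can still out-deliver a narrow, fast one, here is a rough back-of-the-envelope sketch.  The figures in it are not from this page and are assumptions for illustration: four first-generation HBM stacks at 1024 bits each and roughly 1 Gbit/s per pin, against a 512 bit GDDR5 bus at roughly 5 Gbit/s per pin.  The function name is made up for this example.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak theoretical bandwidth in GB/s: pins * rate per pin / 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8


# Assumed figures (not stated in the article):
# HBM: 4 stacks x 1024 bits, about 1 Gbit/s per pin (wide and slow).
hbm_gb_s = peak_bandwidth_gb_s(4 * 1024, 1.0)

# GDDR5: 512 bit bus, about 5 Gbit/s per pin (narrow and fast).
gddr5_gb_s = peak_bandwidth_gb_s(512, 5.0)

print(f"HBM:   {hbm_gb_s:.0f} GB/s")    # 512 GB/s
print(f"GDDR5: {gddr5_gb_s:.0f} GB/s")  # 320 GB/s
```

Under those assumptions the per-pin rate drops by a factor of five while the total bandwidth still goes up, which fits the point above about the controllers not having to work at high frequencies.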

The Fiji GPU features an astounding 4096 stream units.  The HBM memory controller apparently takes up less space than a traditional 320 bit GDDR5 unit, but even allowing for that saving, this is still an impressive array of computational units.

Fiji also improves upon what we first saw in Tonga.  It can do as many theoretical primitives per clock (4) as Tonga, but AMD has improved the geometry engine so that the end result will be faster than what we have seen previously.  It will have a per clock advantage over Tonga, but we have yet to see how much.  It shares the 8 wide ACE (Asynchronous Compute Engine) arrangement, which is very important in DX12 applications that can leverage it.  The ACE units can dispatch a large number of instructions of multiple types and further leverage the parallelism of a GPU in that software environment.

The chip features 4 shader engines, each with its own geometry processor (each processor improved from Tonga).  Each shader engine features 16 compute units.  Each CU again holds 4 x 16 vector units plus a single scalar unit.  AMD categorizes this as a 4096 stream unit processor.  The chip has the xDMA engine for bridgeless CrossFire, the TrueAudio engine for DSP accelerated 3D audio, and the latest VCE and UVD accelerators for video.  Currently the video decode engine supports up to H.265, but does not handle VP9… yet.
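The 4096 figure falls straight out of the layout described above.  A minimal sketch of the arithmetic, using only the counts given in the text (the constant and function names are made up for this example):

```python
SHADER_ENGINES = 4        # shader engines on the chip
CUS_PER_ENGINE = 16       # compute units per shader engine
VECTOR_UNITS_PER_CU = 4   # the "4" in "4 x 16"
LANES_PER_VECTOR = 16     # the "16" in "4 x 16"


def compute_units() -> int:
    """Total compute units across all shader engines."""
    return SHADER_ENGINES * CUS_PER_ENGINE


def stream_units() -> int:
    """Total vector lanes, which is what AMD counts as stream units."""
    return compute_units() * VECTOR_UNITS_PER_CU * LANES_PER_VECTOR


print(compute_units())  # 64
print(stream_units())   # 4096
```

The scalar unit in each CU is not part of the count, so 64 compute units at 64 lanes each gives the advertised total.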

Fiji has around 1.5X the stream units of Hawaii.  The expectation off the bat would be that the Fiji GPU will consume 1.5X the power of Hawaii.  This, happily for consumers, is not the case.  Tonga improved on the power efficiency of the GCN architecture to a small degree, but it did not come close to matching what NVIDIA did with their Maxwell architecture.  With Fiji it seems like AMD is very close to Maxwell.
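For a sense of how round the "1.5X" figure is, here is a quick sketch of the ratio.  Hawaii's count of 2816 stream units is not given on this page and is brought in as an assumption; the variable names are made up for this example.

```python
FIJI_STREAM_UNITS = 4096    # from the article
HAWAII_STREAM_UNITS = 2816  # assumption: not stated in the article

# How many times more stream units Fiji has than Hawaii.
ratio = FIJI_STREAM_UNITS / HAWAII_STREAM_UNITS

print(f"Fiji has {ratio:.2f}x the stream units of Hawaii")  # 1.45x
```

If power scaled linearly with stream units at the same clocks and voltage, that ratio is roughly what the naive estimate of Fiji's power draw relative to Hawaii would be.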

The overall substrate is very large compared to what we have come to expect, but it is far, far smaller than the usual PCB area that is taken up by the chip, substrate, and memory chips.

Fiji includes improved clock gating capabilities compared to Tonga.  This allows areas not in use to go to a near zero energy state.  AMD also did some cross-pollination from their APU group on power flow.  Voltage adaptive operation applies only the voltage that is needed to complete the work for a specific unit.  My guess is that there are hundreds, if not thousands, of individual sensors throughout the die that provide data to a central controller that handles voltage operations across the chip.  It also figures out workloads so that it doesn’t apply more voltage to a particular unit than is needed to complete the work.

The chip can dispatch 64 pixels per clock.  This becomes important at 4K resolutions because those pixels need to be painted somehow.  The chip includes 2 MB of L2 cache, which is double that of the previous Hawaii.  This goes back to the memory subsystem and 4 GB of memory.  A larger L2 cache is extremely important for data that the compute units access constantly.  It also helps tremendously in GPGPU applications.
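To put 64 pixels per clock in context at 4K, here is a small sketch of the minimum number of clocks needed to touch every pixel of one frame exactly once.  It assumes a 3840 x 2160 frame and ignores overdraw, blending, and multiple render passes, so it is a lower bound rather than a performance estimate; the function name is made up for this example.

```python
def min_clocks_per_frame(width: int, height: int, pixels_per_clock: int) -> int:
    """Lower bound on clocks to write each pixel once (ceiling division)."""
    total_pixels = width * height
    return (total_pixels + pixels_per_clock - 1) // pixels_per_clock


# Assumption: "4K" here means 3840 x 2160.
clocks_4k = min_clocks_per_frame(3840, 2160, 64)
clocks_1080p = min_clocks_per_frame(1920, 1080, 64)

print(clocks_4k)     # 129600
print(clocks_1080p)  # 32400
```

A 4K frame has four times the pixels of 1080p, so the same pixel rate buys a quarter as many full-screen passes per second.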

Fiji is certainly an iteration of the previous GCN architecture.  It does not add a tremendous number of features to the line, but what it does add is quite important.  HBM is the big story, as well as the increased power efficiency of the chip.  Combined, these allow a nearly 600 sq mm chip with 4GB of HBM memory to exist at a 275 watt TDP that exceeds that of the NVIDIA Titan X by around 25 watts.

The AMD Fury product might be a fine example of the Gestalt Theory.  The individual components might not be as impressive to the eye, but when put together they make something greater.  There is a lot of unique and first run technology going into Fiji and the Fury X product, and it seems that it is at the very least competitive with the latest from NVIDIA.  AMD took some serious risks with implementing HBM memory in a consumer grade product, but the technology seems mature enough to work at the high end of the price spectrum for graphics.  We do not have any word about the yields of the interposer + memory + GPU integration, but obviously they are good enough to introduce a series of cards.

Even with a high end power delivery system, HBM allows manufacturers to condense the entire board into a 7.5" PCB.

For this year AMD and NVIDIA are stuck at 28 nm process technology for their GPUs, but this will change in 2016.  Right now AMD has a distinct advantage in having not only developed the HBM technology, but also implemented it and worked out the bugs first.  Their next generation parts in 2016 will utilize HBM as well as 14/16 nm process technology.  It could very well be that AMD will have a leg up on this technology before NVIDIA releases Pascal, which will utilize HBM 2.0.  Fiji is a very interesting part that will be integrated into four distinct products across the price spectrum.  The dual Fiji implementation will likely be the fastest single video card around for some time once it is introduced in the fall.  This may not be the home run that AMD was hoping for, but it certainly is a compelling product and a solid foundation for cards in the future.
