Fiji: A Big and Necessary Jump
AMD has released Fiji to power the Fury series of cards
Fiji has been one of the worst kept secrets in a while. The chip has been talked about, written about, and rumored about seemingly for ages. The chip has promised to take on NVIDIA at the high end by bringing about multiple design decisions that are aimed to give it a tremendous leap in performance and efficiency as compared to previous GCN architectures. NVIDIA released their Maxwell based products last year and added to that this year with the Titan X and the GTX 980 Ti. These are the parts that Fiji is aimed to compete with.
The first product that Fiji will power is the R9 Fury X with integrated water cooling.
AMD has not been standing still, but their R&D budgets have been taking a hit as of late. The workforce has also been pared down to the bare minimum (or so I hope) while still being able to design, market, and sell products to the industry. This has affected their ability to produce as large a quantity of new chips as NVIDIA has in the past year. Cut-backs are likely not the entirety of the story, but they have certainly affected it.
The plan at AMD seems to be to focus on very important products and technologies, and then migrate those technologies to new products and lines when it makes the most sense. Last year we saw the introduction of “Tonga”, which was the first major redesign after the release of the GCN 1.1 based Hawaii that powers the R9 290 and R9 390 series. Tonga delivered double the tessellation performance of Hawaii, improved overall architecture efficiency, and allowed AMD to replace the older Tahiti and Pitcairn chips with an updated unit that featured xDMA and TrueAudio support. Tonga was a necessary building block that allowed AMD to produce a chip like Fiji.
Building a Monster
There are many subtle changes throughout Fiji when compared to older GCN based architectures, but the biggest leap in technology is obviously High Bandwidth Memory. AMD started work on this technology around 7 years ago with the initial planning stages. The project gained a lot of steam around 5 years ago, and products outside of the GPU realm have already integrated and productized parts of this technology (namely the interposer). Fiji is probably the first extremely large, high performance part that utilizes HBM.
Ryan and I have gone over HBM pretty thoroughly. The long and short of it is that HBM is a new memory interface that brings memory closer to the chip, widens the bus, and lowers the clock speeds while increasing overall memory bandwidth and improving latency. Sounds like the next best thing to integrating memory onto the die of a GPU, doesn't it?
We are waiting for a deep dive on the interposer technology, but for now enjoy this purloined shot of the microbumps that AMD is implementing on Fiji.
This magic happens because of a silicon interposer and microbumps that allow thousands of data, power, and ground lines to be routed effectively. Memory chips stacked on top of logic chips, all interconnected by through-silicon vias, allow extremely wide, efficient, and fast communication with the primary GPU/controller. Fiji features a 4096-bit memory bus to 4 stacked memory modules that run at a conservative 500 MHz. The ultra-wide interface plus lower clock speed makes for an extremely power efficient setup as compared to a traditional GDDR5 interface. At 500 MHz the memory chips do not produce a lot of heat at modern process nodes, and they do not require much power compared to GDDR5 either. AMD claims about 4X the bandwidth per watt of GDDR5. Seeing their implementation, I am not arguing with that number.
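To put the 4096-bit, 500 MHz figures in perspective, here is a rough back-of-the-envelope sketch of peak bandwidth. It assumes the memory transfers data on both clock edges (double data rate) and that the bus is split evenly across the 4 stacks. Neither assumption is spelled out above, and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 is an assumption: double data rate signalling.
    """
    bits_per_second = bus_width_bits * clock_mhz * 1e6 * transfers_per_clock
    return bits_per_second / 8 / 1e9


# Figures quoted in the article: 4096-bit bus, 4 stacks, 500 MHz.
fiji_total = peak_bandwidth_gb_s(4096, 500)

# Assumption: the bus is divided evenly, so each stack gets 1024 bits.
fiji_per_stack = peak_bandwidth_gb_s(4096 // 4, 500)

print(f"Total:     {fiji_total:.0f} GB/s")
print(f"Per stack: {fiji_per_stack:.0f} GB/s")
```

Under those assumptions the math works out to 512 GB/s in total, or 128 GB/s per stack. This is a theoretical peak, not a measured number.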
The silicon interposer is fabricated much like modern ASICs, except the features do not need to be anywhere near as dense and fine. I believe that the current interposers used by AMD on Fiji are fabricated on an older 65 nm process, but the minimum feature size is around 100 micrometers. These are still very fine lines to use for interconnects and pathways, but it is nowhere near as complex as a modern CPU or GPU. Small defects on such a process and implementation will be very unlikely to cause an interposer to be defective, so yields in theory for the interposer should approach 100%.
The stacked memory and interposer also have a very positive effect on overall board area. Since they are all so closely linked together, and the memory chips are stacked, we have a relatively small footprint for the GPU/memory group. There are significant area savings from HBM, and that results in lower PCB complexity. The entire substrate still needs to be powered and still needs plenty of connections to the PCI-E bus, but the overall PCB design will be smaller and less complex than a GDDR5 implementation.
HBM is a fantastic technology to tackle bandwidth, power, and latency in a modern ASIC.
If there is one potential downside to this first generation of HBM, it is that it is limited to 4 GB in this particular implementation. Some may be disappointed in this as compared to the latest R9 390 series which features 8 GB, or the GTX 980 Ti with 6 GB, or the monstrous Titan X with 12 GB of memory. There may be no need for panic here. Joe Macri talked briefly about what AMD is doing to address this particular issue. Joe mentioned that in previous generations memory space was not a big deal, as when they hit some limits they simply doubled up the memory. With GDDR5 this was fairly easy to implement, going from 1 GB to 2 to 3 to 4 and above. HBM 1.0 does not have this flexibility, so AMD had to do some engineering to get around this issue. Joe did not go into details about what they did, but I can take a few guesses as to how they addressed this. I think the key is most likely a combination of leveraging the larger L2 cache of Fiji and an aggressive pre-fetch from main memory. Throw in a fast retire mechanism for stale data and we have more potential space and “just in time” data to keep the GPU from stalling. Joe said a lot of the low hanging fruit was never picked in the past due to the abundance of available high density memory.
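The 4 GB ceiling falls out of simple multiplication. The sketch below assumes first generation HBM stacks are four dies high with 2 Gb per die. Those two figures are my understanding of HBM 1.0 rather than anything stated above, so treat them as assumptions, and the function name is illustrative.

```python
def hbm_capacity_gb(stacks, dies_per_stack, die_density_gbit):
    """Total memory capacity in GB from stack count, stack height, and die density."""
    total_gbit = stacks * dies_per_stack * die_density_gbit
    return total_gbit / 8  # 8 bits per byte


# 4 stacks is from the article; 4 dies per stack and 2 Gb per die are assumptions.
fiji_capacity = hbm_capacity_gb(stacks=4, dies_per_stack=4, die_density_gbit=2)
print(f"Fiji capacity: {fiji_capacity:.0f} GB")
```

With the stack count fixed by the 4096-bit bus, the only ways to add capacity are taller stacks or denser dies, which is why AMD could not simply double up the memory as it did with GDDR5.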