Fiji: A Big and Necessary Jump
AMD has released Fiji to power the Fury series of cards
Fiji has been one of the worst kept secrets in a while. The chip has been talked about, written about, and rumored about seemingly for ages. It promises to take on NVIDIA at the high end through multiple design decisions aimed at giving it a tremendous leap in performance and efficiency over previous GCN architectures. NVIDIA released their Maxwell based products last year and added to them this year with the Titan X and the GTX 980 Ti. These are the parts that Fiji is designed to compete with.
The first product that Fiji will power is the R9 Fury X with integrated water cooling.
AMD has not been standing still, but their R&D budgets have been taking a hit as of late. The workforce has also been pared down to the bare minimum (or so I hope) while still being able to design, market, and sell products to the industry. This has affected their ability to produce as large a quantity of new chips as NVIDIA has in the past year. Cut-backs are likely not the entirety of the story, but they have certainly affected it.
The plan at AMD seems to be to focus on very important products and technologies, and then migrate those technologies to new products and lines when it makes the most sense. Last year we saw the introduction of “Tonga”, the first major redesign after the release of the GCN 1.1 based Hawaii that powers the R9 290 and R9 390 series. Tonga delivered double the tessellation performance of Hawaii, improved overall architectural efficiency, and allowed AMD to replace the older Tahiti and Pitcairn chips with an updated unit featuring xDMA and TrueAudio support. Tonga was a necessary building block that allowed AMD to produce a chip like Fiji.
Building a Monster
There are many subtle changes throughout Fiji when compared to older GCN based architectures, but the biggest leap in technology is obviously High Bandwidth Memory. AMD began the initial planning stages for this technology around 7 years ago, and the project started to gain real steam around 5 years ago. Products outside of the GPU realm have already integrated parts of this technology (namely the interposer) and productized it. Fiji is probably the first extremely large, high performance part that utilizes HBM.
Ryan and I have gone over HBM pretty thoroughly. The long and short of it is that HBM is a new memory interface that brings memory closer to the chip, widens the bus, and lowers clock speeds while increasing overall memory bandwidth and improving latency. It sounds like the next best thing to integrating memory directly onto the die of a GPU.
We are waiting for a deep dive on the interposer technology, but for now enjoy this purloined shot of the microbumps that AMD is implementing on Fiji.
This magic happens because of a silicon interposer and microbumps that allow thousands of data, power, and ground lines to be routed effectively. Memory chips stacked on top of a logic chip, all interconnected by through-silicon vias, allow extremely wide, efficient, and fast communication with the primary GPU/controller. Fiji features a 4096-bit memory bus to 4 stacked memory modules that run at a conservative 500 MHz. The ultra-wide interface plus the lower clock speed makes for an extremely power efficient setup compared to a traditional GDDR5 interface. At 500 MHz the memory chips do not produce a lot of heat on modern process nodes, and they do not require much power, again as compared to GDDR5 interfaces. AMD claims about 4X the bandwidth per watt of GDDR5, and having seen their implementation, I am not arguing with that number.
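To put the 4096-bit bus and 500 MHz clock in perspective, here is a quick back-of-the-envelope calculation. It assumes standard double data rate signaling for HBM (two transfers per clock) and uses the 512-bit, 5 Gbps GDDR5 configuration of the R9 290X as the comparison point; the numbers are illustrative rather than official.

```python
# Rough theoretical peak bandwidth: bus width (bits) * per-pin data rate (Gbps) / 8
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps_per_pin):
    """Return peak bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps_per_pin / 8

# HBM on Fiji: 4096-bit bus, 500 MHz clock, double data rate -> 1 Gbps per pin
hbm_gbs = peak_bandwidth_gbs(4096, 0.5 * 2)     # 512 GB/s

# GDDR5 on an R9 290X for comparison: 512-bit bus at 5 Gbps effective per pin
gddr5_gbs = peak_bandwidth_gbs(512, 5.0)        # 320 GB/s

print(f"HBM (Fiji):   {hbm_gbs:.0f} GB/s")
print(f"GDDR5 (290X): {gddr5_gbs:.0f} GB/s")
```

The advantage comes entirely from width; the per-pin rate is a fifth of GDDR5’s, which is exactly where the power and heat savings come from.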
The silicon interposer is fabricated much like modern ASICs, except the features do not need to be anywhere near as dense and fine. I believe that the current interposers used by AMD on Fiji are fabricated on an older 65 nm process, but the minimum feature size is around 100 micrometers. These are still very fine lines to use for interconnects and pathways, but it is nowhere near as complex as a modern CPU or GPU. Small defects on such a process and implementation will be very unlikely to cause an interposer to be defective, so yields in theory for the interposer should approach 100%.
The stacked memory and interposer also have a very positive effect on overall board area. Since they are all so closely linked together, and the memory chips are stacked, we have a relatively small footprint for the GPU/memory group. There are significant area savings from HBM, and that results in lower PCB complexity. The entire substrate still needs to be powered with plenty of connections to the PCI-E bus, but the overall PCB design will be smaller and less complex than a GDDR5 implementation.
HBM is a fantastic technology to tackle bandwidth, power, and latency in a modern ASIC.
If there is one potential downside to this first generation of HBM, it is that it is limited to 4GB in this particular implementation. Some may be disappointed in this as compared to the latest R9 390 series which features 8 GB, or the GTX 980 Ti with 6 GB, or the monstrous Titan X with 12 GB of memory. There may not be a need for panic here. Joe Macri talked quickly about what AMD is doing to address this particular issue. Joe mentioned that in previous generations memory space was not a big deal; when they hit limits they simply doubled up the memory. With GDDR5 this was fairly easy to implement, going from 1GB to 2 to 3 to 4 and above. HBM 1.0 does not have this flexibility, so AMD had to do some engineering to get around this issue. Joe did not go into details about what they did, but I can take a few guesses as to how they addressed it. I think the key is most likely a combination of leveraging the larger L2 cache of Fiji combined with an aggressive pre-fetch from main memory. Throw in a fast retire mechanism for stale data and we have more potential space and “just in time” data to keep the GPU from stalling. Joe said a lot of the low hanging fruit was never picked due to the abundance of available high density memory.
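Purely as an illustration of the guess above, and not anything AMD has confirmed, the kind of residency management a driver could do looks roughly like the sketch below: prefetch assets expected soon and aggressively retire the least recently used ones. The class and method names are hypothetical.

```python
from collections import OrderedDict

class ResidencyManager:
    """Speculative sketch: keep hot assets in the 4 GB of HBM, prefetch what is
    likely needed next, and quickly retire stale data to make room."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = OrderedDict()   # asset_id -> size, ordered by last use

    def touch(self, asset_id, size):
        """Mark an asset as used this frame, loading it into HBM if necessary."""
        if asset_id in self.resident:
            self.resident.move_to_end(asset_id)
            return
        self._make_room(size)
        self.resident[asset_id] = size
        self.used += size

    def prefetch(self, predicted_assets):
        """Aggressively pull in assets predicted to be needed in upcoming frames."""
        for asset_id, size in predicted_assets:
            self.touch(asset_id, size)

    def _make_room(self, size):
        """Fast retire: evict least recently used assets until the new one fits."""
        while self.used + size > self.capacity and self.resident:
            _, evicted_size = self.resident.popitem(last=False)
            self.used -= evicted_size
```

Whether AMD does anything resembling this in the driver, firmware, or hardware is unknown; the point is only that 4 GB can go further if data is moved in and out intelligently.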
Considering that people were thinking until yesterday that this was a GPU that would need a 375W TDP and melt without the help of a watercooler, I would say it was a pretty well kept secret. We were expecting a big Tonga, not a big Tonga with Maxwell efficiency.
What a monster!
#RIPMAXWELL
#LMFAO
My personal take, judging from AMD’s usual language regarding these things, is that the Fury X is going to be roughly 10% faster than the GTX 780Ti. I’m hoping for more, but not expecting it.
lol if i do get this card trust me im going to try and push it to use 375W, i want more power! if it's this efficient and unlocked oh man im gonna crank sooo much juice into that baby
Fury X Benchmarks
http://videocardz.com/56711/amd-radeon-r9-fury-x-official-benchmarks-leaked
In my opinion leaked benchmarks are pretty much worthless. I’ll wait for the testing results of PCPer and other sites. 🙂
Indeed… AMD marketing last time showed a benchmark for its Bulldozer FX-8150 CPU beating the hell out of a Nehalem Core i7, only for users and reviewers to find out the benchmark result can't be reached.
Never trust benchmarks.
very exciting stuff. looking forward to the full review.
I hope this stuff turns out to be as good as it seems. I’d love to get rid of my Nvidia cards next time I upgrade.
Speaking of the current 4GB limit of HBM: is it possible, with the way DX12 handles memory allocation, that we might see something like the Xbox One’s ESRAM/DDR3 split on PC? Maybe using GDDR5 as a sort of lower speed cache for the HBM.
I imagine it would be cost prohibitive from a hardware perspective, but maybe not.
I can definitely see the concern for 4K with only 4GB of memory. It is really going to be interesting to see the new Fiji chip benchmarked at 4K using GTA V, given how that game loves to eat up memory. Hopefully HBM and all of the memory techniques that Josh mentioned really will make a difference.
HBM can go to 16 layers. AMD simply chose to go with only 4 layers with the first implementation.
I think we will see more layers, and smaller node HBM next year. And with smaller node, we may see the speed of the memory double or more. I fully expect to see 1TB memory bandwidth next year. And 16 GB would be possible on a single GPU as well.
Show and Tell has become Tell then Show.
So show me.
Believe you me, we want to! Hopefully samples will be sent out from AMD soon to sites like this.
They sure kept this under wraps!
Especially R9 Nano!
The coming dual Fiji will be interesting to see in DX12 titles, which can potentially utilize the majority of the combined 4GB+4GB memory as one.
How could that possibly work??
dx12 is just software. it can’t just magically rewrite hardware.
in this case we have two completely separate banks of memory, each connected to a chip that is then connected to the rest of the system with a PCIe bus.
for one chip to access memory on the other, data has to travel over the PCIe bus; there is no way around it. And that is going to be very slow compared to local memory regardless of what DirectX version you are using.
This was true with DX11, with its reliance on the GPU vendors to work out the implementation, but DX12 allows the engine to use a custom implementation for multiple GPUs, including sending to each GPU only the textures and meshes needed to render its portion of the scene.
With a dual card like this, you have the easiest case to test for. You have direct access to which set of memory you wish to send data to; you don’t have to mirror the memory like Crossfire and SLI did.
This is how the HUD-only Intel demonstration by Microsoft at BUILD worked: they only sent the data necessary for the HUD, then sent the rendered image to the main GPU to display.
I can think of several approaches that an engine could take, such as creating an HDR mapped image of the background assets for a level on one GPU and then using that as an LOD texture until you get closer. The point is, the engines are now free to experiment with these things.
ok but that is a long way from saying “utilize majority of combined 4GB+4GB memory as one”.
So if developers are willing to put a significant amount of time into optimizing for multiple GPUs with local memory, it’s possible to make a bit better use of memory.
I don’t see this happening in reality, but I hope I’m wrong about that.
If you have much smaller, but more numerous, draw calls, then I would think that this would result in a finer grained split between what is sent to each GPU. I wouldn’t mind hearing from actual developers. If it supports such a massive increase in the number of draw calls, then I would think that this would also allow splitting up calls between GPUs with different resources, such as an IGP and a dedicated GPU. This isn’t unifying the memory space, but it could allow it to act more like an 8 GB card rather than two 4 GB cards running almost in lock-step.
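For what it is worth, the trade-off being debated above can be shown with a small sketch. This is not DX12 API code, just a hedged illustration of mirrored (Crossfire/SLI-style) allocation versus explicit per-GPU placement under an explicit multi-adapter scheme; the asset names and sizes are invented for the example.

```python
# Hypothetical asset budget, sizes in MB; two GPUs with 4096 MB of local memory each.
assets = {
    "shadow_maps":      512,   # only needed by the GPU rendering shadows
    "gbuffer_targets":  768,   # only needed by the GPU doing deferred shading
    "hud_atlas":        128,   # only needed by the GPU compositing the HUD
    "shared_geometry": 1024,   # needed by both GPUs
    "shared_textures": 1536,   # needed by both GPUs
}

# Classic mirrored approach: every asset lives on both GPUs.
mirrored_per_gpu = sum(assets.values())

# Explicit multi-adapter: shared assets are still duplicated, but per-task
# assets live only on the GPU that actually uses them.
shared = assets["shared_geometry"] + assets["shared_textures"]
gpu0 = shared + assets["shadow_maps"] + assets["hud_atlas"]
gpu1 = shared + assets["gbuffer_targets"]

print(f"Mirrored: {mirrored_per_gpu} MB on each GPU")
print(f"Explicit: GPU0 {gpu0} MB, GPU1 {gpu1} MB")
# The explicit split frees memory on both GPUs, but anything both GPUs touch is
# still duplicated -- it is better use of 4GB+4GB, not a unified 8 GB pool.
```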
With the coming FAB process node shrinks, I hope that AMD could get 2 GPU dies on a larger interposer! Imagine 2 Fiji (or future GCN) GPUs on a single larger interposer with a wire-for-wire connection between the two GPUs, at least for whatever connection width is necessary to allow the dual GPUs more/wider bandwidth than could be had over the PCI connections on the PCB. Both GPUs could even act as one larger GPU, and this may allow AMD/others to increase yields by fabricating separate, maybe smaller, GPU dies and stacking them together on an interposer’s wide connection buses to collectively act as one bigger GPU. Modular full GPU functional units could be placed on the interposer and scale from low to high range SKUs simply by adding more GPU dies to an interposer sized appropriately to accept multiple GPUs, along with the necessary HBM stacks and possibly CPU cores; even APU derivatives could be made and placed on an interposer.
AMD’s whole product stack could be made modular and placed on an interposer, including specialized FPGAs and other specialized units. I’d love to see a graphics/workstation APU, Zen based, with an added interposer based FPGA dedicated to ray tracing and able to be reprogrammed when better algorithms were developed; imagine a gaming GPU with an FPGA on the interposer to implement the latest DX revisions, and the same for Vulkan etc. The interposer could supplant the motherboard for most logic/memory chip hosting, and with the interposer’s ability to be etched with buses measured in multi-thousands of traces wide, all these separate units on the interposer could act as one larger APU/SOC, or system on an interposer.
PCB based systems could never economically host the number of traces that could be made on an interposer, and with HBM becoming the standard on GPUs, and future APUs/SOCs, the interposer costs, along with HBM’s costs will become competitive, or even lower than regular DDR memory once the economy of scale kicks in. In the future expect interposers to also be made with logic circuits and complex connection fabrics and host grids of SOCs for supercomputer/HPC/server workloads.
That is actually one of the aims of the interposer technology. Instead of a "one size fits all" process node that needs to be highly developed and tweaked for a complex ASIC, you utilize a process node that is more efficient for a particular part of the SOC. Faster potential time to market when you use one process for example on the CPU, but a different one for the GPU portion, and yet a third for the I/O controller or analog components. Plop all of those on an inexpensive interposer and you have a potentially much more efficient design that is still high speed.
Yes I forgot about that, but interposers are the new mainboards in a limited sense, with the ability to have interconnects thousands of traces wide to whatever device chips can be attached to the interposer via those microbumps.
What I am talking about is specially engineered modular GPU units of whatever fab node/microarchitecture is most recent being able to be added to a standardized interposer based ultra-wide interconnect fabric; the more GPU/other modular units, the more GPU/other processing power available on a module. The interposer interconnect would in essence possess almost, if not exactly, the same effective bandwidth as on-die interconnects, with the interposer fabric freeing up more space on the mainboard. So yes, the interposer will allow each die attached to it to be fabbed separately on the best suited process, but it will also allow specially engineered smaller GPU units to be fabbed more to the wafer (with less costly yield issues), and those smaller specialized GPU units added in greater numbers to create low, medium, and high powered interposer based gaming SKUs. CPUs and FPGAs, as well as media decoders, could/will be fabbed separately also and added to the interposer. All of this modularity on the interposer will allow for faster inclusion of newer technology on later revisions of interposer based GPU lines/SKUs, as the device’s maker could add the newest chip (CPU, decoder, HBM, etc.) available to the interposer package. So maybe AMD could develop a standard interposer interconnect fabric, with dedicated space for CPUs, decoder chips, GPU modules, FPGAs, HBM, etc., and allow for faster improvements/revisions without having to do a complete redesign/re-tape-out of the GPU portion in order to add more revisions once newer technology becomes available. Whole computing systems on an interposer module, unlike the APU/SOC systems before, could have parts of the interposer system changed out without having to do complete reworks and re-tape-outs of the entire die, allowing for more incremental improvements before the next generation of GPU/CPU/decoder/etc. microarchitectures are available.
This is a pretty exciting one too. Altera’s Stratix FPGA is using it, as well as Knights Landing. It’s even better than conventional interposers.
http://www.intel.com/content/www/us/en/foundry/emib.html
Looks impressive on paper! The H.265 video decode accelerator is a good addition for sure! 275W TDP, awesome! Fiji looks truly like a top tier product.
Now my question is: How likely is it to get Fiji-like performance on AMD’s next gen APUs based on Sammy’s 14nm finfet fabrication process?
That would be awesome! It would be great to finally have discrete level performance out of APUs.
Here’s a question for you. How much of Intel’s speed boost in recent years is from design improvements, and how much is from the process shrink allowing them to pile on more GPU cores?
The answer should be close to the same. Just by shrinking the current design down, which is admittedly oversimplifying as just shrinking doesn’t really work, you are able to fit a larger number on the chip. And the shrink also provides power usage decreases, so the chip can handle more cores as well.
But you might be missing the biggest thing that was said on the stream. The AMD CEO stated they would be bringing HBM to their APUs.
That will probably come next year with the Zen architecture, but that would give them access to HBM 2.0, which means 16GB of memory. They can use this as a level 3 cache, or as main system memory, for say portables and small form factors.
But this is the important part. If you thought that GPU memory bandwidth is constrained, CPU memory bandwidth is literally choking in comparison.
Consider the Intel Haswell-E i7-5960X. This beast that is the gaming enthusiast’s wet dream currently maxes out at 68 GB/s of memory bandwidth if you use a quad channel configuration. The current gen of HBM is 512 GB/s, and that is set to increase for HBM 2.0.
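A quick sanity check of those two figures, assuming DDR4-2133 across four channels for the 5960X and first generation HBM at 1 Gbps per pin:

```python
# Back-of-the-envelope comparison of CPU and HBM memory bandwidth.
ddr4_channels = 4
ddr4_transfers_per_sec = 2133e6      # DDR4-2133
ddr4_channel_width_bytes = 8         # 64-bit channel

cpu_bw_gbs = ddr4_channels * ddr4_transfers_per_sec * ddr4_channel_width_bytes / 1e9
hbm_bw_gbs = 4096 * 1 / 8            # 4096-bit bus at 1 Gbps per pin

print(f"Haswell-E, quad channel DDR4-2133: ~{cpu_bw_gbs:.0f} GB/s")   # ~68 GB/s
print(f"First generation HBM:               {hbm_bw_gbs:.0f} GB/s")   # 512 GB/s
print(f"Ratio: ~{hbm_bw_gbs / cpu_bw_gbs:.1f}x")                      # ~7.5x
```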
Intel may have the fastest single thread IPC currently, but that advantage will go down majorly when the CPU can be fed almost 10X as much data. Yes, Intel will probably jump on the bandwagon, but this is major for all markets, and one of the reasons we need competition.
Now, for the sake of gamers, I hope AMD uses this as an L3 cache, at least in the highest end devices, as I would hate to be limited to 16GB of memory. (Yes, I do use that much and more, but that is neither here nor there.)
But this would seem to be the reason to move to a more SOC design in the mobile APUs.
Are CPUs actually memory bandwidth constrained? I thought they weren’t, but I may be wrong.
The CPU doesn’t actually need that much more bandwidth unless you are running a ridiculous number of cores. In the consumer space, increasing memory bandwidth has made little difference for non-streaming applications, since they are very cacheable and run mostly from cache. Also, increasing single thread performance has gotten very difficult and has already been pushed to the limits.
HBM will make a huge difference for APUs though. It will allow an APU to reach dedicated graphics performance levels. In fact, it will actually be considerably more efficient in many ways. What I want is a laptop with an HBM powered APU. With HSA, the CPU can just pass pointers to memory rather than copying everything to a separate memory space dedicated to the GPU. You get this copying even with shared memory on non-HSA systems. It is very wasteful to keep two copies of everything. HSA allows zero-copy, which will be significantly more efficient on memory space. It will also be significantly more power efficient since it does not have to do the copy. This would be a smaller difference compared to an IGP without HSA, but it will be a significant difference compared to a dedicated GPU. With a dedicated GPU, everything must be copied over PCI-e from system memory to GPU memory over high-power PCB level interfaces. This wastes a huge amount of power. With HBM and HSA, the CPU just passes a pointer to the memory and the GPU accesses it directly.
These APUs will be much more power efficient and much more efficient on memory space, while delivering dedicated GPU levels of performance. Looking at Fiji, you could easily fit a CPU core on the interposer with a slightly smaller GPU. At 14 nm, Intel’s Broadwell is only 82 mm2. Zen at 14 nm will probably be similarly sized. Because of the memory savings from HSA, 16 GB will be plenty for the entire system. Combine that with a fast SSD, and you could have a very powerful system in a tiny footprint.
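To put a rough number on the copy overhead described above, here is a small sketch assuming a PCI Express 3.0 x16 link sustaining about 12 GB/s in practice and a 1 GB asset; with HSA zero-copy the transfer simply never happens. The figures are assumptions for illustration.

```python
# Rough cost of staging a 1 GB asset into dedicated GPU memory over PCIe 3.0 x16,
# versus an HSA zero-copy hand-off where only a pointer changes hands.
asset_size_gb = 1.0
pcie3_x16_sustained_gbs = 12.0   # assumed; theoretical peak is ~15.75 GB/s

copy_time_ms = asset_size_gb / pcie3_x16_sustained_gbs * 1000
print(f"Discrete GPU: copy {asset_size_gb:.0f} GB over PCIe ~= {copy_time_ms:.0f} ms")
print("HSA APU:      pass a pointer; the GPU reads the data in place")
```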
Yes and getting more GPU power with a discrete GPU and HBM on the interposer could also come with some extra CPU processing power added on the interposer, and the discrete gaming PCI based GPU becoming a gaming APU/system in its own right. Imagine having the gaming OS and the gaming engine all running on the discrete card, at the lowest latency possible between GPU and CPU because they are right next to each other on the interposer, along with say 16 GBs of HBM to host the gaming OS/gaming engine, and frame buffer. Now imagine 2 of those systems sitting in the PCIe slots with the gaming engine, and gaming OS designed to shift to cluster computing/gaming mode and split the gaming load across both cards, all while not having to rely on the motherboard CPU for any gaming functionality, except for a limited support if needed. The motherboard CPU could be a bog standard quad core i5, or AMD equivalent, with the game running on/across one or more discrete gaming APUs on the interposer/s on the discrete card/s.
I’d love to get extra discrete CPU cores to go along with the discrete GPU/s and forget about the latency issues between the mainboard CPU, and the discrete GPU, and the gaming engines/gaming OSs running on the individual discrete gaming APUs would be able to balance the load across the discrete gaming APUs gaming cluster style. Just imagine 4 full fat Zen cores per gaming APU.
Even if you do not see gaming systems like this derived directly from those Zen based Greenland graphics workstation/HPC APU SKUs, it’s not too hard to imagine something similar for gaming! AMD’s FirePro workstation systems are going this direction, with high end graphics APUs on PCI card based SKUs for the professional graphics market and the HPC/supercomputer market. Expect FPGAs to begin showing up on the interposer module too, for certain specialized workloads in HPC and server SKUs, and probably graphics workstation systems also (ray tracing on the FPGAs).
The H.265 video decode accelerator is a good addition for sure!
GOOD ADDITION TO WHAT?
H.265 is part of Ultra HD Blu-ray and 4K TV streaming online.
If you don't have HDCP 2.2 AND HDMI 2.0, what is it worth that you have H.265?
Mr. Walrath, thanks for a great article. That's it.
Great article as always Joshua.
Any word on FP64 performance?
I believe it is the same ratio that we saw with Hawaii.
No HDMI 2.0 support.
RIP AMD
HDMI 2.0? HDMI isn’t anything special in general. I see it as an easy to connect video cable with audio support and that’s all it really is. 2.0 supports 4K which is barely a standard.
DisplayPort is where it is at and where it has been at! DVI is great too still.
AMD will surely not come close to dying because of the lack of HDMI 2.0.
(This message was not written to offend or be hurtful in any way but it is the truth)
*port, not “cable”
(edit: I didn’t want to look that stupid, even though to a lot of people I probably still do 🙂 but it’s the internet, none of us should care that much)
Does it support DisplayPort 1.3? Then it supports HDMI 2.0. Or did you mean HDMI 2.0a?
It has DisplayPort 1.2a, which carries 21.6Gbps vs HDMI 2.0’s 18Gbps.
The question is: can DP 1.2a be converted into HDMI 2.0?
It has more than enough bandwidth to do it.
I mean for resolution; I know you will lose the HDMI functions like CEC, ARC, etc…
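For reference, here is a rough link budget for the two interfaces, assuming the usual 8b/10b style encoding overhead on both and taking 3840x2160 at 60 Hz with 24-bit color as the target (active pixels only, ignoring blanking):

```python
# Rough link budget comparison for DP 1.2a vs HDMI 2.0.
def usable_gbps(raw_gbps):
    """Approximate payload rate after 8b/10b encoding (80% efficiency)."""
    return raw_gbps * 0.8

dp_1_2a  = usable_gbps(21.6)   # 4 lanes x 5.4 Gbps -> ~17.3 Gbps usable
hdmi_2_0 = usable_gbps(18.0)   # 3 channels x 6 Gbps -> ~14.4 Gbps usable

# 3840x2160 @ 60 Hz, 24-bit color, active pixels only (blanking adds more)
uhd_60 = 3840 * 2160 * 60 * 24 / 1e9   # ~11.9 Gbps

print(f"DP 1.2a usable:  {dp_1_2a:.1f} Gbps")
print(f"HDMI 2.0 usable: {hdmi_2_0:.1f} Gbps")
print(f"4K60 @ 24bpp:    {uhd_60:.1f} Gbps (plus blanking overhead)")
```

So raw bandwidth is not the obstacle for 4K60 on either link; the real questions for a DP-to-HDMI 2.0 path are whether a suitable converter exists and whether features like HDCP 2.2, CEC, and ARC survive the conversion.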
20 million televisions were sold with HDMI 2.0
DisplayPort 1.2a has nothing to do with televisions.
It works at 300 MHz, while HDMI 2.0 works at 600 MHz.
Where does this leave us with custom water loops? Do we have to buy a Fury X and rip off that AIO cooler, or can we get a Fury and take off the air cooler? Is the Fury cut down from the Fury X?
These questions are gonna have to be answered later. The waterblock guys are likely only now designing their solutions. We still don't know what Fury (minus the X) will look like or the air cooling solution being put forward for it.
And AMD might be leaving this up to the vendors. After all, vendors can partner with the current cooling solutions to differentiate.
Hopefully the pump is in the radiator.
If it is in the radiator then you can just snip the water lines and put an adapter on to fit it to your loop.
Or like usual you will see custom GPU water blocks.
Josh, Excellent, well written and extremely fair article! I’m really curious how the Fury XT will stack up against the GTX980 TI (its true competitor) both at stock speeds and with both overclocked.
As far as custom water loops (I have 2 R9 290s with EK blocks) I think we are out in the cold with the Fury XT and will have to wait for aftermarket custom water blocks for the Fury Pro.
Thanks! I too am looking forward to some hard numbers between Fury X and 980Ti. Gonna be quite the battle!
Also agree that failure to have HDMI 2 support on a high end card like the Fury XT must be addressed by AMD.
NO HDMI 2.0 NO DP 1.3 Huge disappointment
NVIDIA ALSO NO hdmi 2.0
Yes they do, since the launch of the GTX 900 series:
http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/5
How does NVIDIA have HDMI 2.0?
If you cannot play UHD Blu-ray, HDR, or Netflix 4K with an NVIDIA 960/970/980,
what kind of HDCP 2.2 is that?
What kind of HDMI 2.0 is that?
You can't have DCI-P3 color.
NVIDIA's data bandwidth is 10.2 Gbps; all information above 10.2 Gbps is cut.
So what kind of HDMI 2.0 is that? NVIDIA style?
Why does NVIDIA not specify the data bandwidth? 8.75? 10.2?
The best thing for everyone is if we have performance crown swapping between NVidia and AMD regularly. Better pricing, quicker introduction of new products.
I’ll reserve judgement until reviewers have put it through its paces, but I’m hopeful!
I'm probably most surprised about the claimed lower power numbers as compared to the older Hawaii based R9 290s. I like progress in areas such as this!
Does Fiji support HDCP 2.2?
Why would you want DRM?
NO.
NO.
IF YOU DO NOT HAVE HDMI 2.0, NO HDCP 2.2.
Does it have HDMI 2.0?
Nope.