Some Hints as to What Comes Next
AMD’s Capsaicin event gave a glimpse of GPUs to come
On March 14, at the Capsaicin event at GDC, AMD disclosed its roadmap for GPU architectures through 2018. There were two new names in attendance, as well as some hints at what technology will be implemented in these products. It was only one slide, but some interesting information can be inferred from what we have seen and what was said at the event and afterwards during interviews.
Polaris is the next generation of GCN products from AMD and has been shown off for the past few months. Previously, in December and at CES, we saw the Polaris 11 GPU on display. Very little is known about this product except that it is small and extremely power efficient. Last night we saw Polaris 10 being run, and we only know that it is competitive with current mainstream performance and is larger than Polaris 11. These products are purportedly based on Samsung/GLOBALFOUNDRIES 14nm LPP.
The source of near endless speculation online.
The slide AMD showed listed Polaris as having 2.5X the performance per watt of the previous 28 nm products in AMD's lineup. This is impressive, but not terribly surprising. AMD and NVIDIA both skipped the 20 nm planar node because it just did not offer the type of performance and scaling to make sense economically. Simply put, the expense was not worth the results in terms of die size improvements and, more importantly, power scaling. 20 nm planar just could not offer the type of overall performance that GPU manufacturers could achieve with 2nd and 3rd generation 28 nm processes.
What was missing from the slide was any mention of Polaris integrating either HBM1 or HBM2. Vega, the architecture after Polaris, does in fact list HBM2 as the memory technology it will be packaged with. It promises another tick up in terms of performance per watt, but that is going to come more from aggressive design optimizations and likely improvements in FinFET process technologies. Vega will be a 2017 product.
Beyond that we see Navi. It again boasts an improvement in performance per watt, as well as the inclusion of a new memory technology beyond HBM. Current conjecture is that this could be HMC (Hybrid Memory Cube). I am not entirely certain of that particular conjecture, as HMC does not necessarily improve upon the advantages of current generation HBM and upcoming HBM2 implementations. Navi will not show up until 2018 at the earliest. This *could* be a 10 nm part, but considering the struggle that the industry has had getting to 14/16nm FinFET, I am not holding my breath.
AMD provided few details about these products other than what we see here. From here on out is conjecture based upon industry trends, analysis of known roadmaps, and the limitations of the process and memory technologies that are already well known.
HBM1 is Limited in Next Gen Parts
I cannot discount the use of HBM1 technologies in certain higher end products to be used by AMD and NVIDIA, but it does appear to have too many limitations when considering these next gen parts. The biggest limitation is the 4GB of total memory that the technology currently supports. HBM2 increases this up to 32 GB, but that technology is nowhere near ready for introduction. Samsung has already started producing HBM2 parts, but quantities are unknown. SK Hynix, one of the primary partners with AMD for developing HBM, is not starting mass production of HBM2 until Q3 of this year.
The interposer and stacked memory allow high bandwidth, low latency communication with onboard memory. HBM1 is limited to 4GB though.
Power savings and PCB space are big positives for HBM, but when we consider overall bandwidth it is not that much greater than other high end GDDR-5 implementations such as the GTX 980 Ti. The Fury X features 512 GB/sec of bandwidth while the GTX 980 Ti is not that far behind with 336 GB/sec. When we look at overall video card performance between these competing products, the differences are not that great. Everyone loves bandwidth, but it is seemingly not the limiting factor in high end implementations right now. HBM2 might expose this as a myth, as it provides 1 TB/sec of bandwidth in its full implementation, but we are still many months away from having enough HBM2 memory to satisfy demand for a consumer level product.
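Those bandwidth figures fall straight out of bus width and per-pin data rate. A quick sketch, using the published specs for each card (wide-and-slow HBM versus narrow-and-fast GDDR5):

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: (bus width / 8 bits per byte) * per-pin rate."""
    return bus_width_bits / 8 * data_rate_gbps

# Fury X: four HBM1 stacks, 1024 bits each = 4096-bit bus at 1 Gbps per pin
fury_x = peak_bandwidth_gbs(4096, 1.0)     # 512.0 GB/s
# GTX 980 Ti: 384-bit GDDR5 bus at 7 Gbps per pin
gtx_980_ti = peak_bandwidth_gbs(384, 7.0)  # 336.0 GB/s

print(fury_x, gtx_980_ti)
```

The Fury X's roughly 50% bandwidth advantage on paper is exactly what fails to show up as a comparable performance advantage in games.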
All indications point to GDDR-5 and GDDR-5X as being the primary memory types that we will see in the next generation parts. 4GB just is not enough for these upcoming cards, so HBM1 is right out. 8 GB is going to be the baseline for products ranging from $300 and up. HBM2 is not going to be available until much later this year. When we look at the situation, GDDR-5 and 5X are the only options that provide the required memory capacity. AMD and NVIDIA have relatively large caches, as well as significant design expertise in memory controllers, to offset the bandwidth losses of going with GDDR-5/X.
This is not to say that we will not see a couple of SKUs utilize HBM1, but it is unlikely given the overall attitude towards 4 GB cards in the performance marketplace. It is also not impossible that AMD may implement a hybrid system that utilizes HBM as a fast access, large cache while providing a larger volume of memory using GDDR-5 off the interposer. This is a much larger jump in supposition as compared to earlier statements, but it is not an impossible scenario. 2 GB of HBM using a 2048 bit connection would be a lower power, lower latency pool of memory that would enhance the performance of any GPU in most circumstances, assuming the work was done to truly optimize that configuration in drivers and hardware.
Interposer for more than Memory
The latest 14/16nm FF processes are… interesting. They are very power efficient and can clock to good speeds. What seems to be the issue is that they do not appear truly optimized for big designs. From what we have gathered, neither Polaris 10 nor 11 is all that big. There seems to be no “big” GPU like what we initially saw with the previous 28 nm parts. Throughout the Capsaicin event, and in Ryan’s interview with Raja posted afterwards, we keep hearing about how smaller dies are the way to go moving forward. While we will eventually see larger and larger products come to market in the years ahead, it seems that right now there are some real physical and economic limits when it comes to die size.
Fiji was the first AMD GPU to integrate HBM memory into the mix. The interposer is about the max size one can get without "stitching".
We must also consider that this may be a decision based on AMD’s economic reasoning rather than real physical limitations. It could very well be that NVIDIA comes around with a monstrously sized GPU based on TSMC’s 16nm process. The rumored 17 billion transistor GPU could be a single piece of silicon that approaches 500+ mm2. This is just a rumor though. So far we have seen relatively small products being produced on these new, cutting edge processes. Even Intel has limited the size of their 14 nm products up to this point, but that will change when the 14 nm Xeons make it to market. Still, we are 1.5+ years into Intel making 14 nm parts and we are just now seeing larger sized products about to be released. AMD could be risk averse with Samsung/GLOBALFOUNDRIES while NVIDIA may be in a position to gamble on a larger initial product on TSMC’s process.
Raja stated in his presentation and interview that they need to move beyond CrossFire. To me this means that they are working on a more seamless implementation of multiple chips rendering a single workload more effectively. The Radeon Pro Duo still follows the more traditional CrossFire route by utilizing xDMA over the PLX bridge chip located on the PCB. This gives relatively low latency access to the other GPU over a pretty wide interface (PCI-E 3.0 x16). It is a good solution, but it is not perfect.
Given the hints that we have received, it appears as though AMD might be implementing a multi-chip solution utilizing an interposer to provide very high speed and wide bus width connections between graphics chips. Interposers are not just for memory solutions, but have been shown in the past by companies like Altera to integrate individual dies on a single high-speed substrate to act as a single chip. The original idea behind that was to utilize different ASICs fabricated on process technologies that are more effective for the work rather than try to design everything onto a single die and process technology.
Ask a simple question, receive a simple answer. This in no way proves that AMD is going this route, simply that it is a possibility.
The lead in here is that there seems to be a good chance that AMD will integrate multiple chips on an interposer that allows high speed communication between the GPUs. It may also allow some extra flexibility in memory access, either on the interposer in the form of a cache, or by splitting available memory channels between the two chips while communicating with memory external to the interposer. There are other more exotic potential configurations here, but from a high level this solution could work.
AMD has been developing interposer technology for years, and they have good relations with the suppliers and those providing the packaging technology. It is a logical jump to expect a multi-GPU solution using an interposer to provide communication effectively between chips, thereby moving away from the traditional CrossFire implementation that we have today. I wouldn’t mortgage my house and bet that this will be the implementation that we end up seeing, but considering what we know so far, it is certainly not beyond the realm of possibility. By utilizing many smaller dies AMD achieves better overall yields, less single-chip complexity (which speeds development and fabrication), and a scalable solution that allows products to be addressed to different markets quickly.
I could be very wrong of course. Next to the Navi chip is “Scalability”, which could very well relate to what I described above. I could be two generations of chips too early with this interposer speculation. AMD could be just playing it safe by introducing smaller GPUs for the budget and midrange market, leaving Fiji and Hawaii as the higher end products for the time being. Eventually AMD will have to address the high end of the market with a bigger Polaris chip, but so far we have only seen mention of the two smaller products with 10 and 11. Looking at the evidence around us, a larger GPU on 14 nm is a greater undertaking than what we have seen in the past.
This could be AMD's top card for some time, if my conjecture is correct.
Needless to say, it has been many years since we have seen a process node jump like the one we will be experiencing this summer. 28nm held its own for a long time, but we are finally jumping to a new node that promises far more density and power efficiency. When we combine this with design work that has been honed by having to rely on a single process node for many years, we can expect to see very efficient and fast parts. Just how fast will depend on how big the chips can get, but for now we believe that we have at least the budget and midrange covered with Polaris 11 and 10 respectively. The release of the Radeon Pro Duo, which is powered by the previous generation Fiji chip, also suggests that a larger GPU is not relatively close. While that card is aimed at the pro market and those willing to spend $1500 on a single card, it is going to be the top end AMD card for the near future.
I can barely wait for June to roll around so we can finally see these chips integrated into products for mass consumption.
Why would AMD switch to HMC, which is a joint venture between Intel and Micron?
I don't think they will. I mentioned that HMC was bandied about, but I said that it had no real advantages over HBM2 and later potential iterations.
I don’t see HMC style memory being competitive with HBM. It would take a ridiculous number of HMC channels to reach HBM levels of bandwidth. HMC uses a very narrow link (8 or 16 lanes) and serial, differential signaling. This makes it similar to PCI-e electrically. PCI-e is supposed to reach around 30 GB/s (x16 link) with version 4.0 in 2017, if it is not delayed. HMC, since it is board mounted and works over shorter distances, should be able to go a little bit faster. At 30 GB/s though, it would need around 34 chips (and channels) to reach the 1 TB/s of HBM2, which should be available next year. Even if they double that speed, you are still talking about 17 channels. A high end device will have more like 8 channels.
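The channel arithmetic in that comment can be sketched directly. The link speeds below are the rough figures quoted above, not official HMC specifications:

```python
import math

def links_needed(target_gb_s: float, per_link_gb_s: float) -> int:
    """Number of narrow serial links needed to match a target aggregate bandwidth."""
    return math.ceil(target_gb_s / per_link_gb_s)

HBM2_TARGET = 1000.0  # ~1 TB/s for a full four-stack HBM2 implementation

print(links_needed(HBM2_TARGET, 30.0))  # 34 links at PCI-e 4.0-like speeds
print(links_needed(HBM2_TARGET, 60.0))  # 17 links even at double that speed
```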
I see a future in which you will buy processors with stacked memory.
“You want an i7 with 16 or 32GB of integrated ram?”
It’ll basically act as an L4 cache, like the eDRAM on Broadwell chips. However, it will be a lot faster.
Motherboard sizes will shrink and memory bandwidth will go through the roof.
We might see this in the next 3 to 5 years.
I think an APU with HBM2 is a real possibility, however I would expect to see that more in pre-built systems.
16GB to act as an L4 cache? No, 16GB would be the system memory.
CPUs already have a good handle on what cache they use, and it’s generally quite small, though going forward you might want to investigate Intel’s 3D XPoint memory.
3 to 5 years?
Depends what you mean, but the industry doesn’t move really quickly. The technology gets applied where it makes the most sense. When you’re talking size we’re talking MOBILE devices usually and for that other things may matter more than memory bandwidth.
(memory bandwidth generally just keeps pace with processing performance anyway.)
HBM2 is going on AMD workstation SKU APUs on an interposer, and HPC/exascale systems where AMD can get a much better return on their HBM2 investment. HBM1 may be used for a longer period of time on gaming systems until the supply of HBM2 becomes more stable. Silicon Interposer bridge chips for inter module communication will allow for wide parallel traces/pathways between AMD’s APUs on an interposer. So the bridge interposers will be used to link up many HPC/workstation interposer based APUs.
These narrow bridge silicon interposer dies will host mostly traces and ancillary fabric lines for larger-system scaling of many HPC interposer based APUs. Linking two GPUs/HBM via a bridge chip is probably going to return in order to provide the necessary inter-GPU bandwidth over the thousands of traces necessary for VR based gaming; PCIe 3.0 x16 is not going to have the bandwidth/low latency necessary for VR and 4K+ gaming.
The future single card dual GPU Radeon Pro SKUs are going to be replaced with GPUs that rely on a bridge interposer for larger numbers of traces between the two GPUs, so the PCB will mostly be there to host the modules and their bridge dies. PCBs will not be able to host the thousands of parallel traces necessary to provide the bandwidth for inter-GPU communication at the low clocks/low power usage that can be had via interposers/interposer bridge connectors. For users that have more than one PCI based card, the bridge connector is probably going to return for AMD, as the interface will be thousands of wires/traces wider than PCIe based solutions. The only way to deal with the bandwidth demand is to go with thousands of parallel data paths in order to keep the clocks low and the effective bandwidth high, at least until optical inter-die/inter-card connection fabrics take over.
That government exascale funding of R&D is going to work its way into the consumer SKUs, and AMD and others are getting million dollar grants from U-Sam to build the next generation exascale systems.
There are space limitations on the interposer. I don’t know what the specified footprint in the spec is but SK Hynix’s HBM chips are about 40 mm2 while Samsung’s HBM2 chips are about 92 mm2. That means 4 stacks of HBM2 are going to take around 400 mm2. This leaves about 400 mm2 for GPUs.
Multi-GPU-on-interposer is going to be tricky for the same reason more than 4 stacks of HBM is: maximum interposer size, limited by max reticle size. Unless AMD can convince someone the interposer market is large enough to construct a high-nm fab with enormous reticles (i.e. R&D needed for a new scanner design), they’d need to dramatically reduce the size of a GPU in order to fit more than one on an interposer, and would lose the ability to use HBM alongside them.
As far as I am aware, nobody has demonstrated a technique for ‘bridging’ multiple interposers together.
Yes, there is a stitching technique for making larger interposers, but it is too expensive for products like this. The largest interposer so far is around 830 mm sq. due to the reticle limit. If you skip out on memory and just use that space for a couple of dies, then there is room for around four 150 mm sq. parts to easily fit. This is obviously an "off the cuff" estimation, as the chips will have to have some space between them. But it would be relatively easy to do a smaller interposer with 2 x 250 mm sq. chips in there. Gonna be fascinating to see if they actually go this route.
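That "off the cuff" estimate is easy to sanity check. In this sketch the 830 mm sq. reticle limit comes from the discussion above, while the 25% spacing overhead per die is purely my assumption for the gaps between chips:

```python
RETICLE_LIMIT_MM2 = 830  # largest single-exposure interposer mentioned above

def dies_that_fit(die_area_mm2: float, spacing_overhead: float = 0.25) -> int:
    """Whole dies that fit, padding each die's footprint for inter-die spacing."""
    effective_area = die_area_mm2 * (1 + spacing_overhead)
    return int(RETICLE_LIMIT_MM2 // effective_area)

print(dies_that_fit(150))  # four ~150 mm sq. dies
print(dies_that_fit(250))  # two ~250 mm sq. dies
```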
That’s what the bridge connector is for: you just place the separate interposers next to each other and use a third smaller interposer to splice the larger processor (CPU/GPU/HBM based) systems to each other. It will be easy: just take a thin strip of silicon interposer, etch traces and micro-bump pads, and connect the two larger interposers that way. You’ll be able to bridge many smaller interposer modules together that way. Sure it will cost more, but you will be able to run tens of thousands of parallel traces between the interposer modules, way more than is feasible with PCB traces. And getting high effective bandwidth and low latency between dual GPUs via a bridge interposer is what is going to save on power usage, by allowing for lower clock speeds over ultra-wide parallel connection fabrics.
The economy of scale for this type of technology will come from those government exascale computing funds that U-Sam is liberally throwing about with its exascale initiative, and exaflop/HPC computing will pay for the R&D/IP that will find its way into the consumer SKUs. AMD’s exascale/HPC/workstation APUs on an interposer will be made into consumer variants once AMD re-enters the server market and the revenues begin to come in to better fund things.
Last I heard, "stitching" an interposer happens at the photomask stage. Because the lines on the interposer are so large compared to a regular chip, you can get away with multiple exposures and having everything match up. It is simpler to do that than your "bridge". Still, a lot of silicon to process for only a couple of interposers per wafer.
SK Hynix’s HBM1 is 40 mm2 per stack while their HBM2 is 92 mm2 per stack. This means that it will take around 400 mm2 for the memory. I would be interested to know what the actual max size is in the specifications. With added padding and such, the GPU may be limited to around 400 mm2 also. That is a very big GPU on 14 nm though. I don’t know how many transistors will fit in that area. AMD has talked about their high density libraries, so transistors per unit area comparisons with CPU processes will not be usable. I don’t think 400 mm2 is much of a limiting factor for a single GPU device on 14 nm. It will be a limiting factor for multi-GPU devices though.
Die/package sizes are from this Anandtech article:
It looks like those sizes may be the JEDEC specified footprint. I don’t know if that represents the actual maximum size though. The footprint specifies the exact placement of the interface micro-bumps. I don’t know if SK Hynix is working on any different HBM configurations (higher stacks or increased capacity per die), so I guess we may still be limited to 4 GB.
I would like to see an interposer with GPU, CPU, and cache.
That is kind of the plan.
Have you seen anything about placing an optical interface device on an interposer? Since you can mix chips made on different processes, I have wondered if this would be possible.
Not optical, no. RF stuff has been done I believe.
For AMD it will be an APU on an interposer, and they are already building HPC/workstation variants! With the Zen cores on one die and the GPU on another, add to that 4 or more HBM stacks per interposer based APU. The JEDEC standard only describes what is needed to build a single HBM stack, so it’s just a matter of engineering the memory controller to use as many HBM stacks as will fit in the available interposer space after the other processor dies (CPU, GPU, FPGA, other) are added.
I expect software to really push multi-GPU optimizations. I don’t just mean AFR (alternate frame rendering) or SSR.
I’m not sure if NVLink will allow for multiple GPUs on the same card to work more as a single “virtual GPU” or not. (I mean NVLink technology only on the graphics card, not the motherboard).
It’s getting quite confusing with some talk of dual-GPU cards that work in the traditional sense but also have a chip (rumored) to optimize this for VR. I suspect that would need software (i.e. game) specifically optimized for this tech though.
No need for the interposers to bridge the dies… just replace the PLX bridge with something with more traces for more bandwidth and lower latency. Considering traces for these purposes are a max of a few centimeters instead of up to 20 inches, you wouldn’t exactly be doing the impossible, especially considering you could do 8-16 times as many traces (512-1024), if not several times that, at that short a distance, cheap, right there on the PCB itself, and even more on the backside of the GPU…
Not like printing PCBs on millimeter scale is hard nor expensive or not already business as usual…
PCB routing is much more difficult than you are making it out to be. Especially in multi-chip communication such as this.
I think NVLink is only 80 GB/s, which is faster than PCI-e, but nowhere near fast enough to share memory between chips. The local memory will be 512 GB/s to 1 TB/s. I have seen sites claiming that NVLink is 4 PCI-e type links operating at 20 GB/s each for the first generation. PCI-e 4.0 is supposed to reach about 30 GB/s for an x16 link in 2017, if it isn’t delayed. Even though these are only x16 links, they still take a lot of pins for the interface. NVLink seems to be 4 such links, I assume at x16, and this is probably for a high end device. These high speed serial links also require a complicated controller. If you tried to scale this up to a large number of links, it would eat up a huge amount of die area in addition to taking a lot of power and pins. Routing a huge number of these through a PCB would also be difficult. There is limited area and a limited number of layers.
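The order-of-magnitude gap that comment describes is stark when written out. The 4 x 20 GB/s NVLink figure and the 512 GB/s local bandwidth are the numbers quoted above, not official specs:

```python
# First-generation NVLink as reported: four PCI-e-style links at 20 GB/s each.
nvlink_total = 4 * 20.0  # 80.0 GB/s aggregate

# Low end of the 512 GB/s - 1 TB/s local HBM bandwidth range quoted above.
local_memory = 512.0

# Local memory is several times faster than the inter-chip link.
ratio = local_memory / nvlink_total
print(f"NVLink: {nvlink_total} GB/s; local memory is {ratio:.1f}x faster")
```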
I don’t know if HBM1 is actually limited to 1 GB per stack. It sounds like it can support up to 8 stacked die, although I don’t know if current HBM chips have the TSV routing channels to support 8 high stacks. It also doesn’t seem to be limited to 2 Gb per die. These are just the characteristics of SK Hynix’s first implementation. It seems like they could double the capacity by moving to 8 high stacks or increasing the capacity per die to 4 Gb. That is, if the 4 Gb die fits within the specified footprint of HBM1. Assuming that HBM1 is limited to 1 GB a stack would be like getting a 4 GB stick of DDR4 and assuming that all DDR4 will be limited to 4 GB. As far as I can tell, SK Hynix just needs to implement larger capacity. It is also ridiculous to expect a new type of memory with each new GPU. HBM2 is not just bigger HBM; it is a different spec with new features.
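The capacity arithmetic here is simple: total GB = stacks x dies per stack x Gb per die, divided by 8 to go from gigabits to gigabytes. A sketch using the figures from the comment:

```python
def hbm_capacity_gb(stacks: int, dies_per_stack: int, gbit_per_die: int) -> float:
    """Total HBM capacity in GB (gigabits / 8 = gigabytes)."""
    return stacks * dies_per_stack * gbit_per_die / 8

print(hbm_capacity_gb(4, 4, 2))  # SK Hynix's first HBM1 config: 4.0 GB
print(hbm_capacity_gb(4, 8, 2))  # 8-high stacks would double it to 8.0 GB
print(hbm_capacity_gb(4, 4, 4))  # as would 4 Gb dies: 8.0 GB
```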
I guess Polaris gets HBM1 with Polaris 10 and GDDR5 with Polaris 11. I hope that they can figure out a way to support more than 4 GB of memory, although with DX12 and Vulkan, the required memory may actually be smaller. If you throw a large number of small asynchronous compute tasks at the GPU, then it only really needs the memory to complete that task. It isn’t the same as DX11 where you need massive resources for each task due to tasks being lumped together. They may be able to support more than 4 GB several ways though. HBM1 doesn’t seem to be inherently limited to 1 GB per stack, just SK Hynix’s current implementation is.
Vega then supports HBM2 and I would expect HBM2 to be used for two generations, so Navi may actually be HBM2 also. I suspect Navi may be a more distributed compute system. Perhaps they will use multiple GPUs with high speed links between them to share the memory. This would work similarly to a multi-socket CPU board, where the links between sockets are close to the speed of the memory. You cannot do this with GPUs through a PCB. NVLink will be much faster than PCI-e, but it will be significantly slower (as in an order of magnitude) than the local memory bandwidth.
The silicon interposer can allow for high enough speed communication for GPUs to share memory. They could use two HBM2 stacks (2048-bit interface), and then use the other 2048 bits to connect to another GPU. The limitation would be the interposer size. You could probably only fit 4 HBM2 stacks and 2 GPUs unless the size of the interposer is increased significantly. They could also do something where they have 4 small GPUs with a single HBM2 stack attached to each. Navi is several years out, so we could be talking about a 10 nm chip where die sizes will be limited to very small by yields so it may make sense. Perhaps such an architecture will be implemented with Vega, and Navi will be completely different.
I have wondered if there are any plans to actually stack a GPU on top of a memory stack. HBM2 from Samsung seems to be 92 mm2. Make it a bit larger, and then stack the GPU directly on top. This would solve many issues, if it is possible. Stacking memory on top of the compute device doesn’t work well due to cooling issues, but it may be doable by switching it around. This would reduce the signal travel distance even more and lower the power once again. The bottom logic die would be used to communicate with other stacks rather than controlling the memory. Many such stacks could be placed on an interposer. The number of TSVs would be very large though. They would need to carry the memory signals, the GPU-to-GPU communication, and all of the power for the compute device.
So would it make sense that the Zen APUs coming next year would not actually be an integrated chip, but Zen cores on one die, and a GPU on another, connected only with the interposer? That could potentially give them ability to release more CPU/GPU combinations than before, and optimize dies better for their specific purpose. But then again I don’t think they would be able to use something like Polaris 11 chip there as a part of the APU if it only has a GDDR5 memory controller?
I believe the CPU + GPU is on the same die. I don’t think it’s feasible to have them as individual chips and join via an interposer.
AMD hasn’t confirmed any HBM1 or HBM2 utilization for Polaris. We may only see GDDR5 and GDDR5X here, and then Vega in early 2017 as the top-end model sporting HBM2.
(There’s a video that alludes to this where the AMD guy, whose name I just forgot, says “you’re a smart guy Ryan..” when asked about the Vega/HBM release and how that works.)
When HBM came out, I was under the impression that it was limited to 4GB because it had to live on the same interposer as the GPU. I figured with smaller 14 nm GPUs and maybe even 14 nm HBM, they’d be able to fit more memory in. It seems odd to me that it’s still limited to 4GB.