Some Hints as to What Comes Next

AMD’s Capsaicin event gave a glimpse of GPUs to come

On March 14, at the Capsaicin event at GDC, AMD disclosed its roadmap for GPU architectures through 2018.  There were two new names in attendance, as well as some hints at what technology will be implemented in these products.  It was only one slide, but some interesting information can be inferred from what was shown and what was said at the event and afterwards during interviews.

Polaris is the next generation of GCN products from AMD, and it has been shown off for the past few months.  In December and again at CES we saw the Polaris 11 GPU on display.  Very little is known about this product except that it is small and extremely power efficient.  Last night we saw Polaris 10 being run, and we only know that it is competitive with current mainstream performance and is larger than Polaris 11.  These products are purportedly based on the Samsung/GLOBALFOUNDRIES 14nm LPP process.

The source of near endless speculation online.

In the slide AMD showed, it listed Polaris as having 2.5X the performance per watt of the previous 28 nm products in AMD’s lineup.  This is impressive, but not terribly surprising.  AMD and NVIDIA both skipped the 20 nm planar node because it just did not offer the type of performance and scaling to make sense economically.  Simply put, the expense was not worth the results in terms of die size improvements and, more importantly, power scaling.  20 nm planar just could not offer the kind of overall performance that GPU manufacturers could achieve with 2nd and 3rd generation 28nm processes.

What was missing from the slide was any mention of Polaris integrating either HBM1 or HBM2.  Vega, the architecture after Polaris, does in fact list HBM2 as the memory technology it will be packaged with.  It promises another tick up in performance per watt, but that is going to come more from aggressive design optimizations and likely improvements in FinFET process technology.  Vega will be a 2017 product.

Beyond that we see Navi.  It again boasts an improvement in performance per watt, as well as the inclusion of a new memory technology beyond HBM.  Current conjecture is that this could be HMC (Hybrid Memory Cube).  I am not entirely sold on that particular conjecture, as HMC does not necessarily improve upon the advantages of current generation HBM and upcoming HBM2 implementations.  Navi will not show up until 2018 at the earliest.  This *could* be a 10 nm part, but considering the struggle the industry has had getting to 14/16nm FinFET, I am not holding my breath.

AMD provided few details about these products beyond what we see here.  From here on out is conjecture based upon industry trends, analysis of known roadmaps, and the limitations of process and memory technologies that are already well known.

HBM1 is Limiting in Next Gen Parts

I cannot discount the use of HBM1 in certain higher end products from AMD and NVIDIA, but it does appear to have too many limitations for these next gen parts.  The biggest limitation is the 4 GB of total memory that the technology currently supports.  HBM2 increases this to 32 GB, but that technology is nowhere near ready for volume introduction.  Samsung has already started producing HBM2 parts, but quantities are unknown.  SK Hynix, one of AMD’s primary partners in developing HBM, is not starting mass production of HBM2 until Q3 of this year.

The interposer and stacked memory allow high bandwidth, low latency communication with onboard memory. HBM1 is limited to 4 GB, though.

Power savings and PCB space are big positives for HBM, but when we consider overall bandwidth it is not that much greater than other high end GDDR-5 implementations such as that of the GTX 980 Ti.  The Fury X features 512 GB/sec of bandwidth, while the GTX 980 Ti is not that far behind at 336 GB/sec.  When we look at overall video card performance between these competing products, the differences are not that great.  Everyone loves bandwidth, but it is seemingly not the limiting factor in high end implementations right now.  HBM2 might expose this as a myth, as it provides 1 TB/sec of bandwidth in its full implementation, but we are still many months away from having enough HBM2 memory to satisfy demand for a consumer level product.
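As a sanity check on these figures, peak memory bandwidth is simply bus width times per-pin data rate.  A quick sketch (using the shipping Fury X and GTX 980 Ti configurations, and the nominal 2 Gbps per-pin rate for full HBM2):

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/sec: total pins * per-pin rate / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

# Fury X: four HBM1 stacks of 1024 bits each, 500 MHz DDR = 1 Gbps per pin
fury_x = peak_bandwidth_gbs(4 * 1024, 1.0)     # 512 GB/sec

# GTX 980 Ti: 384-bit GDDR-5 at 7 Gbps effective
gtx_980_ti = peak_bandwidth_gbs(384, 7.0)      # 336 GB/sec

# Full HBM2: four stacks at 2 Gbps per pin -- the ~1 TB/sec figure
hbm2_full = peak_bandwidth_gbs(4 * 1024, 2.0)  # 1024 GB/sec

print(fury_x, gtx_980_ti, hbm2_full)
```

The math makes the HBM1 trade-off clear: the huge 4096-bit bus is doing the work, while the per-pin rate is far below GDDR-5's.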

All indications point to GDDR-5 and GDDR-5X as the primary memory types we will see in the next generation parts.  4 GB just is not enough for these upcoming cards, so HBM1 is right out.  8 GB is going to be the baseline for products ranging from $300 and up, and HBM2 is not going to be available until much later this year.  When we look at the situation, GDDR-5 and 5X are the only options that provide the required memory capacity.  AMD and NVIDIA have relatively large caches as well as significant design expertise in memory controllers to offset the bandwidth losses of going with GDDR-5/X.

This is not to say that we won’t see a couple of SKUs utilize HBM1, but it is unlikely given the overall attitude towards 4 GB cards in the performance marketplace.  It is also not impossible that AMD may implement a hybrid system that utilizes HBM as a fast access, large cache while placing a larger volume of GDDR-5 memory off the interposer.  This is a much larger jump in supposition as compared to my earlier statements, but it is not an impossible scenario.  2 GB of HBM on a 2048 bit connection would be a lower power, lower latency pool of memory that would enhance the performance of any GPU in most circumstances, assuming the work was done to truly optimize that configuration in drivers and hardware.
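Using the same back-of-the-envelope math, that hypothetical cache pool is easy to size up.  The per-pin rate and the GDDR-5 configuration below are my assumptions for illustration (HBM1-class 1 Gbps per pin, a 256-bit GDDR-5 bus), not anything AMD has stated:

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/sec from bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8

# Hypothetical on-interposer HBM cache: 2048-bit link at 1 Gbps per pin (HBM1-class rate)
hbm_cache = peak_bandwidth_gbs(2048, 1.0)  # 256 GB/sec

# Illustrative off-interposer GDDR-5 pool: 256-bit at 7 Gbps (not a known SKU)
gddr5_pool = peak_bandwidth_gbs(256, 7.0)  # 224 GB/sec

print(hbm_cache, gddr5_pool)
```

Even the half-width HBM link would more than double the bandwidth of a midrange GDDR-5 bus for whatever working set fits in the 2 GB cache, which is why the idea is at least plausible.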


Interposer for more than Memory

The latest 14/16nm FinFET processes are… interesting.  They are very power efficient and can clock to good speeds.  What seems to be the issue is that they do not appear truly optimized for big designs.  From what we have gathered, neither Polaris 10 nor 11 is all that big.  There seems to be no “big” GPU like what we initially saw with the previous 28 nm parts.  Throughout the Capsaicin event, and in Ryan’s interview with Raja posted afterwards, we kept hearing about how smaller dies are the way to go moving forward.  While we will eventually see larger and larger products come to market in the years ahead, it seems that right now there are some real physical and economic limits when it comes to die size.

Fiji was the first AMD GPU to integrate HBM memory. The interposer is about the maximum size one can get without "stitching".

We must also consider that this may be a decision based on AMD’s economic reasoning rather than real physical limitations.  It could very well be that NVIDIA comes around with a monstrously sized GPU based on TSMC’s 16nm process.  The rumored 17 billion transistor GPU could be a single piece of silicon approaching 500+ mm², but this is just a rumor.  So far we have seen relatively small products being produced on these new, cutting edge processes.  Even Intel has limited the size of its 14 nm products up to this point, though that will change when the 14 nm Xeons make it to market.  Still, we are 1.5+ years into Intel making 14 nm parts and we are just now seeing larger products about to be released.  AMD could be risk averse with Samsung/GLOBALFOUNDRIES, while NVIDIA may be in a position to gamble on a larger initial product on TSMC’s process.

Raja stated in his presentation and interview that they need to move beyond CrossFire.  To me this means that they are working on a more seamless implementation of multiple chips rendering a single workload.  The Radeon Pro Duo still follows the traditional CrossFire route by utilizing XDMA over the PLX bridge chip located on the PCB.  This gives relatively low latency access to the other GPU over a pretty wide interface (PCI-E 3.0 x16).  It is a good solution, but it is not perfect.

Given the hints we have received, it appears as though AMD might be implementing a multi-chip solution utilizing an interposer to provide very high speed, wide bus width connections between graphics chips.  Interposers are not just for memory solutions; companies like Altera have shown in the past that individual dies can be integrated on a single high-speed substrate to act as a single chip.  The original idea behind that was to utilize different ASICs fabricated on the process technologies best suited to the work, rather than trying to design everything onto a single die and process technology.

Ask a simple question, receive a simple answer. This in no way proves that AMD is going this route, simply that it is a possibility.

The lead-in here is that there seems to be a good chance that AMD will integrate multiple chips on an interposer that allows high speed communication between the GPUs.  It may also allow some extra flexibility in memory access, either on the interposer in the form of a cache, or by splitting available memory channels between the two chips while communicating with memory external to the interposer.  There are other, more exotic potential configurations here, but from a high level this solution could work.

AMD has been developing interposer technology for years, and it has good relations with its suppliers and those providing the packaging technology.  It is a logical step to expect a multi-GPU solution that uses an interposer for effective communication between chips, thereby moving away from the traditional CrossFire implementation we have today.  I wouldn’t mortgage my house and bet that this is the implementation we end up seeing, but considering what we know so far, it is certainly not beyond the realm of possibility.  By utilizing many smaller dies, AMD achieves better overall yields, lower single-chip complexity that speeds development and fabrication, and a scalable solution that allows products to be addressed to different markets quickly.

I could be very wrong, of course.  Next to the Navi chip is “Scalability”, which could very well relate to what I described above, and I could be two generations of chips too early with this interposer speculation.  AMD could just be playing it safe by introducing smaller GPUs for the budget and midrange markets and leaving Fiji and Hawaii as the higher end products for the time being.  Eventually AMD will have to address the high end of the market with a bigger Polaris chip, but so far we have only seen mention of the two smaller products in Polaris 10 and 11.  Looking at the evidence around us, a larger GPU on 14 nm is a greater undertaking than what we have seen in the past.

This could be AMD's top card for some time, if my conjecture is correct.

Needless to say, it has been many years since we have seen a process node jump like the one we will be experiencing this summer.  28nm held its own for a long time, but we are finally jumping to a new node that promises far more density and power efficiency.  When we combine this with design work that has been honed by having to rely on a single process node for many years, we can expect to see very efficient and fast parts.  Just how fast will depend on how big the chips can get, but for now we believe that we have at least the budget and midrange covered with Polaris 11 and 10 respectively.  The release of the Radeon Pro Duo, which leverages the previous generation Fiji chip, also suggests that a larger GPU is not close at hand.  While that card is aimed at the pro market and those willing to spend $1,500 on a single card, it is going to be the top end AMD card for the near future.

I can barely wait for June to roll around so we can finally see these chips integrated into products for mass consumption.