High Bandwidth Memory
AMD shared its plans and details of its first HBM implementation with us. Come see what kind of memory system will power Fiji!
UPDATE: I have embedded an excerpt from our PC Perspective Podcast discussing the HBM technology that you might want to check out in addition to the story below.
The chances are good that if you have been reading PC Perspective or almost any other website focused on GPU technologies over the past year, you have read the acronym HBM. You might have even seen its full name: high bandwidth memory. HBM is a new technology that aims to turn the way a processor (GPU, CPU, APU, etc.) accesses memory upside down, almost literally. AMD has already publicly stated that its next generation flagship Radeon GPU will use HBM as part of its design, but it wasn’t until today that we could talk about what HBM actually offers to a high performance processor like Fiji. At its core, HBM drastically changes how the memory interface works, how much power it requires, and which metrics we will use to compare competing memory architectures. AMD started working on HBM with its industry partners more than seven years ago, and with the first retail product nearly ready to ship, it’s time to learn about HBM.
We got some time with AMD’s Joe Macri, Corporate Vice President and Product CTO, to talk about AMD’s move to HBM and how it will shift the direction of AMD products going forward.
The first step in understanding HBM is to understand why it’s needed in the first place. Current GPUs, including the AMD Radeon R9 290X and the NVIDIA GeForce GTX 980, utilize a memory technology known as GDDR5. This architecture has scaled well over the past several GPU generations, but we are starting to enter the world of diminishing returns. Balancing memory performance against power consumption is always a tough battle; just ask ARM about it. On the desktop side we have much larger power envelopes to work within, but plot GDDR5’s power curve far enough into the future and it hits a wall. The result would be either graphics cards with drastically higher power consumption or stalled performance improvements – something the graphics market has not really seen in its history.
While current and maybe even next generation GPU designs could clearly still depend on GDDR5 as the memory interface, a move to a different solution is needed for the future; AMD is simply making the jump earlier than the rest of the industry.
But GDDR5 also limits GPU designs and graphics card designs in another way: form factor. Implementing a high performance GDDR5 memory interface requires a large number of chips to reach the required bandwidth levels. Because of that, PCB real estate becomes a concern and routing those traces and chips on a board becomes complicated. And the wider the GPU memory interface (256-bit, 384-bit), the more board space is taken up for the memory implementation. As frequencies increase and power draw goes up on GDDR5, the need for larger voltage regulators becomes a concern.
This diagram provided by AMD shows the layout of the GPU and memory chips required to reach the rated bandwidth for the graphics card. Even though the GPU die is a small portion of that total area, the need to surround the GPU with 16 DRAM chips, all equidistant from their GPU PHY locations, takes time, engineering, and space.
Another potential concern is that scaling GDDR5 memory performance beyond where we are today will cause issues with power. More bandwidth requires more power, and DRAM power consumption is not linear; you see a disproportionate increase in power consumption as the bandwidth level rises. As GPUs increase compute rates and games demand more pixels for larger screens and higher refresh rates, the demand for memory bandwidth is not stabilizing and certainly isn’t regressing. Thus a move to HBM makes sense today.
Historically, when technology reaches an inflection point like this, we have seen the integration of technologies onto the same piece of silicon. In 1989 we saw Intel move cache and the floating point unit onto the processor die; in 2003 AMD was the first to merge the north bridge’s memory controller onto the CPU; then graphics, the south bridge, and even voltage regulation all followed suit.
But on-chip integration of DRAM is problematic. The process technology used for GPUs and other high performance processors traditionally differs greatly from that used for DRAM. Transistor density for a GPU is nowhere near the density of DRAM, so putting both on the same piece of silicon would degrade the maximum performance (or power consumption) of both. It might be possible to develop a process technology that works for both at the same level as current implementations, but that would drive up production cost – something all parties would like to avoid.
The answer for HBM is an interposer. The interposer is a piece of silicon that both the memory and the processor reside on, allowing the DRAM to sit in very close proximity to the GPU/CPU/APU without being on the same physical die. This close proximity allows for several very important characteristics that give HBM its advantages over GDDR5. First, it allows for extremely wide communication buses: rather than 32 bits per DRAM, we are looking at 1024 bits for a stacked array of DRAM (more on that in a minute). Being closer to the GPU also means the clocks that regulate data transfer between the memory and processor can be simplified, and slowed, saving power and design complexity. As a result, the proximity of the memory means that the overall memory design and architecture can improve performance per watt to an impressive degree.
Integration of the interposer also means that the GPU and the memory chips themselves can be made with different process technologies. If AMD wants to use the 28nm process for its GPU but wants to utilize 19nm DRAM, it can do that. The interposer itself, also made of silicon, can be built on a much larger and more cost efficient process technology as well. AMD’s first interposer has no active transistors and essentially acts like a highway for data to move from one logic location to another: memory to GPU and back. At only 100 microns thick, the interposer will not add much to the z-height of the product, and with tricks like double exposures you can build an interposer big enough for any GPU and memory requirement. As an interesting side note, AMD’s Joe Macri did tell me that the interposer is so thin that holding it in your fingers will result in a sheet-of-paper-like flopping.
AMD’s partnerships with ASE, Amkor, and UMC are responsible for the manufacturing of this first interposer – the first time I have heard UMC’s name in many years!
So now that we know what an interposer is and how it allows the HBM solution to exist today, what does the high bandwidth memory itself bring to the table? HBM is DRAM-based but was built with low power consumption and ultra-wide bus widths in mind. The idea was to target a “wide and slow” architecture, one that scales up to high amounts of bandwidth and where latency wasn’t as big of a concern. (Interestingly, latency was improved in the design without that being the intent.) The DRAM chips are stacked vertically, four high, with a logic die at the base. The DRAM dies and logic die are connected to each other with through-silicon vias (TSVs), small holes drilled in the silicon that permit die-to-die communication at incredible speeds. Allyn taught us all about TSVs back in September of 2014 after a talk at IDF, and if you are curious about how this magic happens, that story is worth reading.
Note: In reality, the GPU die and the HBM stack are approximately the same height.
Where the HBM stack logic die meets the interposer, micro-bumps are used for a more traditional communication, power transfer and installation method. These pads are also used to connect the GPU/APU/CPU to the interposer and the interposer to the package substrate.
Moving the control logic of the DRAM to the bottom of the stack allows for better utilization of die space as well as allowing for closer proximity of the PHYs (the physical connection layer) of the memory to the matching PHYs on the GPU itself. This helps to save power and simplify design.
Each memory stack in HBM 1 (more on that designation later) is comprised of four 256MB DRAM dies, for a total of 1GB of memory per stack. When compared to a single DRAM of GDDR5 (essentially a stack of one), the HBM offering changes specifications in nearly every way. The bus width of the HBM stack is now 1024 bits, though the clock speed is reduced substantially to 500 MHz. Even with GDDR5 hitting clock speeds as high as 1750 MHz, the bus width offsets that change in favor of HBM, resulting in total memory bandwidth of 128 GB/s per stack, compared to 28 GB/s per chip for GDDR5. And because of the changes to clocking styles and rates, the HBM stacks can operate at 1.3V rather than 1.5V.
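To make those per-device numbers concrete, here is a quick back-of-the-envelope calculation (a sketch using only the figures quoted above: GDDR5’s 1750 MHz clock moves data at an effective 7 Gbps per pin, while HBM’s 500 MHz double data rate interface moves 1 Gbps per pin):

```python
# Peak bandwidth = bus width (bits) x effective data rate (Gbps/pin) / 8 bits per byte

def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s for one device."""
    return bus_width_bits * gbps_per_pin / 8

# GDDR5 chip: 32-bit bus, 1750 MHz clock -> effectively 7 Gbps per pin
gddr5_chip = peak_bandwidth_gbs(32, 7.0)    # 28.0 GB/s per chip

# HBM 1 stack: 1024-bit bus, 500 MHz double data rate -> 1 Gbps per pin
hbm_stack = peak_bandwidth_gbs(1024, 1.0)   # 128.0 GB/s per stack

print(f"GDDR5: {gddr5_chip:.0f} GB/s per chip; HBM: {hbm_stack:.0f} GB/s per stack")
```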
The first iteration of HBM on the flagship AMD Radeon GPU will include four stacks of HBM, a total of 4GB of GPU memory. That should put us in the area of 500 GB/s of total bandwidth for the new AMD Fiji GPU; compare that to the R9 290X today at 320 GB/s and you’ll see a raw increase of around 56%. Memory power efficiency improves at an even greater rate: AMD claims that HBM will deliver more than 35 GB/s of bandwidth per watt of power consumed by the memory system, while GDDR5 manages just over 10 GB/s per watt.
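Extending the same arithmetic to the full four-stack card (again just a sketch from the figures above; exact Fiji numbers were not confirmed at the time of writing):

```python
# Whole-card bandwidth and the memory-system power draw implied by AMD's claims.
hbm_total = 4 * 128   # four HBM stacks -> 512 GB/s peak ("in the area of 500 GB/s")
r9_290x = 320         # GB/s for the GDDR5-based R9 290X

print(f"Raw increase: {hbm_total / r9_290x - 1:.0%}")  # ~60% at 512; ~56% vs a round 500

# AMD's efficiency claims imply rough memory-system power budgets:
print(f"HBM:   ~{hbm_total / 35:.0f} W at >35 GB/s per watt")
print(f"GDDR5: ~{r9_290x / 10:.0f} W at >10 GB/s per watt")
```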
Physical space savings are just as impressive for HBM over current GDDR5 configurations. 1GB of GDDR5 takes about 28mm x 24mm of space on a PCB, with all four 256MB packages laid out on the board. The 1GB HBM stack takes only 7mm x 5mm of space, a savings of roughly 94% in terms of surface area. Obviously that HBM stack has to be placed on the interposer itself, not on the PCB of the graphics card, but the area saved is still real. Comparing the full implementation of Hawaii and its 16 GDDR5 DRAM packages to Fiji with its HBM configuration shows us why AMD was adamant that form factor changes were coming soon. What an HBM-enabled system with 4GB of graphics memory can do in under 4900 mm2 would take 9900 mm2 to implement with GDDR5 memory technology. It’s easy to see now why the board vendors and GPU designers are excited about new places that discrete GPUs could find themselves.
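The area arithmetic is easy to verify (a sketch based on AMD’s quoted footprints):

```python
# Footprint for 1GB of memory, per AMD's figures above.
gddr5_mm2 = 28 * 24   # four 256MB GDDR5 packages on the PCB -> 672 mm^2
hbm_mm2 = 7 * 5       # one 1GB HBM stack on the interposer  ->  35 mm^2

print(f"Per-GB savings: {1 - hbm_mm2 / gddr5_mm2:.0%}")  # ~95%; AMD quotes 94%

# Whole-card comparison from AMD's slides:
print(f"GDDR5 board area: 9900 mm^2 vs HBM: 4900 mm^2 ({1 - 4900 / 9900:.0%} smaller)")
```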
Besides the spacing considerations and bandwidth improvements, there will likely be some direct changes to the GPUs that integrate support for HBM. Die size of the GPU should go down to some degree because of the reduced memory interface. With simpler clocking mechanisms and lower required clock rates, as well as much finer pitches coming in through the GPU’s PHY, integration of memory on an interposer can change the die requirements for memory connections. Macri indicated that it would be nearly impossible for any competent GPU designer to build a GPU that doesn’t save die space with a move to HBM over GDDR5.
Because AMD isn’t announcing a specific product using HBM today, it’s hard to talk specifics, but the question of total power consumption improvements was discussed. Even though we are seeing drastic improvements in memory system power consumption, the overall effect on the GPU will be muted somewhat, as the power drawn by the memory controller on a GPU is likely under 10% of the card’s total. Don’t expect a 300 watt GPU that was built on GDDR5 to translate into a 200 watt GPU with HBM. Also interesting: Macri did comment that the HBM DRAM stacks will act as a heatsink for the GPU, allowing the power dissipation of the total package and heat spreader to improve. I don’t think this will mean much in the grand scheme of high performance GPUs, but it may help AMD deal with the power consumption concerns that have plagued it over the last couple of generations.
Moving to a GPU platform with more than 500 GB/s of memory bandwidth gives AMD the opportunity to really improve performance in key areas where memory utilization is at its peak. I would assume we will see 4K and higher resolution performance improvements over previous generation GPUs where memory bandwidth is crucial. GPGPU applications could also see performance scaling above what we normally see as new GPU generations release.
An obvious concern is the limit of 4GB of memory for the upcoming Fiji GPU – even though AMD didn’t verify that claim for the upcoming release, implementation of HBM today guarantees that will be the case. Is this enough for a high end GPU? After all, both AMD and NVIDIA have been crusading for larger and larger memory capacities including AMD’s 8GB R9 290X offerings released last year. Will gaming suffer on the high end with only 4GB? Macri doesn’t believe so; mainly because of a renewed interest in optimizing frame buffer utilization. Macri admitted that in the past very little effort was put into measuring and improving the utilization of the graphics memory system, calling it “exceedingly poor.” The solution was to just add more memory – it was easy to do and relatively cheap. With HBM that isn’t the case as there is a ceiling of what can be offered this generation. Macri told us that with just a couple of engineers it was easy to find ways to improve utilization and he believes that modern resolutions and gaming engines will not suffer at all from a 4GB graphics memory limit. It will require some finesse from the marketing folks at AMD though…
The Future
High bandwidth memory is clearly the future of high performance GPUs, with both AMD and NVIDIA integrating it relatively soon. AMD’s Fiji GPU will include it this quarter, and NVIDIA’s next-generation Pascal architecture will use it too, likely releasing in 2016. NVIDIA will have to do a bit of expectation management with AMD first out of the gate, and AMD will be doing all it can to tout the advantages HBM offers over GDDR5. And there are plenty.
HBM has been teased for a long time…
I’ll be very curious to see how long it takes HBM to roll out to the entire family of GPUs from either company. The performance advantages high bandwidth memory offers come at some additional cost, at least today, and there is no clear roadmap for getting HBM to non-flagship products. AMD and the memory industry see HBM as a wide scale adoption technology, and Macri expects to see not only other GPUs using it but HPC applications, servers, APUs and more. Will APUs see an even more dramatic and important performance increase when they finally have HBM implemented on them? With system memory as the primary bottleneck for integrated GPU performance, it’s hard not to see that being the case.
When NVIDIA gets around to integrating HBM we’ll have another generational jump to HBM 2 (cleverly named). The result will be stacks of 4GB each, with per-stack bandwidth roughly doubling as per-pin data rates climb to 2 Gbps. That would alleviate any concerns over memory capacities on GPUs using HBM and improve the overall bandwidth story yet again; and all of that should be available in the next calendar year. (AMD will integrate HBM 2 at that time as well.)
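If those figures hold, the first-generation math above scales directly (a sketch assuming HBM 2 doubles the per-pin data rate to 2 Gbps over the same 1024-bit stack interface):

```python
# Projected HBM 2 numbers: 2 Gbps per pin, 4GB per stack.
hbm2_stack = 1024 * 2.0 / 8    # 256 GB/s per stack, double HBM 1
four_stacks = 4 * hbm2_stack   # 1024 GB/s, i.e. ~1 TB/s

print(f"HBM 2: {hbm2_stack:.0f} GB/s per stack; a four-stack card: "
      f"{four_stacks:.0f} GB/s and 16GB of memory")
```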
AMD has sold me on HBM for high end GPUs; I think that comes across in this story. I am excited to see what AMD has built around it and how it improves their competitive stance against NVIDIA. Don’t expect dramatic decreases in total power consumption with Fiji simply due to the move away from GDDR5, though every bit helps when you are trying to offer improved graphics performance per watt. How a 4GB limit on the memory system of a flagship card in 2015-2016 will pan out is still an open question, but the additional bandwidth HBM provides offers never-before-seen flexibility to GPU and software developers.
June everyone. June is going to be the shit.
I love how every commenter does his/her best to find and highlight the single drawback (4gb) of a completely new evolutionary improvement.
Well, all the fanboys on certain sides did the same thing when the GTX 980 was released with 4GB of RAM, yet since it’s AMD doing it this time, it’s fine. Pretty clear where the problem is.
I think the 4GB of HBM1 on the 390X will probably perform better than 4GB of GDDR5. However, how will AMD explain the huge gap between the 390X’s 4GB of HBM1 and the Nvidia Titan X with 12GB of GDDR5?
BTW, I have 2 Sapphire Tri-X R9 290s water cooled in CF in my rig, so I’m hardly an Nvidia fanboy. Nonetheless, that’s going to be a lot of memory territory to make up.
4GB of HBM will be better than GDDR5 in games that use more than 4GB, but it’s still going to start to suffer when a game needs 4GB+. We’re starting to see some of that at 1440p/4K, which these cards are more and more the focus for.
Quite easy really: HBM has three times the data transfer rate of GDDR5 (HBM = 1 Tb/s, GDDR5 = 336 Gb/s), so you don’t need 12GB of RAM. GDDR5 has already gotten to the point of diminishing returns, which is why both AMD and Nvidia have until now gone the “add more VRAM” route.
In short, it’s the data transfer rate that’s been holding cards back, so 4GB of HBM should be able to read and access data in the same amount of time as 12GB of GDDR5.
“At only 100 microns thick, the interposer will not add much to the z-height of the product, and with tricks like double exposures you can build an interposer big enough for any GPU and memory requirement. As an interesting side note, AMD’s Joe Macri did tell me that the interposer is so thin that holding it in your fingers will result in a sheet-of-paper-like flopping.”
Tech report states this differently:
“Macro said those storage chips are incredibly thin, on the order of 100 microns, and that one of them “flaps like paper” when held in the hand.”
It is the memory dies that are super thin, not the interposer. It makes sense that the memory dies are really thin, since a stack of five dies (4 DRAM + 1 logic) is the same height as the GPU die. I believe these dies are made by etching holes into the silicon wafer and filling them with metal for the TSVs. Then they build the DRAM on top. The wafer is then flipped over and polished down to expose the TSVs. The bottom logic die has TSVs too, but the GPU does not need them, so the GPU die is much thicker. I would expect the interposer to be quite thick for mechanical and thermal stability.
Also, I don’t know if the interposer size is that big of a limitation. From a previous discussion, after Josh’s comments about interposer production (which seem to have been totally wrong), it seems that the maximum size using a single reticle is over 800 square mm. That would only limit the GPU to 600 square mm, which is huge.
HBM could be a massive change in many different market segments and it is going to cause a lot of confusion. The media really needs to try and keep the facts straight and avoid semantic difficulties with the terminology.
I have seen this reported in yet another way, so it is unclear what was actually said. The interposer uses TSVs also, so it could be very thin. The TSVs in the interposer are much larger than those in the memory stack though. They are only for connections routed outside the package so this is a rather small number comparatively speaking. It will only be the PCIe interface, the display outputs, and power.
The signal to noise ratio is so low in threads involving AMD that it is difficult to find any post I actually want to read. It would be nice to be able to collapse all of these fanboy, troll, and FUD threads. I have worked at tech companies before where the sales guys admitted to getting on forums and spreading FUD about the competition in their spare time. Add in the fanboys and trolls and it is a total mess. If you are just a normal enthusiast interested in the technology, you might as well not bother reading any of these forum posts.
I have occasionally found these posts interesting from a psychological perspective, though. Are they posting because they have one of the products and feel the need to defend it through some kind of post-purchase rationalization (“I made the decision to buy it, so it must be the best”)? Are they Nvidia or AMD employees who hold a bunch of company stock? Are they just trolls stirring up trouble because they enjoy arguing?
Anyway, I am always surprised that people don’t recognize marketing tactics. Nvidia comes out with a new game bundle, which is really nice, right before a big AMD release. This is obviously to get people to buy their product now instead of waiting to see what AMD releases. If you buy a couple-hundred-dollar GPU now, are you going to upgrade to a new card a month later, even if it is significantly better?
HBM is going to be interesting for a lot of markets. It will make APUs as powerful as dedicated graphics, so for mobile we will probably see single-package APUs which include the CPU, GPU, and southbridge. The only off-package interconnect would be the IO. It wouldn’t need a PCIe link for graphics, so it would only need some PCIe links for storage. This will make a very powerful system in a very small size. Also, there is still nothing stopping them from routing a memory controller off the SoC package for more memory and using the HBM as a giant L4 cache, the way Intel’s Crystalwell integrated graphics work. The package pin count would be quite low without an external memory interface, so technically they could add some memory on the PCB if the on-package HBM isn’t large enough.
The HBM would not really be that useful on the CPU for consumer applications which are not streaming. Most non-streaming applications run from on-die caches with hit rates in the 99% range, which is why increasing system memory speed has not been increasing performance much. It obviously will help with streaming apps, but these are likely to be running on the on-die gpu, not the CPU.
HBM does not replace CPU caches. It will be much lower latency than system memory, but higher latency than on-die SRAM caches. HBM is still DRAM which is higher base latency than SRAM. The ability to keep a much larger number of “pages” (not sure what terminology they are using) open has advantages to latency, but only for applications which are not bottlenecked by the on-die caches. This is mostly big server/workstation applications which require random access to large data structures.
I can see why Intel would be dragging their feet on this type of memory tech. It would reduce the dependence on large on-die caches, so it could reduce the demand for CPUs with large on-die L3 caches. These can cost up to around $7000. I suspect a performance-competitive HBM server chip will be a lot cheaper, so Intel will lose their huge margins. This, plus the IGP rising to dedicated performance levels, is why HBM could be such a disruptive technology.
This is only half the real story.
AMD can use the interposer as a replacement for the system memory bus while placing HBM on-package with the APU or CPU. This eliminates the need for system RAM, with a huge cost benefit gained from a smaller motherboard.
HBM on-package with a tablet APU would also see a huge energy savings benefit from eliminating separate tablet RAM.
Placing 64-128 gigs of HBM on a Zen server CPU would give STAGGERING performance, not to mention enormous energy savings!
I think that AMD has an opportunity to redefine the SoC with integrated high bandwidth system memory: IHBSM.
This AMD vs. Nvidia shyt is just so crazy; way too many ignorant fanboys on both ends. If AMD dies, Nvidia becomes more of a monopoly and can screw PC gaming over on price and innovation. We need both to push new things and compete on pricing, so this “one must die” level of BS needs to stop.
I like the potential of what this new tech can bring, so I am waiting on AMD’s Fiji XT GPU to see if AMD can actually FULLY implement it to take full advantage of this. AMD/ATI has over the years been a leader of innovation in the CPU/GPU arena from a hardware perspective, but they have been victims of poorly implementing their innovations, to the point of falling prey to the competition (Intel, Nvidia) doing a better job of IMPLEMENTATION of their innovations.
I really hope that this time AMD does a better job of implementation of their tech. When it comes out we will see.
Can this development be adapted to non-volatile RAM, e.g. Everspin’s ST-MRAM, HP’s memristor, and Crossbar’s RRAM?
A high-density 3D NVRAM sounds like a worthy challenge for the “bleeding edge” R&D folks.
See: http://www.technologyreview.com/featuredstory/536786/machine-dreams/
“Failure is a postponed success.”
— Fr. John Eugene O’Toole
(my seminary Latin teacher)
MRFS
Anyone else think it’s hilarious for AMD to rely on software (engine/driver) optimizations to not run out of memory on a new high-end card in 2015?
“Macri doesn’t believe so; mainly because of a renewed interest in optimizing frame buffer utilization.”
If the past is any indication, AMD’s interests don’t correlate well with their actual abilities.
“Macri admitted that in the past very little effort was put into measuring and improving the utilization of the graphics memory system, calling it “exceedingly poor.””
I’m going to make a prediction:
AMD is going to dump this card on the market and forget about drivers for at least another six months, at which point they’ll consider WHQL-ing whatever the poor interns in the basement have come up with.