GP100, the “Big Pascal” chip that was announced at GTC, will be coming to PCIe for enterprise and supercomputer customers in Q4 2016. Previously, it had only been announced with NVIDIA's proprietary NVLink connection. In fact, they also gave themselves some lead time with their first-party DGX-1 system, which retails for $129,000 USD, although we expect that was more for yield reasons. Josh calculated that each GPU in that system is worth more than the full wafer that its die was manufactured on.
This brings us to the PCIe versions. Interestingly, they have been down-binned from the NVLink version. The boost clock has been dropped to 1300 MHz, from 1480 MHz, although that is matched with a slightly lower TDP (250W versus the NVLink's 300W). This lowers the FP16 performance to 18.7 TFLOPs, down from 21.2, FP32 performance to 9.3 TFLOPs, down from 10.6, and FP64 performance to 4.7 TFLOPs, down from 5.3. This is where we get to the question: did NVIDIA reduce the clocks to hit a 250W TDP and be compatible with the passive cooling technology that previous Tesla cards utilize, or were the clocks dropped to increase yield?
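The throughput figures follow fairly directly from the boost clock. As a rough sketch, assuming the Tesla P100 has 3,584 enabled FP32 CUDA cores (a figure not stated above), that each core retires one fused multiply-add (two floating-point operations) per clock, and that FP16 and FP64 run at twice and half the FP32 rate respectively:

```python
# Rough peak-throughput estimate for the two Tesla P100 variants.
# Assumption (not from the article): 3584 enabled FP32 CUDA cores.
CUDA_CORES = 3584
FLOPS_PER_CORE_PER_CLOCK = 2  # one fused multiply-add counts as two operations


def peak_tflops(boost_mhz):
    """Return (fp16, fp32, fp64) peak TFLOPS at the given boost clock in MHz."""
    fp32 = CUDA_CORES * FLOPS_PER_CORE_PER_CLOCK * boost_mhz * 1e6 / 1e12
    # Assumed ratios: FP16 at 2x the FP32 rate, FP64 at half of it.
    return 2 * fp32, fp32, fp32 / 2


for name, mhz in [("NVLink", 1480), ("PCIe", 1300)]:
    fp16, fp32, fp64 = peak_tflops(mhz)
    print(f"{name}: FP16 {fp16:.1f}, FP32 {fp32:.1f}, FP64 {fp64:.1f} TFLOPS")
```

The results land within rounding of the quoted numbers, which suggests the clock reduction alone accounts for the performance difference between the two versions.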
They are also providing a 12GB version of the PCIe Tesla P100. I didn't realize that GPU vendors could selectively disable HBM2 stacks, but NVIDIA disabled 4GB of memory, which also dropped the bus width to 3072-bit. You would think that, for the sake of circuit simplicity, they would want to divide work in a power-of-two fashion, but, knowing that they can disable a stack, it makes me wonder why they did. Again, my first reaction is to question GP100 yield, but you wouldn't think that HBM, being such a small part of the die, is something where disabling a chunk would reclaim a lot of chips, right? That is, unless the HBM2 stacks themselves have yield issues, which would be interesting.
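The 3072-bit figure falls out of simple per-stack arithmetic. A minimal sketch, assuming the full card carries four 4GB HBM2 stacks and that each stack exposes a 1024-bit interface (neither number is spelled out above):

```python
# Assumptions (not from the article): a 1024-bit interface per HBM2 stack,
# and 16GB spread evenly across four stacks on the full Tesla P100.
BITS_PER_STACK = 1024
GB_PER_STACK = 4


def memory_config(active_stacks):
    """Return (capacity_gb, bus_width_bits) for a number of active stacks."""
    return active_stacks * GB_PER_STACK, active_stacks * BITS_PER_STACK


for stacks in (4, 3):
    capacity, width = memory_config(stacks)
    print(f"{stacks} stacks: {capacity}GB, {width}-bit bus")
```

Under those assumptions, losing 4GB and a quarter of the bus width at the same time is what you would expect from disabling exactly one stack.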
There is also still no word on a 32GB version. Samsung claimed the memory technology, 8GB stacks of HBM2, would be ready for products in Q4 2016 or early 2017. We'll need to wait and see where, when, and why it will appear.
Well, with the Power8/Power9 systems (IBM and some third-party Power8/9 licensees) being the only ones currently supporting NVLink, these new Nvidia GP100 SKUs should not cost as much, so Nvidia is going after the current market of designs that are not using NVLink or CAPI. I see Nvidia going for the traditional higher-end workstation market with this SKU, with the NVLink-enabled mezzanine-module GP100s going into the supercomputer/high-end HPC variants. I think that the reduced number of HBM2 stacks may have a cost reason, but it most certainly has an HBM2 supply reason, as Nvidia may be trying to get costs down to reach a broader market segment of PCIe-only systems.
The HBM/HBM2 JEDEC standards only deal with what is necessary to have one HBM stack! Thus the JEDEC HBM/HBM2 standards leave it up to the maker of the GPU/processor system to decide whether to have from 1 to 4, or even 5, 6, or more HBM/HBM2 stacks, with the processor’s maker in control of designing a memory controller to handle the total number of stacks. HBM2 can be clocked higher than HBM, so any lower effective bandwidth from using fewer HBM2 stacks can at least be offset by higher clocks in systems that may only be using, for example, 2 HBM2 stacks. There is a lot more flexibility with the HBM2 stacks than there was with HBM.
Nvidia probably has some lower-binned parts with defective memory management units, or they are just trying to save costs and make do with a limited supply of HBM2 stacks, so there may be many factors behind the 12GB HBM2 option and Nvidia’s reasoning on HBM2 amounts/options.
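To put rough numbers on the stacks-versus-clocks trade-off, here is a small sketch. It assumes a 1024-bit interface per stack, 1 Gbit/s per pin for first-generation HBM, and up to 2 Gbit/s per pin for HBM2; these are nominal ceilings, and shipping parts may run slower.

```python
BITS_PER_STACK = 1024  # assumed interface width of one HBM or HBM2 stack


def bandwidth_gb_s(stacks, gbit_per_pin):
    """Peak bandwidth in GB/s for a number of stacks at a per-pin data rate."""
    return stacks * BITS_PER_STACK * gbit_per_pin / 8


print(bandwidth_gb_s(4, 1.0))  # four first-generation HBM stacks
print(bandwidth_gb_s(2, 2.0))  # two HBM2 stacks at the assumed maximum rate
```

Under those assumptions, two HBM2 stacks at full speed match four first-generation stacks, which is the offset described above.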
Cool. Thanks for your thoughts!
I’d be happy with a laptop Zen/Polaris-based APU on an interposer, and an APU with just 2 HBM2 8-Hi stacks would have 16GB of HBM2 memory! So maybe a high-end laptop APU on an interposer design with 2 stacks of HBM2 and more room for a slightly larger on-interposer Polaris GPU die, and a system with no need for a discrete GPU, with the mainboard much smaller to allow for a better cooling solution.
What AMD/JEDEC need to do is approach Micron/others about creating an HBM/NVM standard that integrates some NVM NAND or NVM XPoint into the HBM stacks, with some form of extra-fast in-memory NVM/XPoint/other stores for mobile systems. XPoint is a very dense bulk-memory NVM option, so I’d imagine that many more GBs of XPoint could be added to the HBM stacks and wired up by TSVs to allow whole-block memory transfers between an XPoint die and the HBM DRAM dies on the same stack.
Just imagine a game able to store its textures and other code/data right in an NVM store in the HBM stack(s), with the HBM’s bottom logic die modified to have its own NVM controller that could, in the background, transfer textures/other information directly from NVM/XPoint to the HBM’s DRAM dies by the block, right through a TSV, at the GPU’s/CPU’s/OS’s request.
The most plausible explanation for 3072b/12GB parts is a significant failure rate in interposer assembly.
TSV assembly challenges are the primary reason that HBM2 availability delayed Vega and GP102(?) to next year, so it shouldn’t be a shock that Nvidia’s first go at stacked memory might have struggles at the package level.