93% of a GP100 at least…
Big Pascal finally embraces FP64 performance!
NVIDIA has announced the Tesla P100, the company's newest (and most powerful) accelerator for HPC. Based on the Pascal GP100 GPU, the Tesla P100 is built on 16nm FinFET and uses HBM2.
NVIDIA provided a comparison table, to which we have added what we know about a full GP100:
| | Tesla K40 | Tesla M40 | Tesla P100 | Full GP100 |
|---|---|---|---|---|
| GPU | GK110 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GP100 (Pascal) |
| FP32 CUDA Cores / SM | 192 | 128 | 64 | 64 |
| FP32 CUDA Cores / GPU | 2880 | 3072 | 3584 | 3840 |
| FP64 CUDA Cores / SM | 64 | 4 | 32 | 32 |
| FP64 CUDA Cores / GPU | 960 | 96 | 1792 | 1920 |
| Base Clock | 745 MHz | 948 MHz | 1328 MHz | TBD |
| GPU Boost Clock | 810/875 MHz | 1114 MHz | 1480 MHz | TBD |
| Memory Interface | 384-bit GDDR5 | 384-bit GDDR5 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory Size | Up to 12 GB | Up to 24 GB | 16 GB | TBD |
| L2 Cache Size | 1536 KB | 3072 KB | 4096 KB | TBD |
| Register File Size / SM | 256 KB | 256 KB | 256 KB | 256 KB |
| Register File Size / GPU | 3840 KB | 6144 KB | 14336 KB | 15360 KB |
| TDP | 235 W | 250 W | 300 W | TBD |
| Transistors | 7.1 billion | 8 billion | 15.3 billion | 15.3 billion |
| GPU Die Size | 551 mm² | 601 mm² | 610 mm² | 610 mm² |
| Manufacturing Process | 28 nm | 28 nm | 16 nm | 16 nm |
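As a quick sanity check on the table, the per-GPU rows should just be the per-SM rows multiplied by the number of enabled SMs. The sketch below does that multiplication. The SM counts for the Tesla K40 and M40 (15 and 24) are not listed in the table; we inferred them by dividing cores per GPU by cores per SM, so treat them as assumptions. The dictionary and function names are ours, purely for illustration.

```python
# Per-SM figures come from the table; "sms" for the K40 and M40 is inferred
# (cores per GPU / cores per SM), not quoted by NVIDIA in this table.
GPUS = {
    "Tesla K40":  {"sms": 15, "fp32_per_sm": 192, "fp64_per_sm": 64},
    "Tesla M40":  {"sms": 24, "fp32_per_sm": 128, "fp64_per_sm": 4},
    "Tesla P100": {"sms": 56, "fp32_per_sm": 64,  "fp64_per_sm": 32},
    "Full GP100": {"sms": 60, "fp32_per_sm": 64,  "fp64_per_sm": 32},
}
REGFILE_KB_PER_SM = 256  # identical for every GPU in the table


def per_gpu_totals(spec):
    """Scale the per-SM figures up to whole-GPU figures."""
    return {
        "fp32_cores": spec["sms"] * spec["fp32_per_sm"],
        "fp64_cores": spec["sms"] * spec["fp64_per_sm"],
        "regfile_kb": spec["sms"] * REGFILE_KB_PER_SM,
    }


for name, spec in GPUS.items():
    print(name, per_gpu_totals(spec))

# Fraction of the full chip that is enabled on the Tesla P100.
enabled_fraction = GPUS["Tesla P100"]["sms"] / GPUS["Full GP100"]["sms"]
print(f"Tesla P100 enables {enabled_fraction:.0%} of a full GP100")
```

The last line is where the "93% of a GP100" figure comes from: 56 of 60 SMs.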
This table is designed for developers who are interested in GPU compute, so a few variables (like ROPs) are still unknown, but it still gives us a huge insight into the “big Pascal” architecture. The jump to 16 nm allows for about twice the number of transistors (15.3 billion, up from 8 billion with GM200) in roughly the same die area (610 mm², up from 601 mm²).
A full GP100 processor will have 60 streaming multiprocessors (SMs), compared to GM200's 24, although each Pascal SM holds half as many shaders. The GP100 part that is listed in the table above is actually partially disabled, with four of the sixty SMs cut off. This leads to 3584 single-precision (32-bit) CUDA cores, which is up from 3072 in GM200. (The full GP100 architecture will have 3840 of these FP32 CUDA cores, but we don't know when or where we'll see that.) The base clock is also significantly higher than Maxwell's: 1328 MHz versus ~1000 MHz for the Titan X and 980 Ti, although Ryan has overclocked those GPUs to ~1390 MHz with relative ease. This is interesting because, even though 10.6 TeraFLOPs of single-precision performance at the boost clock is amazing, it's only about 20 to 25% more than what GM200 could pull off with an overclock.
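For anyone who wants to check that math, here is a minimal sketch of the usual peak-throughput estimate: cores, times two FLOPs per core per clock, times clock rate. The factor of two (one fused multiply-add counted as two operations) is the standard convention rather than something stated in NVIDIA's table, and the 1390 MHz figure is our approximate GM200 overclock, so the result is a ballpark. The function name is ours.

```python
def peak_tflops(cuda_cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak throughput in TeraFLOPs.

    Assumes every core retires one fused multiply-add (2 FLOPs) per clock,
    which is the usual convention for quoting peak figures.
    """
    return cuda_cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


p100_boost = peak_tflops(3584, 1480)       # Tesla P100 at its boost clock
p100_base = peak_tflops(3584, 1328)        # Tesla P100 at its base clock
gm200_overclock = peak_tflops(3072, 1390)  # GM200 at roughly 1390 MHz

print(f"Tesla P100 (boost):  {p100_boost:.2f} TFLOPs")
print(f"Tesla P100 (base):   {p100_base:.2f} TFLOPs")
print(f"GM200 (overclocked): {gm200_overclock:.2f} TFLOPs")
print(f"P100 advantage:      {p100_boost / gm200_overclock - 1:.0%}")
```

Note that the headline 10.6 TeraFLOPs only falls out when you plug in the 1480 MHz boost clock; at the 1328 MHz base clock, the same formula gives about 9.5.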
Pascal's advantage is that these shaders are significantly more complex. First, double-precision performance is finally at a 1:2 ratio with single-precision, which is the highest ratio at which both can be first-class citizens. (With enough parallelism in your calculations, you can compute two 32-bit values in place of each 64-bit one.) This yields a double-precision performance of 5.3 TeraFLOPs at stock clocks, and with just 56 operational SMs, for GP100. Compare this to GK110's 1.7 TeraFLOPs, or Maxwell's 0.2 (yes, 0.2) TeraFLOPs, and you'll see what a huge upgrade this is in calculations that need extra precision (or range).
Second, NVIDIA has also made FP16 values a first-class citizen, yielding a 2:1 performance ratio with FP32. This means that, in situations where 16-bit values are sufficient, you can get a full 2x speed-up by dropping to 16-bit. GP100, with 56 SMs enabled, will have a peak FP16 performance of 21.2 TeraFLOPs.
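The three precision tiers can be sketched with the same peak-throughput arithmetic. The FP64 figures use the FP64 core counts and boost clocks from the table; the FP16 figure is modeled simply as double the FP32 rate, per the 2:1 ratio described above. Counting two FLOPs per core per clock is our assumption (the usual fused multiply-add convention), and the helper name is ours.

```python
def peak_tflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak in TeraFLOPs, counting one FMA as two operations."""
    return cores * flops_per_core_per_clock * clock_mhz * 1e6 / 1e12


# Tesla P100 at its 1480 MHz boost clock, 56 SMs enabled.
p100_fp32 = peak_tflops(3584, 1480)
p100_fp64 = peak_tflops(1792, 1480)  # half as many FP64 cores -> 1:2 ratio
p100_fp16 = 2 * p100_fp32            # modeled from the quoted 2:1 ratio

# Older Teslas, FP64 cores at their highest listed boost clocks.
k40_fp64 = peak_tflops(960, 875)
m40_fp64 = peak_tflops(96, 1114)

print(f"P100 FP16: {p100_fp16:.1f}  FP32: {p100_fp32:.1f}  FP64: {p100_fp64:.1f}")
print(f"K40 FP64:  {k40_fp64:.2f}")
print(f"M40 FP64:  {m40_fp64:.2f}")
```

Laid out this way, the Maxwell number stops looking like a typo: with only four FP64 units per SM, there are just 96 of them on the whole chip.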
You can multiply by 60/56 to see what the full GP100 processor could be capable of, but we're not going to do that here. The reason why: FLOP rating is also dependent upon the clock rate. If GP100's 1328 MHz (1480 MHz boost) is conservative, as we found on GM200, then this rate could get much higher. Alternatively, if NVIDIA is cherry-picking the heck out of GP100 for Tesla P100, the full chip might be slower. That said, enterprise components are usually clocked lower than gaming ones, for consistency in performance and heat management, so I'd guess that the number might actually go up.
Third (yes, this list is continuing), there is a whole lot more memory performance. GP100 increases the L2 cache to 4MB, up from 3MB with GM200. Since Maxwell, NVIDIA can disable L2 cache blocks (remember the 970?), so we're not sure if this is its final amount, but I expect that it will be. 4MB is a nice, round number, and I doubt they would mess with the memory access patterns of a professional GPU for scientific applications.
They also introduced this little thing called "HBM2" that seems to be making waves. While it will not achieve the 1TB/s bandwidth that was rumored, at least not in the 16GB variant announced today, 720 GB/s is nothing to sneer at. This is a little more than double what the Titan X can do, and it should be lower latency as well. While NVIDIA hasn't mentioned this, lower latency means that a global memory access should take fewer cycles to complete, reducing the stall in large tasks, like drawing complex 3D materials. That said, GPUs already have clever ways of overcoming this issue, such as parking shaders mid-execution when they hit a global memory access, letting another shader do its thing, then returning to the original task when the needed data is available. HBM2 also supports ECC natively, which allows error correction to be enabled without losing capacity or bandwidth. It's unclear whether consumer products will have ECC, too.
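To put those bandwidth figures in perspective, peak memory bandwidth is just bus width times per-pin data rate. The sketch below works backwards from the 720 GB/s figure and the 4096-bit interface to the implied per-pin rate, then compares against the Titan X. The Titan X's 7 Gbps effective GDDR5 rate is not in NVIDIA's table; it is our assumption based on that card's published specifications. Function names are ours.

```python
def bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Peak bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * gbps_per_pin / 8


def per_pin_gbps(bandwidth_in_gbs, bus_width_bits):
    """Per-pin data rate (Gbps) implied by a bandwidth and a bus width."""
    return bandwidth_in_gbs * 8 / bus_width_bits


# Tesla P100: 720 GB/s over a 4096-bit HBM2 interface.
p100_pin_rate = per_pin_gbps(720, 4096)

# What the rumored 1 TB/s would have required on the same interface.
rumored_pin_rate = per_pin_gbps(1000, 4096)

# Titan X (GM200): 384-bit GDDR5 at an assumed 7 Gbps effective per pin.
titan_x_bw = bandwidth_gbs(384, 7.0)

print(f"P100 HBM2 per-pin rate:  {p100_pin_rate:.2f} Gbps")
print(f"Rate needed for 1 TB/s:  {rumored_pin_rate:.2f} Gbps")
print(f"Titan X bandwidth:       {titan_x_bw:.0f} GB/s")
print(f"P100 vs. Titan X:        {720 / titan_x_bw:.2f}x")
```

The takeaway is that HBM2 gets its bandwidth from width, not speed: each pin runs far slower than GDDR5, but there are more than ten times as many of them.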
Pascal also introduces two new features: NVLink and Unified Memory. NVLink is useful for multiple GPUs on an HPC cluster, allowing them to communicate at a much higher bandwidth. NVIDIA claims that Tesla P100 will support four "Links", yielding 160 GB/s of combined bandwidth across both directions. For comparison, that is about half of the bandwidth of Titan X's GDDR5, which is right there on the card beside it. This also plays in with Unified Memory, which allows the CPU to share memory space with the GPU. Developers could write serial code whose data, without performing a copy, can be modified by a GPU for a burst of highly-parallel acceleration.
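The NVLink comparison is simple enough to sketch. The per-link figure of 40 GB/s (20 GB/s in each direction) is our assumption, chosen because it is what four links must each carry to reach the quoted 160 GB/s; likewise, the Titan X's 336 GB/s is its published spec rather than a number from NVIDIA's table.

```python
NVLINK_LINKS = 4
GBS_PER_LINK_BIDIRECTIONAL = 40  # assumption: 20 GB/s each way, per link
TITAN_X_MEMORY_GBS = 336         # assumption: Titan X published GDDR5 bandwidth

nvlink_total_gbs = NVLINK_LINKS * GBS_PER_LINK_BIDIRECTIONAL
fraction_of_titan_x = nvlink_total_gbs / TITAN_X_MEMORY_GBS

print(f"NVLink aggregate: {nvlink_total_gbs} GB/s")
print(f"Relative to Titan X local memory: {fraction_of_titan_x:.0%}")
```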
Where can you find this GPU? Well, let's hear what Josh has to say about it on the next page.