Remember last month? Remember when I said that Google’s introduction of Tesla P100s would be good leverage over Amazon, as the latter is still back in the Kepler days (because Maxwell was 32-bit focused)?
Amazon has leapfrogged them by introducing Volta-based V100 GPUs.
To compare the two parts: the Tesla P100 has 3584 CUDA cores, yielding just under 10 TFLOPs of single-precision performance. The Tesla V100, with its ridiculous die size, pushes that up to over 14 TFLOPs. Same as Pascal, they also support full 1:2:4 FP64:FP32:FP16 performance scaling. The V100 also has access to NVIDIA’s tensor cores, which are specialized for 16-bit, 4×4 multiply-add matrix operations that are apparently common in neural networks, both training and inferencing.
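If you want a mental model for what one of those tensor core operations looks like, here is a rough NumPy sketch: a 4×4 FP16 matrix multiply with the accumulation done at higher precision. (The FP32 accumulate is my reading of NVIDIA’s Volta material; treat the details as an approximation, not the exact hardware behavior.)

```python
import numpy as np

# Rough sketch of a single tensor-core style operation: D = A @ B + C
# A and B are 4x4 FP16 matrices; C and D are accumulated in FP32.
# Illustrative only -- real tensor cores perform this as one fused hardware op.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D)
```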
Amazon allows up to eight of them at once (with their p3.16xlarge instances).
So that’s cool. While Google has again been quickly leapfrogged by Amazon, it’s good to see NVIDIA getting wins in multiple cloud providers. This keeps money rolling in that will fund new chip designs for all the other segments.
“Same as Pascal, they also support full 1:2:4 FP64:FP32:FP16 performance scaling”
That would be a 1/2 DP rate relative to FP32 (12 TFLOP/s for the P100 according to TechPowerUp), but the P100 only does 4 TFLOPs of DP, so that’s 1:3 FP64:FP32.
These figures are the peak boost clock numbers, and the online Wikipedia DP documentation is lacking for some Nvidia SKUs, including not much info on the P100 full die’s complement of HP/SP/DP FP resources. Wikipedia has most of the info on the shipping products based on Pascal and other Nvidia micro-archs, but the P100 and V100 base die info is not fully complete.
“NVIDIA GP100 Silicon to Feature 4 TFLOPs DPFP Performance”
https://www.techpowerup.com/220135/nvidia-gp100-silicon-to-feature-4-tflops-dpfp-performance
Edit: HP/HP/SP
to: HP/SP/DP
???
Both the Tesla P100 & V100 have 1:2:4 FP64:FP32:FP16 performance.
The largest Tesla P100 (with NVLink) does 9.5 TFLOPs of FP32 and 4.75 TFLOPs of FP64. The largest Tesla V100 does 14.9 TFLOPs FP32 and 7.45 TFLOPs FP64.
PCPer P100: https://www.pcper.com/reviews/Graphics-Cards/NVIDIA-Announces-Tesla-P100-GP100-Reveal
PCPer V100: https://www.pcper.com/news/Graphics-Cards/NVIDIA-Announces-Tesla-V100-Volta-GPU-GTC-2017
Wikipedia: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla
Every 64 bits of register memory in the compute core is hooked up to the logic required to do either 4x 16-bit operations, 2x 32-bit operations, or 1x 64-bit operation. They don't throw in extra registers that are only hooked up to 32-bit logic to get more FP32 performance in less die space (like 31 out of every 32 32-bit registers on GM200).
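A tiny sketch of that arithmetic, with the lane count and clock as pure placeholders (not GP100's real numbers):

```python
# Sketch of the packing argument: each 64-bit register lane can issue
# 4x FP16, 2x FP32, or 1x FP64 operations per clock, so peak rates always
# land in a 1:2:4 FP64:FP32:FP16 ratio, whatever the lane count or clock.
lanes = 1000       # placeholder number of 64-bit lanes
clock_ghz = 1.0    # placeholder clock
fma_flops = 2      # one fused multiply-add counts as two FLOPs

fp64 = lanes * 1 * clock_ghz * fma_flops
fp32 = lanes * 2 * clock_ghz * fma_flops
fp16 = lanes * 4 * clock_ghz * fma_flops

print(fp32 / fp64, fp16 / fp64)   # 2.0 4.0, regardless of the placeholders
```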
From Nvidia’s Tesla P100 whitepaper:
“Tesla P100 was built to deliver exceptional performance for the most demanding compute applications,
delivering:
. 5.3 TFLOPS of double precision floating point (FP64) performance
. 10.6 TFLOPS of single precision (FP32) performance
. 21.2 TFLOPS of half-precision (FP16) performance”
(1)[See page 6 of the whitepaper/PDF]
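Those three numbers are self-consistent with the 1:2:4 scaling, and they also fall out of the usual cores × clock × 2 arithmetic. A quick check, assuming the commonly quoted 3584 CUDA cores and ~1480 MHz boost clock for the SXM2 P100 (those specifics are assumptions on my part, not from the quote above):

```python
# Back-of-the-envelope check of the whitepaper's boost-clock figures.
# Assumed GP100 (SXM2) specs: 3584 FP32 CUDA cores, ~1480 MHz boost clock.
cuda_cores = 3584
boost_ghz = 1.480
fma_flops = 2                                              # one FMA = two FLOPs

fp32_tflops = cuda_cores * boost_ghz * fma_flops / 1000    # ~10.6
fp64_tflops = fp32_tflops / 2                              # ~5.3  (1/2 rate)
fp16_tflops = fp32_tflops * 2                              # ~21.2 (2x rate)

print(round(fp64_tflops, 1), round(fp32_tflops, 1), round(fp16_tflops, 1))
# -> 5.3 10.6 21.2, matching the whitepaper's 1:2:4 figures
```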
Damn, Wikipedia, like most of the other online sources, is not fully documenting the base GPU micro-arch (GP100/etc.) base die specifications, mostly just the consumer variants, with the Quadro/GP102 documentation not fully done either. At least Nvidia publishes proper whitepapers on its SKUs, and I should have looked there instead of other sources.
(1)
“Whitepaper
NVIDIA Tesla P100
The Most Advanced Datacenter Accelerator Ever Built
Featuring Pascal GP100, the World’s Fastest GPU ”
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
P.S. In the whitepaper listed above, page 11 has a very detailed chart comparing Pascal/P100 with the two previous generations of Nvidia GPU micro-archs. On that chart there is a note that the GFLOPs figures in the chart/paper were based on boost clock numbers.
That’s one nice chart for the details listed, and it’s worth a bookmark for future reference.
I'm not really sure what you're saying. Page 6 of the white paper says that FP16 on P100 is 3x faster than FP32 on M40. Page 11 says that the number of FP64 "cores" is exactly half the FP32 count. (They're the same registers; a 64-bit operation just takes twice the register memory to operate on, so you can only do half as many per clock.)
Sorry if I'm misunderstanding what you're saying.
I’m saying there appear to be different numbers all across the web for GPU FP capability, and not one definitive source. So the Nvidia whitepaper is more of a definitive source, because I no longer trust any web-based sources that are not directly from the GPU’s maker (published whitepapers/professional journals). So yes, that 1/2 DP FP rate of yours is correct, that’s what I’m saying, but your numbers are different from Nvidia’s boost numbers in the whitepaper.
But look at all of the other figures that do not match what Nvidia states in their whitepaper on GP100. And even there the base clock FP numbers are not listed, but at least there are the boost clock numbers in that table, plus a lot of other nice info that really should be included somewhere online in an easily searchable form. It would be less confusing if reporters would state base or boost clock figures when they report on any GPU’s FP/FLOPs numbers, so readers could at least know that often omitted fact when a GPU’s FP metrics are touted.
Your figures, “The largest Tesla P100 (with NVLink) does 9.5 TFLOPs of FP32 and 4.75 TFLOPs”, are those based on the base clock? For comparison, the Nvidia whitepaper figures for GP100 (boost clock) are:
“
. 5.3 TFLOPS of double precision floating point (FP64) performance
. 10.6 TFLOPS of single precision (FP32) performance
. 21.2 TFLOPS of half-precision (FP16) performance”
So that 1/2 rate is true, but using different FP numbers.
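Running the same cores × clock × 2 arithmetic with the commonly quoted SXM2 P100 clocks (roughly 1328 MHz base and 1480 MHz boost; those clocks are an assumption on my part, not from the whitepaper quote above), the 9.5 TFLOPs figure does look like a base clock number and the 10.6 TFLOPs figure like the boost clock number:

```python
# Base vs. boost clock would explain the 9.5 vs. 10.6 TFLOPs FP32 figures.
# Assumed SXM2 P100 clocks: ~1328 MHz base, ~1480 MHz boost.
cuda_cores = 3584

for label, ghz in [("base", 1.328), ("boost", 1.480)]:
    fp32 = cuda_cores * ghz * 2 / 1000   # one FMA = two FLOPs
    fp64 = fp32 / 2                      # half-rate FP64
    print(f"{label}: {fp32:.1f} TFLOPs FP32, {fp64:.2f} TFLOPs FP64")
# base: 9.5 TFLOPs FP32, 4.76 TFLOPs FP64
# boost: 10.6 TFLOPs FP32, 5.30 TFLOPs FP64
```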
And the figures from TechPowerUp’s article are just estimates; even TechPowerUp’s GPU database is not listing GP100’s 16-bit HP FP or 64-bit DP FP numbers, but lists the SP FP (10,329 GFLOPS) number only, and maybe that’s a base clock number.
So from now on I’ll have to look for any other Nvidia whitepapers that list base and boost clock figures for the base P100 die design, or the V100 base die design, to try and build up a reference of figures as near to accurate as possible, with Nvidia whitepaper results as references. And if I cannot get it there, I’ll look for other sources for base and boost clock figures, but not figures from any enthusiast websites.
I’m no longer trusting any enthusiast websites or any enthusiast website’s GPU databases, because there is information missing, and when articles are published on enthusiast websites there are no direct references to where the reporters got their figures (FP/other) in the first place (Amazon’s figures carry less weight than Nvidia’s whitepaper figures).
So what I’m getting at is that I do not trust your figures or Amazon’s figures; I’ll trust the Nvidia whitepaper’s boost clock figures for the P100, and hopefully there will be some other Nvidia whitepapers that have base clock figures for the P100.
Now, on to looking for any Nvidia V100 whitepapers for some figures on that GPU micro-arch, as I’ll only trust what I can find from Nvidia’s whitepapers first; if I cannot get a more complete picture there, I’ll search academic research paper sources (peer-reviewed academic journals and such). Enthusiast websites are not a reliable source to be trusted.