Looking Towards the Professionals
We now dive into the compute performance side of the Titan V, where it really shines.
This is a multi-part story for the NVIDIA Titan V:
Earlier this week we dove into the new NVIDIA Titan V graphics card and looked at its performance from a gaming perspective. Our conclusions were more or less what we expected – the card was on average ~20% faster than the Titan Xp and ~80% faster than the GeForce GTX 1080. But with that $3000 price tag, the Titan V isn't going to win any enthusiasts over.
What the Titan V is really meant for is the compute space: developers, coders, engineers, and professionals who use GPU hardware for research, for profit, or for both. In that case, $2999 for the Titan V is simply an investment that needs to show value in select workloads. And though $3000 is still a lot of money, keep in mind that the NVIDIA Quadro GP100, the most recent part with full-performance double precision compute from the Pascal generation, still sells for well over $6000 today.
The Volta GV100 GPU offers 1:2 double precision performance, equating to 2560 FP64 cores. That is a HUGE leap over the GP102 GPU in the Titan Xp, which runs at a 1:32 ratio, giving the equivalent of just 120 FP64 cores.
| ||Titan V||Titan Xp||GTX 1080 Ti||GTX 1080||GTX 1070 Ti||GTX 1070||RX Vega 64 Liquid||Vega Frontier Edition|
|Base Clock||1200 MHz||1480 MHz||1480 MHz||1607 MHz||1607 MHz||1506 MHz||1406 MHz||1382 MHz|
|Boost Clock||1455 MHz||1582 MHz||1582 MHz||1733 MHz||1683 MHz||1683 MHz||1677 MHz||1600 MHz|
|Memory Clock||1700 MHz||11400 MHz||11000 MHz||10000 MHz||8000 MHz||8000 MHz||1890 MHz||1890 MHz|
|Memory Bus||3072-bit HBM2||384-bit G5X||352-bit G5X||256-bit G5X||256-bit||256-bit||2048-bit HBM2||2048-bit HBM2|
|Memory Bandwidth||653 GB/s||547 GB/s||484 GB/s||320 GB/s||256 GB/s||256 GB/s||484 GB/s||484 GB/s|
|TDP||250 watts||250 watts||250 watts||180 watts||180 watts||150 watts||345 watts||300 watts|
|Peak Compute||12.2 (base) / 14.9 (boost) TFLOPS||12.1 TFLOPS||11.3 TFLOPS||8.2 TFLOPS||7.8 TFLOPS||5.7 TFLOPS||13.7 TFLOPS||13.1 TFLOPS|
|Peak DP Compute||6.1 (base) / 7.45 (boost) TFLOPS||0.37 TFLOPS||0.35 TFLOPS||0.25 TFLOPS||0.24 TFLOPS||0.17 TFLOPS||0.85 TFLOPS||0.81 TFLOPS|
The current AMD Radeon RX Vega 64 and the Vega Frontier Edition both ship with a 1:16 FP64 ratio, giving the equivalent of 256 DP cores per card.
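As a sanity check on those core-count equivalents, here is a quick pure-Python sketch. The clocks and core counts are the ones quoted in this article, and the 2 FLOPS/clock factor assumes one fused multiply-add per core per clock:

```python
# FP64 core equivalents and peak double-precision throughput, derived from
# FP32 core count, FP64 ratio, and boost clock (figures as quoted above).

def fp64_equivalent(fp32_cores, ratio_denominator):
    """A 1:N FP64 ratio means one FP64 lane per N FP32 lanes."""
    return fp32_cores // ratio_denominator

def peak_tflops(cores, clock_mhz, flops_per_clock=2):
    """2 FLOPS/clock per core assumes a fused multiply-add each clock."""
    return cores * flops_per_clock * clock_mhz * 1e6 / 1e12

chips = {
    # name: (FP32 cores, FP64 ratio denominator, boost clock in MHz)
    "Titan V (GV100)":  (5120, 2, 1455),
    "Titan Xp (GP102)": (3840, 32, 1582),
    "RX Vega 64":       (4096, 16, 1677),
}

for name, (fp32, denom, clock) in chips.items():
    fp64 = fp64_equivalent(fp32, denom)
    print(f"{name}: {fp64} FP64-core equivalents, "
          f"~{peak_tflops(fp64, clock):.2f} DP TFLOPS at boost")
```

The Titan V line works out to 2560 cores and ~7.45 DP TFLOPS, matching the boost figure in the table; the other two land within rounding of the table's numbers.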
Test Setup and Benchmarks
Our testing setup remains the same from our gaming tests, but obviously the software stack is quite different.
|PC Perspective GPU Testbed|
|Processor||Intel Core i7-5960X Haswell-E|
|Motherboard||ASUS Rampage V Extreme X99|
|Memory||G.Skill Ripjaws 16GB DDR4-3200|
|Storage||OCZ Agility 4 256GB (OS), Adata SP610 500GB (games)|
|Power Supply||Corsair AX1500i 1500 watt|
|OS||Windows 10 x64|
Applications in use include:
- Cinebench R15
- Sisoft Sandra GPU Compute
- SPECviewperf 12.1
Let's not drag this out – I know you are hungry for results! (Thanks to Ken for running most of these tests for us!)
Are the current benchmarks
Are the current benchmarks using much of the Tensor Cores yet? I’m seeing one sort of AI usage in graphics software(1), so maybe those Tensor Cores can have graphics application usage. I’d like to know if there are any graphics filter plugins that make use of the Tensor Cores’ matrix math functionality in a non-AI-related way, the same way that a GPU’s shader cores can be used for other compute-related usage. Those Tensor Cores have uses other than AI if they are good at matrix math, as lots of current graphics application filtering makes use of matrix math, and Tensor Cores are just dedicated hardware for accelerating matrix/tensor (2D, 3D matrix structure) calculations.
“Adobe previews new Photoshop feature that uses AI to select subjects”
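The commenter's point that ordinary filtering is largely matrix math can be illustrated without any GPU at all. Below is a plain-Python sketch of the "im2col" trick, the standard reformulation that turns a convolution filter into a matrix product – the kind of shape that matrix-math hardware such as tensor cores accelerates. The image and kernel values are made up for illustration:

```python
# A 3x3 image filter rewritten as a matrix multiply (the "im2col" trick).
# This reformulation is what lets matrix-math hardware accelerate ordinary,
# non-AI convolution filters.

def im2col(img, k=3):
    """Unroll every kxk patch of a 2D image into a row of a matrix."""
    h, w = len(img), len(img[0])
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            rows.append([img[y + dy][x + dx]
                         for dy in range(k) for dx in range(k)])
    return rows

def conv_as_matmul(img, kernel):
    """Filter output = (patch matrix) x (flattened kernel vector)."""
    flat = [v for row in kernel for v in row]
    return [sum(p * f for p, f in zip(patch, flat)) for patch in im2col(img)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

# Identity kernel: output is the centre pixel of each 3x3 patch.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv_as_matmul(image, identity))   # centre pixels: [6, 7, 10, 11]
```

Swap in a sharpen or blur kernel and the matrix-multiply structure is unchanged, which is exactly why this class of workload maps onto matrix units.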
What is the Direct3d 12
What is the Direct3D 12 feature set for Volta – is it still the same as Pascal, or full?
Could you run feature set checker for it:
I don’t know where you got
I don’t know where you got your Vega Frontier Viewperf scores, but here are mine – and they aren’t even my best:
3Dsmax-05 – 151.01
catia-04 – 141.85
creo-01 – 90.74
energy-01 – 21.24
maya-04 – 108.80
medical-01 – 101.08
showcase-01 – 127
snx-02 – 164.98
sw-03 – 118.05
They should be in the Viewperf database
They might not have a Vega
They might not have a Vega Frontier card anymore, so their results are from 6 months ago. I imagine drivers would have improved the numbers a bit since then.
Ryan’s original Vega FE review, with viewperf numbers that exactly match the ones quoted in this article:
When/where do you mention
When/where do you mention this is a 12nm part?
The foundry claims that 12FDX can deliver “15 percent more performance over current FinFET technologies” with “50 percent lower power consumption,” at a cost lower than existing 16nm FinFET devices. http://www.fudzilla.com/news/processors/42258-tsmc-preparing-12nm-process-technology
ignore last comment that’s
ignore last comment – that’s the GloFo 12nm spec, not TSMC
Although TSMC’s 12nm process was originally planned to be introduced as a fourth-generation 16nm optimization, it will now be introduced as an independent process technology instead. Three of the company’s partners have already received tape-outs on 10nm designs and the process is expected to start generating revenues by early 2017. Apple and MediaTek are likely to be the first with 10nm TSMC-based products, while the 12nm node should become a useful enhancement to fill the competition gap before more partners are capable of building 10nm chips.
I assume you meant early
I assume you meant early 2018, not 2017 – right?
Titan V is a 12nm part
Titan V is a 12nm part
But Volta has the on that
But Volta, on that 12nm process, has the same number of ROPs(1) – 96 – as the GP102-based top-end variants. So most of Volta’s performance gains in gaming can be attributed to process node shrinks/tweaks and to Volta/GV100’s larger L2 cache, etc. And the GPU makers need to stop treating their ROP/TMU units as magic black boxes and start providing more information.
I suspect that neither AMD’s nor Nvidia’s ROP technology changes much generation to generation, and that it’s mostly the shader-to-ROP-to-TMU ratios that deliver the better gaming performance. AMD just needs to take the Vega micro-arch base die design and refactor the shader/TMU/ROP ratios more towards gaming workloads. Really, it’s that higher ROP count that is most responsible for Nvidia’s better FPS metrics relative to AMD’s current Vega 10 base die design/blueprints.
Nvidia has many base GPU designs/blueprints, with GP100/GV100 being the most compute-heavy and GP102/GV102 likewise but with a little less shader resources, aimed at professional graphics workloads. The GP102/GV102 dies have the most ROPs available, with the GP104/GV104 dies starting out the gaming-focused SKUs with their shader counts trimmed back, on down to the GP106-108/GV106-108 SKUs that have even fewer resources and much narrower buses.
The GPU micro-arch does not play much of a role in the matter; it’s the execution resources – the shaders, ROPs, TMUs, and any other hardware functionality tuned for graphics, raster, geometry, and trigonometry workloads. The ROPs are where it all comes together to be assembled and rendered, and that’s Nvidia’s strong point: Nvidia uses more ROPs and keeps some base GPU die designs in the hold to bring out for gaming usage if AMD’s offerings start getting a little too close in gaming performance.
AMD has only one base die design for desktop and professional compute – the Vega 10 die/blueprints – so AMD has one while Nvidia has many, and AMD’s Vega 10 has no extra ROPs to speak of. Nvidia has loads of die designs with more ROPs ready to be made available, like when Nvidia took the GP102 die/blueprints, with its 96 available ROPs, and spun out the GP102-based GTX 1080 Ti (88 ROPs) because the Vega 10 die variants (Vega 64/56, both with 64 ROPs) were competing very well with the GTX 1080/GTX 1070, which likewise have 64 ROPs. AMD has no extra base Vega die variants with different shader/TMU/ROP ratios, and even though Vega has far more TMU resources than Pascal, it’s those extra ROPs that Nvidia can bring out that keep Nvidia in the FPS lead – and Nvidia has higher clocks as well.
AMD’s high power usage has less to do with its GPU micro-archs and more to do with having only one compute-heavy base GPU design, with loads of power-hungry shaders, that has to be tuned for compute more than gaming. So Vega 10’s available ROPs are minimal, and more shaders are substituted because of the compute requirements of the professional markets.
According to Wikipedia:
“The render output unit, often abbreviated as “ROP”, and sometimes called (perhaps more properly) raster operations pipeline, is a hardware component in modern graphics processing units (GPUs) and one of the final steps in the rendering process of modern graphics cards. The pixel pipelines take pixel (each pixel is a dimensionless point), and texel information and process it, via specific matrix and vector operations, into a final pixel or depth value. This process is called rasterization. So ROPs control antialiasing, when more than one sample is merged into one pixel. The ROPs perform the transactions between the relevant buffers in the local memory – this includes writing or reading values, as well as blending them together. Dedicated antialiasing hardware used to perform hardware-based antialiasing methods like MSAA is contained in ROPs.
All data rendered has to travel through the ROP in order to be written to the framebuffer; from there it can be transmitted to the display.
Therefore, the ROP is where the GPU’s output is assembled into a bitmapped image ready for display.
Historically the number of ROPs, TMUs, and shader processing units/stream processors have been equal. However, from 2004, several GPUs have decoupled these areas to allow optimum transistor allocation for application workload and available memory performance. As the trend continues, it is expected that graphics processors will continue to decouple the various parts of their architectures to enhance their adaptability to future graphics applications. This design also allows chip makers to build a modular line-up, where the top-end GPUs are essentially using the same logic as the low-end products.” (1)
“Render output unit”(raster operations pipeline)
NVIDIA Volta Allegedly
NVIDIA Volta Allegedly Launching In 2017 On 12nm FinFET Technology
by Ryan Smith May 10, 2017
NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 Accelerator Announced https://www.anandtech.com/show/11367/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100-accelerator-announced
‘But starting with the raw specifications, the GV100 is something I can honestly say is an audacious GPU, an adjective I’ve never had a need to attach to any other GPU in the last 10 years. In terms of die size and transistor count, NVIDIA is genuinely building the biggest GPU they can get away with: 21.1 billion transistors, at a massive 815mm2, built on TSMC’s still green 12nm “FFN” process (the ‘n’ stands for NVIDIA; it’s a customized higher perf version of 12nm for NVIDIA).’
Yes it’s big but what
Yes, it’s big, but what percentage of that 815mm2 is usable for gaming-focused workloads? And Nvidia has no extra ROP counts above 96 on Volta GV100/GV102 relative to GP100/GP102.
So the Volta GV104 variants (GTX ##80/##70) had better start out with more than the 64 ROPs of the GTX 1080/1070 – up to 88 ROPs – or Nvidia’s previous-generation SKUs with 88 ROPs (the GP102-based GTX 1080 Ti) will compete very well with Volta/GV104. GV104 is not going to have the large L2 cache that GV100/GV102 may have available, and GV104 has to get its shader counts trimmed or power usage is going to be higher. GV100 actually has enough TMU resources in its design to top AMD’s TMU-heavy Vega designs. But Volta’s top allotment of ROPs has not changed, so maybe AMD can get its ROP count up to 96 on some new base die variant that uses the Vega GPU micro-arch with just the right amount of shaders for gaming-focused workloads.
ROPs count is tied to L2
ROP count is tied to L2 cache and the memory buses. Titan V is neutered by one channel, and it has less L2 than full GV100 (4.5MB vs 6MB), so it’s safe to assume full GV100 has a ROP count of 128 – in other words, 32 per HBM2 stack. In that way it sounds similar to Pascal GP100.
ROP count for their GDDR cards has been 8 ROPs per 32-bit channel, which I don’t believe will change. That would make 32 for a 128-bit card (~GV107), 48 for a 192-bit card (~GV106), 64 for a 256-bit card (~GV104), and 96 for a 384-bit card (~GV102).
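That rule of thumb is simple enough to sketch. Note the GV-part names are the commenter's speculation, not announced products:

```python
# Sketch of the "8 ROPs per 32-bit memory channel" rule of thumb described
# in the comment above for NVIDIA's GDDR-based parts. The GV1xx names are
# speculative placeholders, not confirmed products.
ROPS_PER_32BIT_CHANNEL = 8

def estimated_rops(bus_width_bits):
    """Estimate ROP count from memory bus width under the rule of thumb."""
    return (bus_width_bits // 32) * ROPS_PER_32BIT_CHANNEL

for name, bus in [("~GV107 (128-bit)", 128), ("~GV106 (192-bit)", 192),
                  ("~GV104 (256-bit)", 256), ("~GV102 (384-bit)", 384)]:
    print(f"{name}: {estimated_rops(bus)} ROPs")
```

The 384-bit case lands on 96, which is consistent with the GP102 ROP count discussed earlier in the thread.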
Yes but Titan V(at 4.5 MB L2)
Yes, but Titan V (at 4.5MB L2) has 1.5MB more L2 cache than Titan Xp, with both having 96 ROPs available, and even missing that one HBM2 stack there is still plenty of extra bandwidth coming from the 3 remaining HBM2 stacks.
Titan V’s gaming performance is not that much higher than Titan Xp’s, and the Volta-based GTX 1180/2080 (whatever numbering they use) is going to need to be as fast at gaming as the GTX 1080 Ti to give Nvidia users a reason to upgrade. Nvidia is like Intel in that it is competing with itself as well as with AMD: Nvidia’s Volta GPUs need to outperform both Nvidia’s Pascal offerings and AMD’s Vega offerings, or Nvidia’s customer base will not want to upgrade to the latest. So if any Vega refresh gets Vega 64 closer to the GTX 1080 Ti, then the Volta GTX 1180/2080 GV104 variants will have to compete with both Vega and Pascal (GTX 1080 Ti) – meaning 88 (the GTX 1080 Ti’s ROP count) or more ROPs.
From Nvidia development Blog:
“Similar to the previous generation Pascal GP100 GPU, the GV100 GPU is composed of multiple Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GV100 GPU consists of six GPCs, 84 Volta SMs, 42 TPCs (each including two SMs), and eight 512-bit memory controllers (4096 bits total). Each SM has 64 FP32 Cores, 64 INT32 Cores, 32 FP64 Cores, and 8 new Tensor Cores. Each SM also includes four texture units.
Figure 4: Volta GV100 Full GPU with 84 SM Units.
With 84 SMs, a full GV100 GPU has a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units. Each memory controller is attached to 768 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GV100 GPU includes a total of 6144 KB of L2 cache. Figure 4 shows a full GV100 GPU with 84 SMs (different products can use different configurations of GV100). The Tesla V100 accelerator uses 80 SMs.” (1)
So for the Tesla V100:
|FP32 Cores / SM||64|
|FP32 Cores / GPU||5120|
|FP64 Cores / SM||32|
|FP64 Cores / GPU||2560|
|Tensor Cores / SM||8|
|Tensor Cores / GPU||640|
|GPU Boost Clock||1530 MHz|
|Peak FP32 TFLOP/s*||15.7|
|Peak FP64 TFLOP/s*||7.8|
|Peak Tensor Core TFLOP/s*||125|
|Memory Interface||4096-bit HBM2|
|Memory Size||16 GB|
|L2 Cache Size||6144 KB|
|Shared Memory Size / SM||Configurable up to 96 KB|
|Register File Size / SM||256 KB|
|Register File Size / GPU||20480 KB|
|GPU Die Size||815 mm²|
|Manufacturing Process||12 nm FFN|
* boost clock
“Inside Volta: The World’s Most Advanced Data Center GPU”
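Those peak numbers fall straight out of the unit counts and the boost clock. A quick sketch – the 128 FLOPS/clock-per-tensor-core figure assumes the 4x4x4 matrix fused multiply-add NVIDIA describes, i.e. 64 multiply-adds per tensor core per clock:

```python
# Reproducing the Tesla V100 peak-throughput figures from the spec list
# above: units x FLOPS-per-unit-per-clock x boost clock.
BOOST_HZ = 1530e6  # 1530 MHz boost clock

def tflops(units, flops_per_unit_per_clock):
    return units * flops_per_unit_per_clock * BOOST_HZ / 1e12

fp32   = tflops(5120, 2)    # FMA = 2 FLOPS per core per clock
fp64   = tflops(2560, 2)
tensor = tflops(640, 128)   # 4x4x4 matrix FMA = 64 MACs = 128 FLOPS/clock

print(f"FP32:   {fp32:.1f} TFLOPS")    # ~15.7
print(f"FP64:   {fp64:.1f} TFLOPS")    # ~7.8
print(f"Tensor: {tensor:.0f} TFLOPS")  # ~125
```

All three match the figures NVIDIA publishes, which suggests the "125 Tensor TFLOPS" headline is just the same core-count arithmetic at FP16 matrix precision, not a different kind of measurement.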
Is there a good overview
Is there a good overview somewhere of what these compute tests rely on, or which strengths and weaknesses of GPUs they expose?
There are some results that do not sound quite right. Just as an example:
– Some of the ties with the Titan Xp are suspicious.
– The SiSoft Sandra 2017 GPGPU Image Processing test is said to do well on Vega due to FP16. Volta is definitely claimed to have the same double-rate FP16 relative to FP32, so is that not exposed in the API, or does the benchmark not implement it?
Volta also has those tensor
Volta also has those tensor cores, which are just matrix math units, so that’s a whole lot of 16-bit math via matrix operations that can be made to do direct image processing workloads – and, via the TensorFlow libraries, AI-accelerated image processing workloads.
Vega can make use of the same TensorFlow libraries, but Vega has no Tensor Cores to accelerate matrix math operations like Volta currently has.
So Adobe/other software can use Volta’s tensor cores for AI – to identify people in an image/video and automatically mask out the background via AI TensorFlow library functions – or graphics software can make use of the Tensor Cores for old-fashioned matrix math of the kind graphics applications use for effects/filtering.
I’d expect that now that Volta/Titan V is available, plenty of developers are using the Titan V and also the Vega Frontier Edition SKUs, which cost much less than the Titan V. The Vega FE can be purchased for around $730 to $850 on sale, and 3 FEs pack plenty of FP16 and FP32 performance for less than the price of one Titan V – or 4 Vega FEs at the $730 price point for even more FP16 and FP32 at a bit less than $3000 (one Titan V).
The TensorFlow libraries are what make for the AI, not the specific hardware, and you could run AI/TensorFlow workloads on CPUs, but that’s not very efficient compared to GPUs with their massively parallel shader cores – or, with Volta, its Tensor Cores (matrix math cores). Hell, those tensor cores on Volta would probably do great accelerating spreadsheet workloads if the numbers are not too large.
The benchmarks always lag the hardware anyway, especially any just-released hardware. The graphics software ecosystem takes months to catch up, and the tweaking never ceases, even after the next generation arrives, if a more efficient way of doing the calculations is discovered for the older hardware.
The key high-cost operation
The key high-cost operation of statistical learning algorithms (aka “deep learning” if you like buzzwords) is optimization – i.e. finding the point in a high-dimensional state space where a let’s-just-call-it-positive-valued “loss function” obtains its minimum value.
Long story short, this means either inverting high-dimensional matrices or performing clever and/or unnatural acts to avoid directly inverting those matrices (look up Newton’s method from Calc 101 to get an idea why).
These methods are quite sensitive to round-off error and are generally recommended to be performed in double.
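That sensitivity is easy to demonstrate without any GPU. The sketch below simulates naive half-precision (binary16) accumulation by rounding through Python's `struct` `'e'` format after every add, next to a native double-precision accumulator:

```python
import struct

def to_half(x):
    """Round a Python float through IEEE 754 binary16 (half precision)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Naively accumulate 0.1 five thousand times; true answer is 500.
half_sum, double_sum = 0.0, 0.0
for _ in range(5000):
    half_sum = to_half(half_sum + to_half(0.1))
    double_sum += 0.1

# The half-precision accumulator stalls once its spacing between
# representable values (ulp) grows larger than the addend can register.
print(f"binary16 accumulator: {half_sum}")       # stalls around 256
print(f"binary64 accumulator: {double_sum:.6f}") # ~500
```

The half-precision run stops making progress far short of the true sum, which is exactly the class of round-off behavior that pushes optimization codes towards double precision (or at least towards wider accumulators).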
The publicly available Nvidia documentation shows that the tensor cores’ principal function is an affine operation: multiplying a pair of 4-dimensional 16-bit vector-matrix elements, optionally adding a 4-D vector-matrix offset (sound familiar, graphics people?), and storing the result in a 32-bit result (because multiplication).
This may be useful in the so-called “inference phase” of “deep learning”, however, “inference phase” is likewise a fairly recent buzzword in the field which very explicitly means the part of the process where no learning is taking place. Doing inference quickly is important for real-time controls such as driving a car, but in real time systems where an error can kill people, I would personally opt for avoiding roundoff error.
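For the curious, here is a toy pure-Python emulation of that D = A×B + C step – inputs rounded to FP16, products accumulated at full precision – under the assumption of 4×4 operands as NVIDIA describes. It only mimics the numerics; the real operation is a single hardware instruction:

```python
import struct

def half(x):
    """Round a float through IEEE 754 binary16, the tensor core input precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mma_4x4(A, B, C):
    """Emulate D = A*B + C: FP16-rounded inputs, wide accumulation."""
    Ah = [[half(v) for v in row] for row in A]
    Bh = [[half(v) for v in row] for row in B]
    return [[sum(Ah[i][k] * Bh[k][j] for k in range(4)) + C[i][j]
             for j in range(4)] for i in range(4)]

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z = [[0.0] * 4 for _ in range(4)]
A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]

# Small integers are exactly representable in binary16, so A x I + 0 == A.
assert mma_4x4(A, I, Z) == A
```

Precision only survives here because the inputs are exactly representable in half precision; with general data, the rounding of A and B into FP16 is where the information loss happens, regardless of how wide the accumulator is.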
Furthermore, Deep Learning frameworks – particularly those pushed by hardware vendors – have had their evolution shaped by the hardware that runs them fastest, or, in the somewhat more sobering case, their evolution has been shaped by the hardware that is promoted by online courses in Deep Learning that treat learning as a grey-box to be implemented with frameworks.
What should be astronomically improved is operations that require double – i.e. learning phase. However that is not what you hear if you go poking around tests being done by the deep learning community. Google it for yourself.
Finally, there’s the question of why the Titan Xp can only do single floats while the many-times-more-expensive V100, GP100, and Titan V can do doubles. Quadro users in the CG community have been complaining about similar discrepancies for years. They seem to have some well-substantiated theories regarding the cause of the inability of lower-priced cards to do 64-bit.
Bottom line, none of these benchmarks really have much to say about the performance of the Titan V relative to the bold and profound claims of its marketing materials.
I am fairly confident we will soon see benchmarks that show the value of FP64 in the Titan V, but the Tensor Cores will be more of a stretch. But the hardware vendors have recently been quite bullish on Deep Learning frameworks, and how they are a game changer that allows users to work at a high level without concerning themselves with what’s under the hood.
And neither hardware vendors nor cloud services providers are particularly motivated to steer users away from practices that cause them to buy the much-more-expensive hardware nor from practices that cause them to use twenty times the cloud compute time that they otherwise might.
Food for thought: as our friends The Lawyers like to rhetorically ask, “Cui bono?”
Why was TITAN V’s driver
Why was the Titan V’s driver prohibited from being used in the data center?
Because a data center is not
Because a data center is not a PC Bang…
(hint is in the driver name “Nvidia Geforce Software”)