How deep is your learning?
NVIDIA’s new Tensor Cores tested!
Recently, we've had some hands-on time with NVIDIA's new TITAN V graphics card. Equipped with the GV100 GPU, the TITAN V has shown us some impressive results in both gaming and GPGPU compute workloads.
However, one of the most interesting areas that NVIDIA has been touting for GV100 has been deep learning. With a 1.33x increase in single-precision FP32 compute over the Titan Xp, and the addition of specialized Tensor Cores for deep learning, the TITAN V is well positioned for deep learning workflows.
In mathematics, a tensor is a multi-dimensional array of numerical values defined with respect to a given basis. While we won't go deep into the math behind them, tensors are a crucial data structure for deep learning applications.
NVIDIA's Tensor Cores aim to accelerate tensor-based math by performing fused matrix multiply-accumulate operations on half-precision FP16 inputs: each Tensor Core multiplies two 4x4 FP16 matrices and adds the result to a 4x4 accumulator held at FP32 precision. The GV100 GPU contains 640 of these Tensor Cores to accelerate FP16 neural network training.
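As a rough illustration (a sketch of the arithmetic only; the hardware details are more involved), the per-core operation can be emulated in NumPy:

```python
import numpy as np

# Sketch of the operation each Tensor Core performs: D = A x B + C,
# where the 4x4 inputs A and B are FP16 and the accumulator is FP32.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# The hardware takes FP16 inputs but accumulates the products at FP32
# precision; upcasting before the matmul emulates that behavior here.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```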
It's worth noting that this is not the first hardware dedicated to tensor operations; Google, among others, has developed its own silicon (the Tensor Processing Unit) for these specific functions.
Test Setup
| PC Perspective Deep Learning Testbed | |
| --- | --- |
| Processor | AMD Ryzen Threadripper 1920X |
| Motherboard | GIGABYTE X399 AORUS Gaming 7 |
| Memory | 64GB Corsair Vengeance RGB DDR4-3000 |
| Storage | Samsung SSD 960 Pro 2TB |
| Power Supply | Corsair AX1500i 1500 watt |
| OS | Ubuntu 16.04.3 LTS |
| Drivers | AMD: AMDGPU-Pro 17.50 / NVIDIA: 387.34 |
For our NVIDIA testing, we used the NVIDIA GPU Cloud 17.12 Docker containers for both TensorFlow and Caffe2 inside of our Ubuntu 16.04.3 host operating system.
AMD testing was done using the hiptensorflow port from the AMD ROCm GitHub repositories.
For all tests, we are using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) data set.
TensorFlow
Originally based on an internal Google development project, TensorFlow is one of the most popular open-source deep learning frameworks available to researchers. With GPU support written in CUDA, TensorFlow is a mature framework with support for many different deep learning models.
There are two key things to watch here as far as performance is concerned: batch size and the precision at which the model is trained. A larger batch size passes more items into the model to be processed at once, shortening total training time and widening the performance delta between devices, but it is constrained by the memory available on your system.
The more important detail for this testing is the precision at which the model is trained. Volta enables training the network at FP16 (half precision), and in FP16 mode the specialized Tensor Cores of the GV100 GPU are used. In FP32 mode, the traditional CUDA cores handle the training instead. This gives us a good comparison of how effective the Tensor Cores are relative to traditional GPU stream processors.
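To make that distinction concrete, here is a minimal sketch, in the TensorFlow 1.x style the 17.12 containers use, of how a model's precision gets switched; the variable and layer names are illustrative rather than taken from the actual benchmark scripts:

```python
import tensorflow as tf

# The dtype is the only switch between the two test modes: tf.float16
# engages Volta's Tensor Cores, while tf.float32 runs on the CUDA cores.
dtype = tf.float16

# Input batch; the leading None dimension is the (tunable) batch size.
images = tf.placeholder(dtype, shape=[None, 224, 224, 3])

# The common mixed-precision recipe: keep master weights in FP32 for
# numerical stability and cast them to FP16 for the actual compute.
weights = tf.get_variable('conv1_w', [7, 7, 3, 64], dtype=tf.float32)
conv = tf.nn.conv2d(images, tf.cast(weights, dtype),
                    strides=[1, 2, 2, 1], padding='SAME')
```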
Please note that scores of 0 in these results mean either that the GPU does not support FP16 training (the Titan Xp, and the Vega Frontier Edition with the current software), or that the batch size was not supported with our 64GB of system memory.
Across the three different models we tested with TensorFlow, some very consistent performance traits emerge. With traditional FP32 operations, the Titan V sees a 15-25% advantage in training over the last-generation GP102-based Titan Xp.
The AMD GPU, however, falls far behind both the Titan V and the Titan Xp. This is likely because TensorFlow was originally written in CUDA, as opposed to OpenCL, which would run natively on AMD GPUs. Instead, AMD GPUs must run the hiptensorflow project, which consists of CUDA code converted to portable C++ through AMD's HIP translation tool. From the results, it's clear that this conversion carries a significant performance downside, and anyone interested in training TensorFlow-powered neural networks should look beyond AMD GPUs for the moment.
When taking FP16 into account, the Tensor Cores in the Titan V deliver major performance benefits, ranging from 40-80%, and over twice the performance of the last-generation Titan Xp running in FP32 mode.
More importantly, moving to FP16 also allows us to hit higher batch sizes, which means training an entire network will be even faster: an improvement of over 120% in our testing, going from the largest batch sizes we could hit on the Titan Xp with FP32 to larger, FP16-based batches on the Titan V.
(Editor's note: The AMD Vega architecture supports "Rapid Packed Math," which packs two FP16 operations into each 32-bit lane for essentially double the FP16 rate. While I'm not certain it's the case, it seems likely that these deep learning workloads could see performance improvements if the capability is enabled in future software optimizations.)
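As a rough sketch of why packed FP16 can double throughput (this illustrates the storage layout only, not Vega's actual instruction set): two FP16 values fit in the same 32-bit register lane as a single FP32 value, so one packed instruction can operate on both halves at once.

```python
import numpy as np

# Two FP16 values occupy exactly the 4 bytes of one FP32 value, which is
# what lets a 32-bit ALU lane process a pair of FP16 operands per cycle.
pair = np.array([1.5, -2.25], dtype=np.float16)
packed = pair.view(np.uint32)  # both halves now sit in one 32-bit word

print(hex(packed[0]))  # a single word encoding both FP16 values
```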
Caffe2
In order to validate this performance, we also compared the Titan Xp and the Titan V on another deep learning staple, Caffe2.
Caffe2 is another popular open-source deep learning framework, this one developed by Facebook. For our testing, we are using the same ResNet-50 model as in some of our TensorFlow testing above.
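For readers unfamiliar with Caffe2's Python API, here is a heavily simplified sketch of how a network is built and run; a single convolution stands in for the full 50-layer ResNet, and the blob names are illustrative:

```python
import numpy as np
from caffe2.python import workspace, model_helper, brew

# Toy one-layer stand-in for ResNet-50's first convolution.
model = model_helper.ModelHelper(name="conv_bench")
brew.conv(model, 'data', 'conv1', dim_in=3, dim_out=64,
          kernel=7, stride=2, pad=3)

# Feed a batch of 64 224x224 RGB images (NCHW, Caffe2's default order).
workspace.FeedBlob('data', np.random.rand(64, 3, 224, 224).astype(np.float32))
workspace.RunNetOnce(model.param_init_net)  # initialize the weights
workspace.CreateNet(model.net)
workspace.RunNet(model.net.Proto().name)    # one forward pass
```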
With the ResNet-50 model, we see similar results from Caffe2 as we did in TensorFlow. FP32-based training sees a 16% increase moving to the more powerful Titan V, but FP16 training provides an incredible 94% increase in throughput between the Titan Xp (in FP32 mode) and the Titan V at the same batch size.
Much as we expected, NVIDIA's Titan V is the most powerful option for workstation-level deep learning.
Even at $3,000, this card is a no-brainer for a researcher or scientist who operates at a smaller scale than the $150,000 NVIDIA DGX-1 server with its eight V100 GPUs demands, but still wants the ability to iterate quickly on models or train against a multitude of large datasets.
It's interesting to know that over at Anandtech they mentioned that the Volta-based Titan V does not make use of superscalar execution, so I guess that's some new information.
"1. New tensor cores
2. Removing the second warp scheduler dispatch unit & eliminating superscalar execution
3. Separating the Integer cores
4. Finer-grained thread scheduling" (1)
So according to Anandtech:
“The second big change that Volta brings to the table is that, at least for GV100, the second warp scheduler dispatch port has been eliminated. Ever since GF104 in 2011, NVIDIA’s architectures have featured two dispatch ports per warp scheduler, allowing for superscalar execution. In other words, their architecture has relied on a degree of instruction level parallelism, requiring the ability to execute a second, non-dependent instruction from a thread in order to get the most out of the hardware.
Volta/GV100, by contrast, is no longer superscalar. Each partition within an SM is now fed by a single dispatch unit warp scheduler, with no opportunity to extract ILP. This means that Volta is a pure thread level parallelism (TLP) design: max utilization comes from maximizing the number of threads active at any given time.
ILP versus TLP is a constant balance, and it’s not unusual to see NVIDIA shifting between the two, especially for a compute-centric GPU like GV100. ILP is nice to have, but extracting it can be difficult. On the other hand while GPUs are meant for embarrassingly parallel tasks, it’s not always easy to generate more threads. So there’s a very real question over whether the performance gains from adding the hardware for ILP justifies the power and complexity costs of doing so.” (1)
So the Titan V does not have superscalar execution, and that means the application software has to manage threading in such a way as to make the best use of the hardware. Also, I'm looking at these 32-bit Vega FE numbers and doubling them as a rough estimate of the 16-bit performance, but AMD does not have the Tensor cores to match Volta's tensor operations per second metrics. There will also be questions as to how Nvidia will deal with any gaming workloads that may have liked that removed second scheduler. And the Anandtech tables only show Volta having 96 ROPs: is this just for the Titan V, and what are the total ROP numbers available on GV100, as the Titan V is based off the GV100 die and not any GV102 variant that has yet to be announced?
Anandtech's article also goes into greater detail on Volta's new FP and integer unit arrangement, so without superscalar execution the shader core utilization rates and shader core IPC are going to be different on Volta compared to Pascal.
(1) "The NVIDIA Titan V Preview – Titanomachy: War of the Titans", page 2: "The Volta Architecture: In Brief". https://www.anandtech.com/show/12170/nvidia-titan-v-preview-titanomachy
387.34? Why not 388.71?
387.34 is the latest release for Linux.
Is this card for processing what CGP Grey is talking about in this video? https://www.youtube.com/watch?v=R9OHn5ZF4Uo
It does what the footnoted video talks about: https://www.youtube.com/watch?v=wvWpdrfoEv0
Do we have any information about how much of the silicon die area is taken up by CUDA cores vs. Tensor Cores? It would be interesting to look at these results normalised against die area as well as power.
E.g., if they did a chip with only Tensor Cores, designed for AI/machine learning/self-driving, how would it perform and scale?
Also wondering when they will have CUDA cores that can be used together for half/single/double precision as needed, to maximise flexibility and potential performance.
Dan
Tensor cores are not used for training; you are basically only seeing the gains of FP16.
It also seems like you have not built TensorFlow or Caffe2 yourself, which means you are not using cuDNN 7 or CUDA 9.
To actually measure the performance of Volta you'll currently have to build TF from source, as there isn't a binary release with CUDA 9 support yet, and for inference testing run the trained CG through TensorRT 3.
The latest TensorFlow NGC Docker containers are built with a later version of TF than is available as a binary download. These containers are built on CUDA 9 and explicitly support Volta natively.
NVIDIA has also confirmed to us that the Tensor Cores are in fact being used for training.
Why no Caffe2 for Vega? This is supported by Vega on Linux now, while TensorFlow is still under development for Vega.
As far as I can tell, there's no native OpenCL implementation of Caffe2. If you have a link, I'd love to take a look!
Maybe only Caffe (not 2) has reached “public” release according to
https://rocm.github.io/dl.html
And this is how Skynet is born.