So… this is probably not for your home.
NVIDIA has just announced their latest pre-built system for enterprise customers: the DGX-2. In it, sixteen Volta-based Tesla V100 graphics devices are connected using NVSwitch. This allows groups of graphics cards to communicate to and from every other group at 300GB/s, which, to give a sense of scale, is about as much bandwidth as the GTX 1080 has available to communicate with its own VRAM. NVSwitch treats all 512GB as a unified memory space, too, which means that the developer doesn’t need to keep redundant copies of data across multiple boards just so that data can be seen by the target GPU.
Note: 512GB is 16 x 32GB. This is not a typo. 32GB Tesla V100s are now available.
For a little recap, Tesla V100 cards run a Volta-based GV100 GPU, which has 5120 CUDA cores and runs them at ~15 TeraFLOPs of 32-bit performance. Each of these cores also scales exactly to FP64 and FP16, as has been the case since Pascal’s high-end offering, leading to ~7.5 TeraFLOPs of 64-bit or ~30 TeraFLOPs of 16-bit computational throughput. Multiply that by sixteen and you get 480 TeraFLOPs of FP16, 240 TeraFLOPs of FP32, or 120 TeraFLOPs of FP64 performance for the whole system. If you count the tensor units, then we’re just under 2 PetaFLOPs of tensor instructions. This is powered by a pair of Xeon Platinum CPUs (Skylake) and backed by 1.5TB of system RAM, which is only 3x the amount of RAM that the GPUs have if you stop and think about it.
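If you want to check the math, the system-wide figures are just the per-GPU numbers multiplied by sixteen. Here is a minimal Python sketch using the approximate values quoted in this post. The variable names are my own, the inputs are rounded estimates rather than measured results, and I'm assuming 1TB = 1024GB for the RAM comparison.

```python
# Approximate per-GPU figures quoted in this post (rounded, not benchmarks).
GPU_COUNT = 16
VRAM_PER_GPU_GB = 32
SYSTEM_RAM_GB = 1.5 * 1024  # 1.5TB, assuming 1TB = 1024GB

PER_GPU_TFLOPS = {"FP16": 30.0, "FP32": 15.0, "FP64": 7.5}

# Scale each precision by the number of GPUs in the box.
system_tflops = {p: t * GPU_COUNT for p, t in PER_GPU_TFLOPS.items()}

# Pooled GPU memory and how it compares to system RAM.
total_vram_gb = GPU_COUNT * VRAM_PER_GPU_GB
ram_to_vram_ratio = SYSTEM_RAM_GB / total_vram_gb

for precision, tflops in system_tflops.items():
    print(f"{precision}: ~{tflops:.0f} TeraFLOPs")
print(f"Total GPU memory: {total_vram_gb}GB")
print(f"System RAM is {ram_to_vram_ratio:.0f}x the GPU memory")
```

These are peak theoretical numbers; real workloads will land somewhere below them.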
The device communicates with the outside world through eight EDR InfiniBand NICs. NVIDIA claims that this yields 1600 gigabits per second of bi-directional bandwidth. Given how much data this device is crunching, it makes sense to keep data flowing in and out as fast as possible, especially for real-time applications. While the Xeons are fast and have many cores, I’m curious to see how much overhead the networking adds to the system when under full load, minus any actual processing.
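The 1600 gigabit figure lines up with the nominal EDR rate. As a rough sketch, assuming each of the eight NICs exposes a single 4x EDR port at 100 gigabits per second in each direction (my assumption; the announcement only gives the total), the arithmetic looks like this:

```python
NIC_COUNT = 8
EDR_GBPS_PER_DIRECTION = 100  # assumption: one 4x EDR port per NIC

# Aggregate bandwidth in one direction, then both directions combined.
one_way_gbps = NIC_COUNT * EDR_GBPS_PER_DIRECTION
bidirectional_gbps = 2 * one_way_gbps

# Convert gigabits to gigabytes (8 bits per byte) for comparison with NVSwitch.
bidirectional_gigabytes_per_s = bidirectional_gbps / 8

print(f"One direction: {one_way_gbps} Gb/s")
print(f"Bi-directional: {bidirectional_gbps} Gb/s")
print(f"That is {bidirectional_gigabytes_per_s:.0f} GB/s")
```

In other words, the whole box talks to the network at less than the 300GB/s that GPU groups get between each other over NVSwitch.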
NVIDIA’s DGX-2 is expected to ship in Q3.