So… this is probably not for your home.
NVIDIA has just announced its latest pre-built system for enterprise customers: the DGX-2. In it, sixteen Volta-based Tesla V100 graphics devices are connected using NVSwitch. This gives each GPU 300GB/s of bandwidth to and from every other GPU in the system, which, to give a sense of scale, is about as much bandwidth as the GTX 1080 has available to its own VRAM. NVSwitch treats all 512GB as a unified memory space, too, which means that the developer doesn’t need to keep redundant copies of data across multiple boards just so it can be seen by the target GPU.
Note: 512GB is 16 x 32GB. This is not a typo. 32GB Tesla V100s are now available.
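To make the “no redundant copies” point concrete, here is a minimal CUDA sketch of a kernel on one GPU reading a buffer that physically lives on another GPU once peer access is enabled. It is not DGX-2-specific code – the device ordinals and toy kernel are purely illustrative – but on a DGX-2 these remote reads are what ride over NVSwitch.

```cuda
// Minimal sketch: a buffer allocated on GPU 0 is read directly by a kernel
// running on GPU 1. Assumes at least two peer-capable GPUs are visible.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_remote(const float* data, float* out, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += data[i];   // reads go over the GPU interconnect
    *out = acc;
}

int main() {
    const int n = 1 << 20;
    float *buf = nullptr, *result = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&buf, n * sizeof(float));          // lives in GPU 0's HBM2
    cudaMemset(buf, 0, n * sizeof(float));

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);             // map GPU 0's memory into GPU 1's address space
    cudaMalloc(&result, sizeof(float));

    sum_remote<<<1, 1>>>(buf, result, n);         // GPU 1 reads GPU 0's buffer directly, no staging copy
    cudaDeviceSynchronize();

    float host = 0.0f;
    cudaMemcpy(&host, result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", host);
    return 0;
}
```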
For a little recap, Tesla V100 cards run a Volta-based GV100 GPU, which has 5120 CUDA cores and delivers ~15 TeraFLOPs of 32-bit performance. FP64 throughput is exactly half the FP32 rate and FP16 exactly double it, as has been the case since Pascal’s high-end GP100, leading to ~7.5 TeraFLOPs of 64-bit or ~30 TeraFLOPs of 16-bit computational throughput. Multiply that by sixteen and you get 480 TeraFLOPs of FP16, 240 TeraFLOPs of FP32, or 120 TeraFLOPs of FP64 performance for the whole system. If you count the tensor units, then we’re just under 2 PetaFLOPs of tensor instructions. This is powered by a pair of Xeon Platinum CPUs (Skylake) and backed by 1.5TB of system RAM – which is only 3x the amount of RAM that the GPUs have, if you stop and think about it.
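For those who want to see where those numbers come from, here is a quick back-of-envelope calculation. The ~1.53GHz boost clock is my assumption for round numbers, which is why the totals land slightly above the rounded per-GPU figures quoted above.

```cuda
/* Back-of-envelope theoretical peaks for one GV100 and the 16-GPU DGX-2.
 * The ~1.53 GHz boost clock is an assumption, not an NVIDIA spec sheet quote. */
#include <cstdio>

int main() {
    const double clock_ghz    = 1.53;   /* assumed boost clock */
    const double cuda_cores   = 5120;   /* FP32 lanes per GV100 */
    const double tensor_cores = 640;    /* 8 per SM x 80 SMs */

    double fp32   = cuda_cores * 2 * clock_ghz / 1000.0;         /* FMA = 2 FLOPs -> TFLOPS */
    double fp64   = fp32 / 2;                                     /* half rate */
    double fp16   = fp32 * 2;                                     /* double rate */
    double tensor = tensor_cores * 64 * 2 * clock_ghz / 1000.0;   /* 64 FMAs per clock per tensor core */

    printf("per GPU : FP32 %.1f  FP64 %.1f  FP16 %.1f  tensor %.0f TFLOPS\n",
           fp32, fp64, fp16, tensor);
    printf("x16 GPUs: FP32 %.0f  FP64 %.0f  FP16 %.0f  tensor %.0f TFLOPS\n",
           16 * fp32, 16 * fp64, 16 * fp16, 16 * tensor);
    return 0;
}
```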
The device communicates with the outside world through eight EDR InfiniBand NICs. NVIDIA claims that this yields 1600 gigabits per second of bi-directional bandwidth, which checks out: eight ports at 100Gb/s each, counted in both directions. Given how much data this device is crunching, it makes sense to keep data flowing in and out as fast as possible, especially for real-time applications. While the Xeons are fast and have many cores, I’m curious to see how much overhead the networking adds to the system when under full load, minus any actual processing.
NVIDIA’s DGX-2 is expected to ship in Q3.
“I’m curious to see how much overhead the networking adds to the system when under full load, minus any actual processing.”
Isn’t the whole point of NVSwitch to offload and therefore eliminate or minimize network communication overhead? Similar to the Assistant Cores in the Fujitsu PrimeHPC FX100’s SPARC64 XIfx?
Won’t this allow larger scalability with no apparent cost? Or am I misreading what NVSwitch does?
I should have added that all this shared-memory scale-up should give SGI (I’m not defiling their memory by calling them HPE) a run for their money.
You just know the folks that may be using these things are looking at things like GFLOPS/watt (DP, SP, HP/tensor) and the total cost of ownership (TCO), and that includes any up-front hardware costs amortized over time.
Maybe the big trading houses will look if the latency is low enough to keep ahead on the trades, and maybe others, like the pharma folks, will look at these systems too. But for the majority of the cloud services businesses there may not be that much of a need for too many of these systems outside of some limited AI/training sorts of usage. Google’s second generation of TPUs (64 four-chip modules are assembled into 256-chip pods with 11.5 PFLOPS of performance) and their GFLOPS/watt or TFLOPS/watt and cost per GFLOPS/TFLOPS are also going to be considered. AMD’s GPUs can also do the 16-bit math (full 16-bit math, not just 16-bit results), but AMD does not have any tensor cores as of yet; their 7nm Vega 20 products are expected Q3 2018 also.