Performance not two-die four.

How do you get past reticle limits? Split into multiple dies… maybe?

When designing an integrated circuit, you are attempting to fit as much complexity as possible within your budget of space, power, and so forth. One harsh limitation for GPUs is that, while your workloads could theoretically benefit from more and more processing units, the number of usable chips from a batch shrinks as designs grow, and the reticle limit of a fab’s manufacturing node is basically a brick wall.

What’s one way around it? Split your design across multiple dies!

NVIDIA published a research paper discussing just that. In their diagram, they show two examples. In the first diagram, the GPU is a single, typical die that’s surrounded by four stacks of HBM, like GP100; the second configuration breaks the GPU into five dies, four GPU modules and an I/O controller, with each GPU module attached to a pair of HBM stacks.

NVIDIA ran simulations to determine how this chip would perform, and, in various workloads, they found that it out-performed the largest possible single-chip GPU by about 45.5%. They scaled up the single-chip design until it had the same amount of compute units as the multi-die design, even though this wouldn’t work in the real world because no fab could actual lithograph it. Regardless, that hypothetical, impossible design was only ~10% faster than the actually-possible multi-chip one, showing that the overhead of splitting the design is only around that much, according to their simulation. It was also faster than the multi-card equivalent by 26.8%.

While NVIDIA’s simulations, run on 48 different benchmarks, have accounted for this, I still can’t visualize how this would work in an automated way. I don’t know how the design would automatically account for fetching data that’s associated with other GPU modules, as this would probably be a huge stall. That said, they spent quite a bit of time discussing how much bandwidth is required within the package, and figures of 768 GB/s to 3TB/s were mentioned, so it’s possible that it’s just the same tricks as fetching from global memory. The paper touches on the topic several times, but I didn’t really see anything explicit about what they were doing.

If you’ve been following the site over the last couple of months, you’ll note that this is basically the same as AMD is doing with Threadripper and EPYC. The main difference is that CPU cores are isolated, so sharing data between them is explicit. In fact, when that product was announced, I thought, “Huh, that would be cool for GPUs. I wonder if it’s possible, or if it would just end up being Crossfire / SLI.”

Apparently not? It should be possible?

I should note that I doubt this will be relevant for consumers. The GPU is the most expensive part of a graphics card. While the thought of four GP102-level chips working together sounds great for 4K (which is 4x1080p in resolution) gaming, quadrupling the expensive part sounds like a giant price-tag. That said, the market of GP100 (and the upcoming GV100) would pay five-plus digits for the absolute fastest compute device for deep-learning, scientific research, and so forth.

The only way I could see this working for gamers is if NVIDIA finds the sweet-spot for performance-to-yield (for a given node and time) and they scale their product stack with multiples of that. In that case, it might be cost-advantageous to hit some level of performance, versus trying to do it with a single, giant chip.

This is just my speculation, however. It’ll be interesting to see where this goes, whenever it does.