The G80 Architecture
Well, we’ve talked about what a unified architecture is and how Microsoft is using it in DX10 with all the new features and options available to game designers. But just what does NVIDIA’s unified G80 architecture look like?
All hail G80!! Well, um, okay. That’s a lot of pretty colors and boxes and lines and whatnot, but what does it all mean, and what has changed from the past? First, compared to the architecture of the G71 (GeForce 7900), which you can reference a block diagram of here, you’ll notice that there is one less “layer” of units to see and understand. Since we are moving from a split vertex/pixel architecture to a unified one, this makes sense. The eight blocks of processing units with the green and blue squares represent the unified architecture and handle pixel, vertex and geometry shading.
Even the setup stages at the top of the design are completely new, from the Host on down. The top layer of the architecture, which includes the “Vtx Thread Issue, Geom Thread Issue and Pixel Thread Issue” units, is part of the new thread processor and is responsible for maintaining the states of the numerous processing threads active at any one time and assigning (or issuing) them to processing units as needed. With this many processing units and this many threads, this unit is going to stay quite busy…
Okay, so how many are there already? There are 128 streaming processors running at 1.35 GHz, each accepting dual-issue MAD+MUL operations. These SPs (streaming processors) are fully decoupled from the rest of the GPU design, are fully unified and offer exceptional branching performance (hmm…). The 1.35 GHz clock rate is independent of the rest of the GPU, though all 128 SPs run off the same 1.35 GHz clock generator; in fact, you can even modify the SP clock rate separately from that of the GPU in the overclocking control panel! The new scalar architecture of the SPs allows longer shaders to run more efficiently when compared to the vector architecture of the G70 and all previous NVIDIA designs.
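The peak shader throughput implied by those numbers can be worked out on the back of an envelope. The sketch below assumes the common convention of counting a MAD as two floating-point operations and the co-issued MUL as one:

```python
# Peak shader throughput for G80, assuming each SP can dual-issue
# a MAD (counted as 2 flops) plus a MUL (1 flop) every clock.
NUM_SPS = 128
SHADER_CLOCK_GHZ = 1.35
FLOPS_PER_SP_PER_CLOCK = 2 + 1  # MAD = 2 flops, co-issued MUL = 1

peak_gflops = NUM_SPS * SHADER_CLOCK_GHZ * FLOPS_PER_SP_PER_CLOCK
print(f"Peak shader throughput: {peak_gflops:.1f} GFLOPS")  # 518.4 GFLOPS
```

Counting only the MAD units (ignoring the co-issued MUL) still yields 345.6 GFLOPS, which is why you will see both figures quoted for this chip.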
The L1 cache shown in the diagram is shared between 16 SPs in each block, essentially allowing these 16 units to communicate with each other in the stream output manner we discussed in the DX10 section.
Looking at the raw numbers, you can see that the GeForce 8800’s SP architecture provides some impressive processing power, resulting in more than double the “multiplies” that the G71 or R580 could muster. We also found out that the new G80 SPs are one to two orders of magnitude faster at branching than the G71 was; this should scare ATI, since branching power was one of the reasons the R580 architecture was able to hold off the 7900 series for as long as it did.
In previous generations of NVIDIA’s hardware, the texture units actually used a small portion of the pixel shader in order to avoid duplicating some hardware. This had the potential to create “bubbles” in GPU processing, as illustrated in the GeForce 7-series diagram above: math operations often had to wait for the texture units to complete their work before continuing. That is no longer the case with G80; there are literally thousands of threads in flight at any given time, allowing memory accesses to be completely decoupled from processing work. This keeps those “bubbles” from occurring in the new design, allowing for seemingly faster memory access times.
This threading process has been dubbed “GigaThreading” by NVIDIA and refers to the extremely high number of threads active at any given time. In a CPU, when a cache miss occurs, the CPU usually has to wait for that data to be retrieved, and the thread stalls while it waits. On the G80, a data cache miss isn’t nearly so severe, as there are many other threads ready to be placed into one of the 128 SPs while the missing data is retrieved. And in case you were wondering what all this thread swapping might add in overhead, NVIDIA told us that it technically takes zero clocks for threads to swap!
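The latency-hiding idea behind GigaThreading can be shown with a toy model. Every number below is invented purely for illustration (not an NVIDIA figure), but the principle matches what the hardware does: with enough ready threads, memory latency overlaps with useful math instead of stalling the machine.

```python
# Toy model of latency hiding via massive threading.
# All numbers are hypothetical, chosen only for illustration.
MEM_LATENCY = 200      # cycles for a cache miss to resolve
WORK_PER_THREAD = 10   # cycles of math a thread does before missing
NUM_THREADS = 1000

def cycles_single_thread():
    # One thread: every cache miss is a full stall.
    return NUM_THREADS * (WORK_PER_THREAD + MEM_LATENCY)

def cycles_many_threads():
    # Many threads: while one waits on memory, others compute.
    # If there are enough threads to cover the latency, total time
    # is roughly all the math plus one trailing miss at the end.
    threads_to_cover_latency = MEM_LATENCY // WORK_PER_THREAD + 1
    if NUM_THREADS >= threads_to_cover_latency:
        return NUM_THREADS * WORK_PER_THREAD + MEM_LATENCY
    return cycles_single_thread()

print(cycles_single_thread())  # 210000 cycles: mostly stalled
print(cycles_many_threads())   # 10200 cycles: latency mostly hidden
```

In this (admittedly idealized) model, the heavily threaded machine finishes roughly 20x sooner, which is the whole point of keeping thousands of threads in flight.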
Moving on to the texture units on the G80, there are two texture filter (TF) units for every texture address (TA) unit; this allows for a total of 32 pixels per clock of texture addressing and 64 pixels per clock of texture filtering ops. These units have been optimized for HDR processing and operate at full FP32 precision, though they support FP16 as well. Because of all this power, the G80 can essentially get 2x anisotropic filtering for free, as well as FP16 HDR for free.
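That “free” claim falls straight out of the 2:1 ratio of filter units to address units. Here is a minimal sketch of the arithmetic (a simplified throughput model, not a description of the actual scheduling hardware): 2x AF and FP16 filtering both need twice the filtering work per texel, and G80 has exactly twice the filtering hardware to absorb it.

```python
# Simplified texture throughput model for G80: pixels per clock are
# limited by whichever unit type saturates first.
TA_PER_CLOCK = 32   # texture address units
TF_PER_CLOCK = 64   # texture filter units (2:1 ratio over TA)

def texture_rate(filter_ops_per_pixel):
    # Rate is capped by addressing OR filtering, whichever runs out.
    return min(TA_PER_CLOCK, TF_PER_CLOCK // filter_ops_per_pixel)

print(texture_rate(1))  # 32 ppc: plain bilinear, address-limited
print(texture_rate(2))  # 32 ppc: 2x AF or FP16 filtering, no slowdown
print(texture_rate(4))  # 16 ppc: heavier filtering finally costs
```

A chip with a 1:1 TF:TA ratio would drop to half rate the moment 2x AF is enabled, which is exactly the roughly 50% hit the previous generation showed.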
This small table compares the 7900 GTX, X1950 XTX and 8800 GTX in terms of texture fill rates: with 32-bit dual texturing, the X1950 XTX and 7900 GTX retain only about 50% of their performance once 2x AF is enabled, while the 8800 GTX maintains 100%, and the 8800 GTX’s 16x AF rate is 3.6x faster than the X1950 XTX’s.
The ROPs on G80 have changed a bit from the G71 architecture as well, starting with support for up to 16 samples for AA; however, these are not programmable sample patterns like those on ATI’s X1950 architecture; NVIDIA is still using static, rotated-grid sample patterns. As if we would allow NVIDIA to do otherwise, antialiasing is supported with HDR! The ROPs can handle up to 16 color samples and 16 Z samples per partition, with up to 32 pixels per clock of Z-only work per partition. The color and Z compression designs have been improved by a factor of two, and the ROPs now support 8 render targets.
With six ROP and Z partitions available, that gives the G80 a total of 96 AA samples and 96 Z samples per clock, as well as 192 pixels per clock of Z-only work. Each ROP partition also has a 64-bit interface to the frame buffer; do the math and you’ll come up with an odd-sounding 384-bit total memory interface between the GPU (and its ROPs) and the memory on the board. Each of those 64-bit interfaces is attached to 128MB of memory, totaling 768MB of frame buffer. Yes, the numbers are odd; they aren’t the nice round numbers we are used to. But there is no trick to it as many had thought; NVIDIA isn’t segregating some portion for vertex and some for pixel, or anything like that.
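All of those odd-sounding totals fall out of one design decision, six identical partitions, as the quick sketch below shows:

```python
# How the "odd" G80 totals fall out of six identical ROP/memory
# partitions, each with a 64-bit path to its own 128MB of memory.
PARTITIONS = 6
BITS_PER_PARTITION = 64
MB_PER_PARTITION = 128
AA_SAMPLES_PER_PARTITION = 16
Z_ONLY_PPC_PER_PARTITION = 32

print(PARTITIONS * BITS_PER_PARTITION)        # 384-bit total memory bus
print(PARTITIONS * MB_PER_PARTITION)          # 768MB total frame buffer
print(PARTITIONS * AA_SAMPLES_PER_PARTITION)  # 96 AA samples per clock
print(PARTITIONS * Z_ONLY_PPC_PER_PARTITION)  # 192 Z-only pixels per clock
```

Scale the partition count and every one of these figures scales with it, which is exactly how NVIDIA later cut down G80 into smaller parts.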
The Z-culling process on the G80 has been improved drastically as well, including the ability to remove pixels that are not visible BEFORE they are processed. NVIDIA was pretty tight-lipped about exactly how this is done, but keeping in mind that the NVIDIA driver uses a just-in-time compiler before passing instructions on to the GPU, it’s possible that the compiler is doing some work to help the GPU out here. Either way, the more capable the Z-culling on the core, the less work the GPU has to do per frame, improving performance and gameplay.