NVIDIA GT200 Architecture
Part 1 of our 2-part story on the release of the NVIDIA GT200 GPU starts where you would think it would – architecture analysis and then some heavy hitting gameplay testing. Do the GTX 280 1GB and GTX 260 896MB cards stand up well to existing competition like the 9800 GX2 and 3870 X2?
Introduction
It’s been a long time coming – despite NVIDIA’s assurances that G92 was more than just a die shrink of the well accepted G80 architecture, we knew better. G92 was, and continues to be, a great GPU for your dollar, but we are hungry for something new, something that will really push gaming into the next generation. As it turns out, both NVIDIA and AMD have something planned for the month of June, but NVIDIA’s new GT200 GPU is the first up to bat. The NVIDIA GeForce GTX 200 series graphics cards based on this design are incredibly powerful and incredibly expensive. Be prepared to be impressed.
The GT200 Parallel Processing Architecture
One of the big changes to NVIDIA’s presentation at the most recent technology summit was how the company promoted and positioned the GT200 design. Going back nearly a decade to when these events started, there was no doubt you were there to see a graphics card and GPU at work, but now, with the well-documented public fight with Intel looming over everything the company does, the angle has shifted. With the chip now being called a “parallel processor” more often than a GPU, NVIDIA is deathly serious about pushing the premise of the GT200 being used for more than just gaming: video encoding, high-performance computing, folding and more. We are covering all of these aspects of the GT200 in a separate article: Moving Away From Just a GPU.
At its heart, though, the GT200 shares a lot in common with the theory and design of the G80 architecture. It is NVIDIA’s second generation of unified shader design, their second GPU to use HybridPower technologies and the second to offer 3-Way SLI to wealthy gamers. The GT200 does have new tricks though, including a drastically increased number of shader cores and double precision floating point support, and it is also much, much bigger than its predecessor.
Built on TSMC’s 65nm process technology (at least for now), the 1.4 billion transistor NVIDIA GT200 is the largest chip the manufacturer has ever built. The shaders can produce as much as 933 GigaFLOPS of horsepower at top reference speeds while maintaining impressively low idle power consumption. But packing 1.4 billion transistors into a 65nm design makes this one BIG chip: XX mm^2 to be exact. When the chip was being held up for display by a GeForce product manager, NVIDIA CEO Jen-Hsun Huang jokingly said, “That chip’s as big as my head!” An exaggeration to be sure, but an interesting one with important business implications we’ll discuss in our conclusion.
Below is a complete (as complete as NVIDIA has revealed) block diagram of what makes the GT200 tick:
All of those squares and colored sections are very specific components of the core that give the design such power. No doubt frequent readers of our articles who are into the detailed, technical discussion here will recognize some key features: the top setup block, lots and lots of stream processors, ROPs, memory controllers and more.
Let’s zoom in on one of the ten blocks of shader processors for a closer look:
Each of the ten divisions is made up of 24 separate shader processors, bringing the GPU total to 240 SPs for those counting at home. The collection of 24 shaders is divided again into three sets of eight; the 8 SPs in each set share a small 16K block of localized memory for data sharing. The larger L1 cache is shared between all 24 of the shader units and is used to increase memory performance and bandwidth in much the same way a primary L1 or L2 cache is used on a CPU.
Moving in even closer, we can take a look at each of the individual SPs and find a doubled register file (to help with ever-increasing shader program sizes) and three arithmetic units: one for FP, one for integer and one for moves and comparisons. This makes the GT200 a dual-issue design, much like the G80 was, that can handle both a MAD and a MUL operation in a single clock. Counting the MAD as two floating point operations and the MUL as one puts the new design at 3 FLOPS per core per clock.
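For those who like to check the math, the SP count and the peak shader figure quoted earlier can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 1296 MHz shader clock is the GTX 280 reference value, which is not stated in this article and is treated as an assumption here, and the function and constant names are purely illustrative.

```python
# Back-of-the-envelope check of the GT200 shader numbers quoted in the text.
CLUSTERS = 10                 # ten blocks of shader processors
SETS_PER_CLUSTER = 3          # each block is split into three sets
SPS_PER_SET = 8               # eight SPs per set
FLOPS_PER_SP_PER_CLOCK = 3    # MAD (2 flops) + MUL (1 flop) issued together
SHADER_CLOCK_MHZ = 1296       # ASSUMPTION: GTX 280 reference shader clock


def total_sps(clusters=CLUSTERS, sets=SETS_PER_CLUSTER, sps=SPS_PER_SET):
    """Total number of shader processors on the chip."""
    return clusters * sets * sps


def peak_gflops(sp_count, flops_per_clock, clock_mhz):
    """Theoretical peak single precision throughput in GigaFLOPS."""
    return sp_count * flops_per_clock * clock_mhz / 1000.0


if __name__ == "__main__":
    sps = total_sps()
    gflops = peak_gflops(sps, FLOPS_PER_SP_PER_CLOCK, SHADER_CLOCK_MHZ)
    print(f"{sps} SPs, about {gflops:.0f} GFLOPS peak")
```

Under that clock assumption the result lands on the 933 GigaFLOPS figure NVIDIA quotes. Keep in mind this is a theoretical ceiling that assumes every SP issues a MAD and a MUL on every clock.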
The shaders on the GT200 are unique, though, in that they also contain a floating point unit capable of handling double precision computing that is completely separate from the FP unit illustrated above. This feature probably won’t be used for gaming purposes at all, since single precision calculations are more than adequate for any visual representations, but it IS crucial for NVIDIA to have double precision support in the HPC market.
The filtering section of the design, shown as the red and black section at the bottom of the full block diagram, is updated as well for GT200. There are now 64 total samples for color and Z on each clock, divided up into eight segments of 8 each.
What does it all add up to? Here is a comparison of the theoretical bandwidth and performance numbers comparing GT200 to G80:
The GT200 has large advantages in ROP performance – the new GPU is able to handle 32 pixels per clock compared to the G80, which could only handle 12. That same performance increase is seen in the ROP blending numbers, which put the GT200 at 19 giga-blends/s while the G80 pulled in only 7 giga-blends/s. The increase in cores (from 128 to 240) and in frame buffer bandwidth gives the new GT200 chip a huge leap forward in theoretical performance.
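The blend rates above follow from multiplying pixels per clock by the core clock. The sketch below shows that arithmetic; the 575 MHz (8800 GTX) and 602 MHz (GTX 280) core clocks are reference values that this article does not list, so treat them as assumptions, and the helper name is made up for illustration.

```python
# Rough check of the ROP blend rates: blends/s = pixels per clock * core clock.
# ASSUMPTION: reference core clocks of 575 MHz (8800 GTX) and 602 MHz (GTX 280);
# neither clock is given in the article itself.
GPUS = {
    "G80": {"pixels_per_clock": 12, "core_clock_mhz": 575},
    "GT200": {"pixels_per_clock": 32, "core_clock_mhz": 602},
}


def giga_blends_per_sec(pixels_per_clock, core_clock_mhz):
    """Theoretical blend rate in billions of blends per second."""
    return pixels_per_clock * core_clock_mhz / 1000.0


if __name__ == "__main__":
    for name, gpu in GPUS.items():
        rate = giga_blends_per_sec(gpu["pixels_per_clock"], gpu["core_clock_mhz"])
        print(f"{name}: {rate:.1f} giga-blends/s")
```

With those clocks, the results round to the 7 and 19 giga-blends/s figures quoted above.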
One area that did not increase much was the texture addressing performance: it jumped from 64 texels/clock to 80 texels/clock. Why did NVIDIA leave this area largely unchanged, at least in comparison to pixel shading performance? With each architecture design, GPU companies have to essentially “guess” which type of horsepower programmers will be using for years to come: shader processing or texture processing. Before the advent of unified shaders this task was even more difficult, as designers also had to decide between pixel and vertex shaders, but it is still a risk. With G80, NVIDIA admitted that they overplayed their hand with regards to texture performance – they put more in than game developers needed, sacrificing some shader power to do so. AMD leaned more heavily in favor of shader power with their RV670 design, which is why they consistently had a performance lead in synthetic shader processing benchmarks. The GT200 attempts to correct that by moving the ratio of floating point power to texture power from 14:1 on the G80 to 19.4:1. We will soon see if that decision pans out as they hoped.
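The 14:1 and 19.4:1 ratios can be approximated by dividing peak GFLOPS by peak texel rate. The sketch below is one plausible way to arrive at them, not necessarily how NVIDIA computed them. The SP counts and texels per clock come from the text; the core and shader clocks (575/1350 MHz for the 8800 GTX, 602/1296 MHz for the GTX 280) are reference values assumed here, and all names are illustrative.

```python
# Approximate the shader-to-texture ratio: peak GFLOPS / peak GTexels per second.
# SP counts and texels/clock are taken from the article.
# ASSUMPTION: reference clocks (MHz) for the 8800 GTX and GTX 280; not in the article.
GPUS = {
    "G80": {"sps": 128, "shader_mhz": 1350, "texels_per_clock": 64, "core_mhz": 575},
    "GT200": {"sps": 240, "shader_mhz": 1296, "texels_per_clock": 80, "core_mhz": 602},
}
FLOPS_PER_SP_PER_CLOCK = 3  # MAD + MUL per clock, as described in the article


def gflops(gpu):
    """Theoretical peak shader throughput in GFLOPS."""
    return gpu["sps"] * FLOPS_PER_SP_PER_CLOCK * gpu["shader_mhz"] / 1000.0


def gtexels(gpu):
    """Theoretical peak texel rate in GTexels/s."""
    return gpu["texels_per_clock"] * gpu["core_mhz"] / 1000.0


def flops_to_texel_ratio(gpu):
    """Floating point operations available per texel fetched."""
    return gflops(gpu) / gtexels(gpu)


if __name__ == "__main__":
    for name, gpu in GPUS.items():
        print(f"{name}: {flops_to_texel_ratio(gpu):.1f} : 1")
```

Under these assumptions the ratios come out close to the quoted values. They also show why the shift is so pronounced: the SP count nearly doubles while texels per clock grow by only 25%.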
Not only does NVIDIA claim the GT200 has higher theoretical maximums in those key areas of processing and bandwidth, but they also say the GT200 pushes the limits on efficiency as well – basically, how closely those maximums can be approached in practice. Our 3DMark06 and 3DMark Vantage GPU tests should be able to detail this more closely.
Another area of improvement on the GT200 core is in geometry shading. Thanks to a 6x increase in the output buffer compared to G80, NVIDIA can now claim to have geometry shading performance above any current AMD solution – the 8800 GTX was a far lower performer previously: