The Big (Architectural) Dig

The GF100 chip comprises, from largest to smallest, four Graphics Processing Clusters (GPCs), sixteen streaming multiprocessors (SMs), and 512 CUDA cores. While GF100 is still built on the CUDA-core concept of the earlier G80 and GT200 architectures, not a single unit has carried over from those chips unchanged.
The primary working unit of the GF100 is the CUDA core. As mentioned above, it is significantly changed from the earlier scalar units that carried the same name. The foremost change is that each unit now features not only a fully pipelined floating point unit (FPU) but also a fully pipelined integer arithmetic logic unit (ALU). Each CUDA core is fully IEEE 754-2008 compliant and adds fused multiply-add (FMA) support for both single and double precision arithmetic.
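The point of FMA is that a * b + c is rounded once, at the end, instead of once after the multiply and again after the add. A minimal Python sketch of why that matters (illustrative only; the GF100 does this in hardware):

```python
from fractions import Fraction

# a * b is exactly 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two-step multiply-then-add: rounding a*b first destroys the answer.
two_step = (a * b) + c  # a*b rounds to 1.0, so the sum collapses to 0.0

# Exact arithmetic shows what a single-rounding FMA would preserve.
exact = Fraction(a) * Fraction(b) + Fraction(c)  # exactly -2**-60

print(two_step)      # 0.0
print(float(exact))  # -8.673617379884035e-19  (i.e. -2**-60)
```

The fused version keeps the low-order bits of the product alive through the add, which is why FMA is the workhorse of dot products and matrix math on GPUs.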
Get used to seeing this functional unit on budget and midrange products that will show up after the GF100 is launched.
The ALU is a major change, as previous generations offered only limited integer capability. The advantages NVIDIA brings to the table here are nicely summed up in its white paper detailing the GF100 architecture:
“In GF100, the newly designed Integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.”
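Two of the instructions the quote lists, bit-field extract and population count, are easy to model in software. A hypothetical Python sketch of their semantics (the hardware performs each as a single instruction):

```python
def bitfield_extract(value, pos, length):
    """Extract `length` bits of `value` starting at bit `pos` (LSB = bit 0)."""
    return (value >> pos) & ((1 << length) - 1)

def popcount(value):
    """Count the set bits in a 32-bit value."""
    return bin(value & 0xFFFFFFFF).count("1")

x = 0b1011_0110_0000_0000
print(bitfield_extract(x, 9, 4))  # bits 12..9 -> 0b1011 -> 11
print(popcount(x))                # 5
```

Done in software, each of these is a handful of shifts, masks, and loop iterations; a native 32-bit integer ALU collapses them to one operation, which is exactly the kind of win GPGPU code cares about.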
NVIDIA has made its GPU far more CPU-like with the addition of fully supported integer functionality. This will have little effect on gaming initially, but further down the road we may see more and more functionality once reserved for the CPU implemented directly on the GPU. Where it will make a huge initial splash is in GPGPU circles, where the GF100 can be programmed directly in C++-level languages.
Two warp schedulers and two dispatch units keep the scalar CUDA cores, Load/Store units, and Special Function units at a high rate of utilization.
The next step up is the streaming multiprocessor (SM). Each SM comprises 32 CUDA cores, but it does not stop there. This is probably where the most work was done in terms of functional unit organization, with an eye toward efficiency and throughput. The SM now looks a lot more like a hugely parallel traditional CPU. It has a dedicated instruction cache that feeds two warp schedulers and two dispatch units, which in turn draw on the register file to feed the CUDA cores, Load/Store units, and Special Function units. Sixteen Load/Store units can calculate source and destination addresses for sixteen threads per clock. A warp is a group of 32 threads, and because the two warp schedulers each issue from an independent warp, instructions in one warp do not stall on dependencies in the other. This keeps utilization high across the LD/ST units, CUDA cores, and Special Function units.
There are four SFUs in each SM, and each can execute transcendental instructions such as sine, cosine, reciprocal, and square root. Graphics interpolation instructions can also run on the SFUs. Each SFU executes one instruction per thread, per clock, so a full warp sent to the SFUs takes 8 clocks to complete. The SFUs, LD/ST units, and CUDA cores are all decoupled from the dispatch unit, which can therefore issue to one execution unit without waiting for another to finish its work. For example, a 32-thread warp can be in flight on the CUDA cores while a full warp is being worked on by the SFUs, all while the LD/ST units handle source and destination addresses for warps that have been scheduled but not yet dispatched.
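The SFU and LD/ST timings follow directly from warp width divided by lane count. A back-of-the-envelope Python model, using only the lane counts given in this section:

```python
import math

WARP_SIZE = 32  # threads per warp

# Execution lanes per SM for each unit class, as described in this section.
lanes = {"load_store": 16, "sfu": 4}

def clocks_per_warp(unit):
    """Clocks for one full warp to pass through a unit class."""
    return math.ceil(WARP_SIZE / lanes[unit])

for unit in lanes:
    print(unit, clocks_per_warp(unit))
# load_store 2
# sfu 8
```

The asymmetry is deliberate: transcendentals are rare enough in most shaders that four SFU lanes suffice, and decoupled dispatch lets the slow 8-clock SFU warps overlap with work on the other units rather than blocking it.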