has a new article up that deep-dives into the world of GPGPU computing and NVIDIA's GT200 architecture. Keep in mind this is not an article for the faint of heart – if lines like "Each cycle the issue logic selects and forwards the highest priority 'ready to execute' warp instruction from the buffer. Prioritization is determined with a round-robin algorithm between the 32 warps that also accounts for warp type, instruction type and other factors" don't fly with you, get your Wikipedia links warmed up.
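
The scheduling described in that quote can be sketched in a few lines of Python. This is purely illustrative – the function name, the ready-set representation, and the plain round-robin scan are assumptions for clarity; the real hardware also weighs warp type and instruction type, which this sketch omits.

```python
# Hypothetical sketch of per-cycle warp selection: rotate priority
# round-robin across the 32 resident warps, starting just after the
# warp that issued last cycle. Not NVIDIA's actual design.

NUM_WARPS = 32

def select_warp(ready, last_issued):
    """ready: set of warp ids with a 'ready to execute' instruction
    buffered. Returns the next warp to issue, or None if none ready."""
    for offset in range(1, NUM_WARPS + 1):
        warp = (last_issued + offset) % NUM_WARPS
        if warp in ready:
            return warp
    return None  # issue slot goes idle this cycle

# If warps 3 and 7 are ready and warp 5 issued last, warp 7 wins:
print(select_warp({3, 7}, last_issued=5))  # 7
```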

The piece is definitely worth a read though for anyone interested in how the GT200 and CUDA function. 

NVIDIA's GT200: Inside a Parallel Processor
Figure 3 above shows the system architecture of three throughput-oriented processors: the G80, the GT200 and Niagara II. Note that the caches in the two GPUs are read-only texture caches, rather than the fully coherent caches in Niagara II. The GT200 frame buffer memory interface is 512 bits wide, composed of eight 64-bit GDDR3 memory controllers, compared to a 384-bit wide interface on the previous generation. The memory bandwidth varies across different models, but peaks at 141.7 GB/s when the memory controller and memory are running at 1107 MHz, approximately 65% higher than the previous generation. On top of a wider and higher bandwidth memory interface, the GDDR3 memory controller coalesces a much greater variety of memory access patterns, improving the efficiency as well as peak performance.
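
The 141.7 GB/s figure falls straight out of the numbers in the excerpt, assuming GDDR3's double-data-rate transfers (two per clock edge pair):

```python
# Peak memory bandwidth of the GT200 from the quoted specs.
bus_width_bits = 512          # eight 64-bit GDDR3 memory controllers
clock_hz = 1107e6             # 1107 MHz memory clock
transfers_per_cycle = 2       # GDDR3 is double data rate

bytes_per_sec = (bus_width_bits / 8) * clock_hz * transfers_per_cycle
print(f"{bytes_per_sec / 1e9:.1f} GB/s")  # 141.7 GB/s
```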