The L2 Cache and Compute
NVIDIA has implemented a large 768 KB L2 cache that is shared between the four GPCs. In previous architectures the L2 cache was used primarily for read-only texture information. With GF100 we now see a fully functional L2 cache that can be written to by the GPCs, raster units, and ROPs. This should improve overall memory efficiency, as commonly used instructions and data can be accessed in L2 cache very quickly.
We can see with this cache structure that main memory reads and writes are cut down dramatically. This results in reduced latency for many reads and writes, which translates to better overall performance.
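To put rough numbers on that claim, here is a minimal sketch of the standard average-access-time calculation for a single cache level sitting in front of main memory. The latency figures are hypothetical placeholders chosen for illustration, not measured GF100 values, and the function name is our own.

```python
def average_access_time(hit_rate, cache_latency, memory_latency):
    """Expected latency of one access with a single cache level in front of DRAM."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits are served from the cache; misses pay the cache lookup
    # plus the full trip out to main memory.
    hit_cost = hit_rate * cache_latency
    miss_cost = (1.0 - hit_rate) * (cache_latency + memory_latency)
    return hit_cost + miss_cost


# Hypothetical latencies in clock cycles (assumptions, not GF100 specs).
L2_LATENCY = 200
DRAM_LATENCY = 600

for hit_rate in (0.0, 0.5, 0.9):
    cycles = average_access_time(hit_rate, L2_LATENCY, DRAM_LATENCY)
    print(f"L2 hit rate {hit_rate:.0%}: about {cycles:.0f} cycles per access")
```

With these made-up figures, going from no hits to a 90% hit rate drops the average access from 800 cycles to roughly 260. The real benefit depends entirely on the workload's hit rate and the actual latencies, neither of which NVIDIA has given us here.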
This is another change which has led to far greater flexibility and efficiency in the GF100. When combined with the flexible shared/L1 cache of each SM unit, we see how much closer the GPU guys are coming to traditional CPU workflows and cache usage. It used to be that GPUs were primarily composed of logic and small amounts of cache, but now GPUs are starting to more closely resemble CPUs, with large amounts of cache relative to logic. CPUs still contain a far larger proportion of cache to logic, but over the next several generations of graphics cards we can expect those ratios to become closer and closer.
The benefits of more cache that is also more flexible in how it works.
Compute and GPGPU
The “all in one” philosophy that seems to surround Fermi was the biggest reason for its delay, but it is also one of the most important reasons for its existence. Not only can it render games in industry-leading fashion, it is also the most well-rounded compute platform that NVIDIA has introduced to date.
Raytracing is an exciting example of what the compute nature of the GF100 is able to do. While previous generations of parts were not flexible enough, and lacked the general capabilities required for raytracing, the GF100 can handle enough of the required functions to allow certain types of raytracing to be done on the GPU.
Ryan covers the compute portion of the GF100 in his overview of the Fermi architecture, which was introduced to us last August/September. While the theoretical numbers for the GF100 and competing HD 5870 are fairly close, the actual throughput of the GF100 should be significantly higher. AMD, while giving a nod to compute and GPGPU with their latest architecture, have not given it the tools needed to fully leverage the power of its 1600 stream units. This will not be an issue for NVIDIA and the GF100. Just as we saw the GPGPU performance of the GTX 280/285 eclipse that of the competing HD 4870 and 4890 chips, I believe we will see the same type of separation in performance between the GF100 and the Cypress chips from AMD. Each SM in NVIDIA’s GF100 looks far more CPU-like than anything we have seen before, and the integer and floating point parallelism in each SM is simply astounding when looking from a traditional CPU standpoint.
AI functionality, such as pathtracing, gets a huge boost from the increased abilities of the CUDA cores as well as the compute performance in general.
One last area that has received a lot of attention from NVIDIA is that of context switching. With the CUDA cores doing work on geometry shading, tessellation, pixel shading, compute/GPGPU work, and PhysX, context switching used to involve a lot of overhead and lost cycles. Now context switching can occur once per clock. This has taken the time down to around 20 microseconds per context switch. This should again dramatically improve performance in applications which use DirectCompute and GPU physics inside of a traditional gaming app that also requires pixel shading.
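To get a feel for what a figure of around 20 microseconds means in a game, the back-of-the-envelope sketch below works out how much of a frame's time budget is spent switching contexts. The workload (10 switches per frame at 60 frames per second) is a hypothetical example of ours, not something NVIDIA has stated, and the function name is illustrative.

```python
# Figure quoted in the article: roughly 20 microseconds per context switch.
CONTEXT_SWITCH_SECONDS = 20e-6


def switch_overhead_fraction(switches_per_frame, frames_per_second,
                             switch_seconds=CONTEXT_SWITCH_SECONDS):
    """Fraction of each frame's time budget spent on context switches."""
    frame_budget = 1.0 / frames_per_second
    return switches_per_frame * switch_seconds / frame_budget


# Hypothetical workload (an assumption): a game running at 60 fps that hops
# between graphics, DirectCompute, and PhysX work 10 times per frame.
fraction = switch_overhead_fraction(switches_per_frame=10, frames_per_second=60)
print(f"{fraction:.1%} of the frame budget")  # prints "1.2% of the frame budget"
```

Under those assumptions the switches cost about 0.2 milliseconds out of a 16.7 millisecond frame, which is small enough that mixing compute and graphics work in the same frame becomes practical.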