More on SMs
    NVIDIA made another major change by including four TEX units in each SM.  This improves the overall efficiency of the architecture by associating the TEX units more closely with the shader units.  NVIDIA has also increased performance by clocking the TEX units much higher than before: each TEX unit runs at ½ the speed of the CUDA cores.  So if we assume that NVIDIA is aiming for a 1.5 GHz clockspeed for the CUDA cores, the TEX units will run at 750 MHz, which is still higher than the 600 MHz to 650 MHz we see with current (non-overclocked) GT200 based cards.
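
The clock relationship above is simple arithmetic, sketched here in Python.  The 1.5 GHz CUDA core clock is the same assumption the text makes, not a confirmed specification, and the function name is purely illustrative.

```python
def tex_clock_mhz(cuda_core_clock_mhz: float) -> float:
    """TEX units run at half the CUDA core clock, per the description above."""
    return cuda_core_clock_mhz / 2


# Assumed 1.5 GHz CUDA core clock (speculative, as in the text).
gf100_tex = tex_clock_mhz(1500.0)

# Stock GT200 clock range quoted in the text, in MHz.
gt200_low, gt200_high = 600.0, 650.0

print(f"GF100 TEX clock: {gf100_tex:.0f} MHz")
print(
    "Increase over GT200: "
    f"{gf100_tex / gt200_high - 1:.0%} to {gf100_tex / gt200_low - 1:.0%}"
)
```

If the final CUDA core clock lands lower than 1.5 GHz, the TEX clock scales down with it, so the advantage over GT200 depends entirely on that assumed figure.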

With the more tightly coupled TEX units running at a higher clockspeed, texturing performance increases significantly over that of the GT200.

    The caches in each SM have also received a total makeover.  Each SM features a combined shared/L1 cache as well as a large register file and smaller texture caches.  The L1/shared cache is 64 KB in size, and can be dynamically split between L1 and shared, depending on the programs being executed.  The maximum amount that either L1 or shared can take is 48 KB, while the other portion gets the remaining 16 KB.  So in GPGPU workloads we would expect the L1 to take up the full 48 KB, while the shared portion takes the remaining 16 KB.  In many gaming situations the opposite would be true.
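
The two possible splits described above can be written out as a small Python sketch.  The names here are hypothetical and only model the sizes given in the text; they are not an NVIDIA API.

```python
L1_SHARED_TOTAL_KB = 64  # combined L1/shared pool per SM
LARGE_PORTION_KB = 48    # the most either side can take


def split_l1_shared(prefer: str) -> dict:
    """Return the L1/shared split in KB, giving 48 KB to the preferred side."""
    if prefer not in ("l1", "shared"):
        raise ValueError("prefer must be 'l1' or 'shared'")
    small = L1_SHARED_TOTAL_KB - LARGE_PORTION_KB  # the remaining 16 KB
    if prefer == "l1":
        return {"l1_kb": LARGE_PORTION_KB, "shared_kb": small}
    return {"l1_kb": small, "shared_kb": LARGE_PORTION_KB}


# The GPGPU case described in the text: L1 takes the larger share.
print(split_l1_shared("l1"))
# The gaming case described in the text: shared takes the larger share.
print(split_l1_shared("shared"))
```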

    The final and most significant portion of the improved SM units is the PolyMorph Engine.  This is related directly to geometry throughput and tessellation.  If we look at previous architectures, geometry performance has improved at a glacial pace compared to pixel shading performance.  NVIDIA figures that pure geometry performance has only increased by a factor of 3X from the GeForce FX 5900 to the GTX 285, while pixel shading performance has improved by 150X.  The PolyMorph units are closely associated with the CUDA cores and SMs because of the workload that tessellation incurs.  Geometry shading and tessellation require a lot of work from the SM units in general, and data is frequently passed between the SM and the PolyMorph Engine, depending on what stage of rendering is being done.  There are five stages to each PolyMorph Engine (Vertex Fetch, Tessellation, Viewport Transform, Attribute Setup, and Stream Output).  Between stages, results are returned to the SM where further work is done, and that work is then sent to the next stage of the PolyMorph Engine.  If NVIDIA had decided to slap a dedicated tessellation unit onto their previous designs, all of that back-and-forth traffic would have incurred a huge latency penalty.  By tightly integrating the PolyMorph Engine into the SM, and providing 16 SM units per GF100 chip, NVIDIA expects to see a neat 4X+ improvement in tessellation performance compared to the current HD 5870 cards.
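
To make the back-and-forth between the PolyMorph Engine and the SM concrete, here is a minimal Python sketch that lists the handoffs in order.  It only models the ordering described above; the function name and labels are made up for illustration and say nothing about how the hardware actually schedules work.

```python
POLYMORPH_STAGES = [
    "Vertex Fetch",
    "Tessellation",
    "Viewport Transform",
    "Attribute Setup",
    "Stream Output",
]


def polymorph_sm_handoffs(stages):
    """Interleave an SM step between consecutive PolyMorph stages."""
    sequence = []
    for i, stage in enumerate(stages):
        sequence.append(f"PolyMorph: {stage}")
        # Results return to the SM before the next stage begins.
        if i < len(stages) - 1:
            sequence.append("SM: further work")
    return sequence


for step in polymorph_sm_handoffs(POLYMORPH_STAGES):
    print(step)
```

Five stages mean four round trips through the SM, which is why keeping the two physically close matters for latency.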

The PolyMorph Engine will be pushing tessellation into overdrive on the GF100.  Again, NVIDIA has optimized for locality, throughput, and lower latency by building it into the SM unit.

    NVIDIA has also tightly coupled the raster unit to the SMs.  After the PolyMorph Engine has processed the primitives, they are sent to the raster engine.  There are three stages to the raster engine: Edge Setup, Rasterizer, and Z-Cull.  Together these set up the pixels that are going to be visible and discard those that will not be.  Once this is complete, post processing and pixel shading are done by the SMs.  There are four raster units on the GF100, and each raster unit serves up to 4 SMs.

    The next level of the architecture is the GPC.  Each GPC contains four SM units and one raster unit.  The four GPCs are then connected to a large 768 KB L2 cache.  In previous generations the L2 cache was reserved for read-only TEX data.  The new L2 cache is fully writable by the GPCs, TEX units, and ROP partitions as needed.  This can dramatically cut down on main memory accesses for frequently used texture data and instructions/data.
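
The unit counts quoted so far can be tallied with a short Python sketch.  The class and field names are hypothetical; the numbers are the ones given in the text for a full GF100 chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GF100Layout:
    """Per-chip unit counts as described in the article."""
    gpcs: int = 4
    sms_per_gpc: int = 4
    raster_units_per_gpc: int = 1
    tex_units_per_sm: int = 4
    l2_cache_kb: int = 768

    @property
    def total_sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def total_raster_units(self) -> int:
        return self.gpcs * self.raster_units_per_gpc

    @property
    def total_tex_units(self) -> int:
        return self.total_sms * self.tex_units_per_sm


chip = GF100Layout()
print(f"SMs: {chip.total_sms}")
print(f"Raster units: {chip.total_raster_units}")
print(f"TEX units: {chip.total_tex_units}")
```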

Having a dedicated raster engine servicing four of the 16 SMs at a time again helps in terms of data locality, efficiency, and throughput.

    The four GPCs share six 64-bit memory controllers, which results in a 384-bit memory bus supporting up to GDDR5 memory.  The host interface for the chip connects directly to the GigaThread Engine, which feeds instructions and data to the GPCs.
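
The bus width follows directly from the controller count, as this small Python sketch shows.  The bandwidth helper is included only to show how the bus width would feed into a peak figure: the 4.0 Gbps per-pin data rate is a hypothetical placeholder, not a GF100 specification, and no memory clock is given in the text.

```python
def bus_width_bits(controllers: int, bits_per_controller: int) -> int:
    """Total memory bus width from the number and width of controllers."""
    return controllers * bits_per_controller


def peak_bandwidth_gb_s(bus_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: bits per transfer times rate, over 8."""
    return bus_bits * data_rate_gbps_per_pin / 8


gf100_bus = bus_width_bits(controllers=6, bits_per_controller=64)
print(f"Memory bus: {gf100_bus}-bit")

# Hypothetical data rate, for illustration only.
example_rate = 4.0
print(f"Peak at {example_rate} Gbps/pin: {peak_bandwidth_gb_s(gf100_bus, example_rate)} GB/s")
```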
