TurboCache – Using the PCI Express Bus
Before discussing TurboCache, we need to understand the problem with “traditional” graphics subsystems. For the majority of graphics solutions, render surfaces (such as environment maps and stencil buffers) are located in local memory on the graphics card. This has forced card vendors to use larger amounts of local graphics memory, thereby raising costs for the end user. The AGP bus was the industry's first attempt to alleviate this issue, but AGP was designed for reading memory across the bus, not writing to it. That makes recent trends and effects in gaming technology, like rendering to textures, very slow, because the AGP bus is basically one-way.
NVIDIA’s TurboCache, utilizing the high bandwidth of the PCI Express bus, is able to overcome that obstacle.
TurboCache is an NVIDIA solution that allows you to utilize system memory as a graphics buffer in a much improved fashion thanks to PCI Express. TC can render to and from system memory efficiently, texture from system memory, and dynamically allocate and de-allocate system memory for its own use.
The memory allocation process on the 6200TC is pretty straightforward. The buffer that is stored in main memory is controlled and maintained by the graphics driver. Using standard Microsoft-approved methods for memory allocation, idle pages of system memory are mapped on demand as the driver determines they are needed. The memory is allocated and released on demand, meaning that no memory is statically locked down, as is the case with most integrated graphics solutions. Graphics drivers have been doing this for some time already with the AGP bus and even the initial PCIe cards, but NVIDIA’s TurboCache technology extends the functionality to include renderable surfaces and writing to them.
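To make the contrast with a statically reserved buffer concrete, here is a toy model of allocate-on-demand, release-on-demand bookkeeping. It is only a sketch: the class, method, and surface names are hypothetical and have nothing to do with NVIDIA's actual driver interface, and the 96 MB cap is simply assumed for illustration.

```python
class OnDemandPool:
    """Toy bookkeeping for system memory that is mapped only while in use.

    Hypothetical illustration only; this is not the driver's real API.
    """

    def __init__(self, cap_mb: int) -> None:
        self.cap_mb = cap_mb                 # most system memory the pool may hold
        self.surfaces: dict[str, int] = {}   # surface name -> size in MB

    @property
    def in_use_mb(self) -> int:
        # Nothing is reserved up front; usage is the sum of live surfaces.
        return sum(self.surfaces.values())

    def allocate(self, name: str, size_mb: int) -> bool:
        """Map a surface if it fits under the cap; report success."""
        if name in self.surfaces:
            raise ValueError(f"surface {name!r} is already allocated")
        if self.in_use_mb + size_mb > self.cap_mb:
            return False
        self.surfaces[name] = size_mb
        return True

    def release(self, name: str) -> None:
        """Unmap a surface, handing its memory back to the system."""
        self.surfaces.pop(name, None)


pool = OnDemandPool(cap_mb=96)               # assumed cap, for illustration
pool.allocate("render_target", 8)
print("in use while rendering:", pool.in_use_mb, "MB")
pool.release("render_target")
print("in use after release:", pool.in_use_mb, "MB")
```

A statically locked buffer would report the full cap as in use at all times; in this model the figure follows the live surfaces and drops back to zero when they are released.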
6200TC vs “other” GPU
NVIDIA was quick to point out that the 6200TC still requires a local frame buffer to work (a minimum of 16 MB), and that the scan-out of the final image will always come directly from the local frame buffer, in order to keep system memory latency from hurting the frame rate.
For this first iteration of TurboCache, NVIDIA has limited the total addressable memory of the 6200TC to 128 MB, including the local frame buffer. That means a card with 32 MB of local memory will allocate at most 96 MB of system memory and the 16 MB version will allocate at most 112 MB of system memory. I say “at most” because in most instances, these maximums aren’t reached. During the presentation, we were shown a slide with a few applications and the system memory allocated during run time. Far Cry showed the most usage at 76 MB while UT2004 used only 16 MB and Doom 3 used 32 MB. Though no information was given about the resolution, we have to assume they were run at 800×600, with either medium or high quality settings. Overall though, system memory usage seemed manageable.
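The arithmetic behind those ceilings is just the 128 MB addressable total minus the local frame buffer. The short sketch below reproduces it and checks the usage figures from NVIDIA's slide against each ceiling. The function and variable names are made up for illustration, and since the slide did not say which card produced the numbers, the comparison is only indicative.

```python
TOTAL_ADDRESSABLE_MB = 128  # first-generation TurboCache limit, per NVIDIA

# Run-time system memory usage shown on NVIDIA's slide (MB).
REPORTED_USAGE_MB = {"Far Cry": 76, "UT2004": 16, "Doom 3": 32}


def max_system_memory_mb(local_mb: int, total_mb: int = TOTAL_ADDRESSABLE_MB) -> int:
    """Upper bound on system memory the card may map, in MB."""
    if not 0 < local_mb <= total_mb:
        raise ValueError("local memory must be between 1 MB and the addressable total")
    return total_mb - local_mb


for local_mb in (16, 32):
    cap = max_system_memory_mb(local_mb)
    print(f"{local_mb} MB card: at most {cap} MB of system memory")
    for game, used in REPORTED_USAGE_MB.items():
        print(f"  {game}: {used} MB reported, {cap - used} MB below the ceiling")
```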
TurboCache is also intelligent enough not to allocate more memory than is necessary, or so much that it would adversely affect performance. The 112 MB maximum applies only to systems with 512 MB of system memory or more; PCs with less main memory will have less of it allocated to the frame buffer. NVIDIA also told us that some time next year they expect TurboCache to be able to allocate as much as 256 MB in systems with over 1 GB of system memory.
It’s not all a bed of roses though, as using system memory instead of local memory for a graphics card is going to have latency issues. A memory access from the GPU to main system memory must travel from the GPU, over the PCI Express bus, to either the north bridge or the CPU (depending on your platform), then to main memory, and back again. That is obviously a longer process than simply referencing on-board memory chips. Though the PCI Express bus does offer high bandwidth, NVIDIA needed to address this latency issue in order to offer competitive performance.
The NV44 chip went through some slight modifications in order to reduce the effects of higher-latency system memory. The pipelines and the way they are fed data have been changed so that the memory accesses required by the GPU are issued before the data is needed. Basically, the data the GPU needs is ready whenever the GPU needs it. This is not simply a “prefetch” for the 6200TC; instead, the hardware pipelines themselves have been stretched to accommodate the added latency. This can succeed, NVIDIA says, because the work a GPU does is largely parallel in nature and has many independent “threads” that allow memory to be accessed more linearly than randomly. You may remember that this was RAMBUS’ big setback: the inability to mask latency in a highly random memory reading cycle.
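The general idea of hiding latency behind independent work can be illustrated with a textbook back-of-the-envelope model: if each thread computes for a while and then waits on memory, enough threads in flight keep the pipeline busy during the wait. The sketch below is that generic model, not a description of NV44's actual pipeline, and the cycle counts are invented for illustration.

```python
import math


def utilization(threads: int, compute_cycles: int, latency_cycles: int) -> float:
    """Fraction of time the pipeline is busy in a simple round-robin model.

    Each thread computes for `compute_cycles`, then waits `latency_cycles`
    for memory; other threads run during the wait.
    """
    return min(1.0, threads * compute_cycles / (compute_cycles + latency_cycles))


def threads_to_hide(compute_cycles: int, latency_cycles: int) -> int:
    """Smallest number of threads that keeps the pipeline fully busy."""
    return math.ceil((compute_cycles + latency_cycles) / compute_cycles)


# Invented numbers: 10 cycles of work per memory request, and a "local"
# versus a "system memory" round trip of 40 and 90 cycles respectively.
for label, latency in (("local memory", 40), ("system memory", 90)):
    needed = threads_to_hide(10, latency)
    print(f"{label}: {needed} threads in flight hide {latency} cycles of latency")
    print(f"  with a single thread, utilization is {utilization(1, 10, latency):.0%}")
```

In this model, longer latency does not have to cost throughput as long as more independent work is kept in flight, which is the trade the article describes as stretching the pipelines.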
One last issue to discuss is the performance of these cards on various chipsets. Since accessing system memory is obviously key to the performance of the card, the chipset can have a big impact on overall gaming performance. NVIDIA claims that their NF4 chipset outperforms the Intel 915 chipset by a noticeable amount when using the same 6200TC card. Also, the performance delta between the 16 MB and 32 MB versions of the 6200TC is greater on an NF4 chipset, as memory accesses are much faster.