A New Memory Controller
Besides the brand new 3D architecture that the R520 introduces, we are also seeing a completely redesigned memory bus architecture. This topic is difficult to conceptualize, but I’ll do my best to explain it.
Standard memory controllers like the one found in the R4xx core are usually tangled messes of traces going back and forth between the memory controller and the DRAM modules on the board. In its simplest form, a central memory controller might be bordered by four memory chips: one to the north, south, east and west. There are then traces going from the controller directly to each of these memory chips. Then there are ‘clients’ or application requests for information from memory that attach themselves directly to the memory controller. In order for a client to read data from memory, it has to make the request to the controller, let the controller find the data, wait for the controller to receive the data and then wait for the controller to give the data back to the client. In the new Ring Bus memory architecture ATI has created, things are a bit different.
This complex diagram can be broken down in the same way. The memory controller is still centralized, but instead of connecting directly to each memory chip, it now communicates with ‘ring stops’, each of which is connected to one or two memory modules. Memory accesses are made by passing a ‘token’ from stop to stop to determine which ring stop may communicate with the controller at any given clock. A ring stop can still fetch data from memory when it does not hold the ‘token’, which frees the memory controller to work on a different client’s request. And when the ring stop has the data a client requested, it can pass it directly to the client without the assistance of the memory controller.
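To make the token idea a little more concrete, here is a minimal toy simulation in Python. It is only a sketch of the behavior described above, not ATI's design: the four stop names, the 3-clock fetch latency, and the `RingStop` and `simulate` names are all my own assumptions for illustration.

```python
from collections import deque


class RingStop:
    """One stop on the ring, fronting one or two memory modules (hypothetical model)."""

    def __init__(self, name, fetch_latency=3):
        self.name = name
        self.fetch_latency = fetch_latency  # assumed clocks needed to read from DRAM
        self.in_flight = []                 # [client, clocks_remaining] pairs

    def accept(self, client):
        """Take a new request handed over by the memory controller."""
        self.in_flight.append([client, self.fetch_latency])

    def tick(self):
        """Advance one clock and return the clients whose data is now ready."""
        for request in self.in_flight:
            request[1] -= 1
        done = [client for client, left in self.in_flight if left <= 0]
        self.in_flight = [r for r in self.in_flight if r[1] > 0]
        return done


def simulate(requests, stop_names=("N", "E", "S", "W"), max_clocks=50):
    """requests: list of (client, stop_name). Returns (clock, client, stop) deliveries."""
    stops = {name: RingStop(name) for name in stop_names}
    pending = {name: deque() for name in stop_names}
    for client, stop in requests:
        pending[stop].append(client)

    deliveries = []
    token = 0
    for clock in range(max_clocks):
        holder = stop_names[token]
        # Only the stop holding the token may talk to the controller this clock.
        if pending[holder]:
            stops[holder].accept(pending[holder].popleft())
        # Every stop keeps fetching from memory, token or not, and hands
        # finished data straight to the client.
        for name in stop_names:
            for client in stops[name].tick():
                deliveries.append((clock, client, name))
        token = (token + 1) % len(stop_names)  # pass the token to the next stop
        if len(deliveries) == len(requests):
            break
    return deliveries


deliveries = simulate([("texture", "N"), ("z", "E"), ("color", "N")])
```

In this toy run the texture data arrives while the token is sitting at a different stop, which is the whole point: the fetch does not hold up the controller.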
This does add another layer of complexity to the controller, though, in the form of memory request arbitration. Much as the 3D core of the R520 has a dispatch core, the new memory controller has an arbitration unit that decides which client’s request should get priority on the ring bus. This arbitration is done mostly on the software side, meaning that the Catalyst driver will have a large part to play in memory controller improvements on a per-game basis from now on.
Using a weighting system that is basically a software algorithm designed by the ATI engineers, the arbitration system decides at any given point which client has the most ‘need’ for its data request to be filled. The client with the highest priority gets access to its data first, the weights are then adjusted based on other parameters, and the cycle continues.
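ATI has not published the actual weighting algorithm, so the following Python sketch is purely illustrative. It assumes one simple scheme: the highest current weight wins, the winner drops back to its base weight, and every waiting client's weight grows by its base weight. The client names, the weight values, and the `arbitrate` function are all hypothetical.

```python
def arbitrate(base_weights, rounds):
    """Return the order in which clients are granted the bus.

    base_weights: dict of client name -> base priority (made-up values).
    Ties are broken by client name, which is an arbitrary choice for this sketch.
    """
    current = dict(base_weights)
    grants = []
    for _ in range(rounds):
        # The client with the highest current weight has the most 'need'.
        winner = max(current, key=lambda c: (current[c], c))
        grants.append(winner)
        # Adjust the weights: the winner resets, everyone else gets more urgent.
        for client in current:
            if client == winner:
                current[client] = base_weights[client]
            else:
                current[client] += base_weights[client]
    return grants


grants = arbitrate({"texture": 3, "z": 2, "color": 1}, rounds=6)
```

Even with these made-up numbers you can see the general shape: the heavily weighted texture client is served most often, but the low-priority color client is never starved because its weight keeps climbing while it waits. Tuning weights like these per game is the kind of thing a driver could do.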
The rings themselves are 256 bits wide and run in opposite directions in order to reduce latency. Because this trace routing is much less complex, ATI is able to ramp up its memory clock speeds to levels it would not otherwise have been able to reach.
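Why do two rings running in opposite directions reduce latency? A quick hop count shows it. This Python sketch assumes a ring with four stops numbered 0 to 3, matching the simplified north/east/south/west layout used earlier; the function names are mine.

```python
def hops_one_way(src, dst, num_stops):
    """Hops needed when data can only travel in one direction around the ring."""
    return (dst - src) % num_stops


def hops_two_way(src, dst, num_stops):
    """Hops needed when data can take whichever of the two directions is shorter."""
    clockwise = (dst - src) % num_stops
    return min(clockwise, num_stops - clockwise)


NUM_STOPS = 4  # assumed for illustration
pairs = [(s, d) for s in range(NUM_STOPS) for d in range(NUM_STOPS)]
worst_one_way = max(hops_one_way(s, d, NUM_STOPS) for s, d in pairs)
worst_two_way = max(hops_two_way(s, d, NUM_STOPS) for s, d in pairs)
```

With a single direction, getting from stop 0 to stop 3 means going almost all the way around. With the second ring, the same trip is a single hop the other way, and the worst case drops from three hops to two.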
The memory channels are now essentially 32 bits wide each on the X1800 architecture (shown at the top), while on the X850 architecture they were 64 bits wide. The number of channels in the new memory architecture can also be scaled down dynamically, even to 0 channels if needed. My guess is that this is a way for ATI to make a HyperMemory card with no on-board memory if the need arises.
The cache design has also changed pretty dramatically on the R520 memory controller. Caches are now fully associative, meaning cache lines can now map to any location in external memory, whereas in previous cache designs the maps were limited to direct mapping of external memory. And since the texture, color, Z and stencil caches are fully associative as well, we will see a reduction in memory bandwidth requirements when taking advantage of these features. ATI is claiming gains of up to 25% clock for clock in fill-rate-bound and bandwidth-bound cases compared to the X850.
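The difference between the two mapping schemes is easy to see with a toy example. The Python sketch below counts cache hits for a direct-mapped cache and a fully associative one of the same size. The 4-line cache size, the access pattern, and the LRU replacement policy are my assumptions for illustration; the article's sources do not say which replacement policy the R520 uses.

```python
from collections import OrderedDict


def direct_mapped_hits(addresses, num_lines):
    """Each block address can live in exactly one cache line: addr % num_lines."""
    lines = [None] * num_lines
    hits = 0
    for addr in addresses:
        slot = addr % num_lines
        if lines[slot] == addr:
            hits += 1
        else:
            lines[slot] = addr  # evict whatever was in that one slot
    return hits


def fully_associative_hits(addresses, num_lines):
    """Any block address can live in any line; evict the least recently used (assumed)."""
    lines = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in lines:
            hits += 1
            lines.move_to_end(addr)        # mark as most recently used
        else:
            if len(lines) >= num_lines:
                lines.popitem(last=False)  # drop the least recently used line
            lines[addr] = True
    return hits


# Blocks 0 and 4 collide in a 4-line direct-mapped cache (both map to slot 0).
pattern = [0, 4, 0, 4, 0, 4]
dm_hits = direct_mapped_hits(pattern, 4)
fa_hits = fully_associative_hits(pattern, 4)
```

In the direct-mapped case the two blocks keep kicking each other out, so every access goes out to memory. In the fully associative case both blocks fit side by side and only the first two accesses miss. Fewer misses means fewer trips to external memory, which is where the bandwidth savings come from.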
Finally, the new memory controller has an improved hierarchical Z buffer that detects and discards hidden pixels before shading work is done on them, saving valuable processing power. The new ATI technique uses floating point for increased precision and can catch up to 60% more hidden pixels than the X850. That is a lot of work that can be moved out of the way for pixels that the user might actually see!
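For readers unfamiliar with the general idea behind hierarchical Z, here is a minimal Python sketch. It is the textbook version of the technique, not ATI's implementation: each screen tile remembers the farthest depth already drawn in it, and an incoming block of pixels is thrown away if even its nearest point is behind that. The depth convention (smaller is closer), the tile IDs, and the `hiz_cull` name are assumptions.

```python
def hiz_cull(tile_max_depth, blocks):
    """Reject blocks of pixels that are entirely hidden, before any shading.

    tile_max_depth: dict of tile id -> farthest depth already drawn in that tile.
    blocks: list of (tile_id, nearest_depth) for incoming geometry.
    Depths are floats where smaller means closer to the viewer (assumed).
    Returns (blocks_to_shade, culled_count).
    """
    to_shade = []
    culled = 0
    for tile_id, nearest in blocks:
        if nearest > tile_max_depth[tile_id]:
            culled += 1  # even the closest point is behind what is already there
        else:
            to_shade.append((tile_id, nearest))  # might be visible, test per pixel
    return to_shade, culled


tile_max_depth = {0: 0.25, 1: 0.9}
incoming = [(0, 0.5), (0, 0.1), (1, 0.95), (1, 0.3)]
to_shade, culled = hiz_cull(tile_max_depth, incoming)
```

A real implementation would also update the per-tile depth as new pixels are written, which this sketch leaves out. The precision of the stored depth matters because a coarse value forces the test to be conservative, so more hidden blocks slip through to the shaders.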
All of these new memory enhancements are going to be more noticeable in the most bandwidth-demanding situations, including higher resolutions (1600×1200+) and with AA and AF turned up.