The R520 Architecture – A New Beginning
ATI has been steadily upgrading their last architecture, which debuted with the Radeon 9700, with moderate success over the last couple of years through the Radeon 9800, X800 and X850 lines of graphics processors. It has been quite a while since we have seen a completely re-architected design from the ATI engineers, and because of that the R520 architecture we are detailing here has a lot of pressure on its shoulders.
The R520 architecture has many dramatic shifts from the previous ones we have seen from ATI including a new focus on efficiency, a completely redone memory controller, SM3.0 support, increased image quality features and scalability to allow for a diverse product line with the same feature sets.
During my recent trip to San Francisco to attend the latest ATI Technology Days, I was introduced to this new architecture by the people who created it, and they took a keen interest in making sure we understood it well enough to recognize the ingenuity behind it.
The entire architecture is seen here and it includes 16 pixel shader pipelines broken into four individual shader cores that are dynamically threaded by a dispatch processor. There are 8 vertex shader pipes, 16 texture address units, 16 texture units and 16 Render Back-End Units, which are essentially the raster operators that write pixels and handle AA and Z culling.
The entire X1000 family of GPUs is built on the 90nm process technology, which let ATI make the changes they required in a cost-efficient manner. While the move to 90nm is important for several reasons, including revenue, yields and speeds, the most important part by far is the set of architectural changes we are going to see on the R5xx cores.
At first glance the 16 pixel pipe and 8 vertex pipe engine in the R520 may look like yesterday’s news compared to the 24 pixel pipe G70 core but there is more to the design of the R520 that we have yet to discuss. Let’s look at the shader pipeline a bit closer.
Much like we saw in the G70 architecture and the NV40 before it, the R520 core has four quad pixel shader cores for a total of 16 pixel shader pipelines. What differs here though is the Ultra-Threading Dispatch Processor, which is essential to the new architecture. In a basic sense, this component is responsible for deciding which pixel groups should be worked on and when. Each pixel group is a ‘thread’ that is 16 pixels in size and can be moved in and out of the work queue by the dispatch processor.
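To get a feel for why swapping threads in and out of the work queue matters, here is a toy model in Python. It is only a sketch under stated assumptions, not ATI's actual scheduling logic: the function name `alu_utilization`, the single shared ALU, and the cycle counts (4 cycles of math between texture fetches, 20 cycles of fetch latency) are all made up for illustration. A thread that is waiting on a fetch is set aside and the next ready thread is run in its place.

```python
def alu_utilization(threads_in_flight, alu_cycles=4, fetch_latency=20, phases=3):
    """Toy dispatcher: fraction of cycles one shader core spends doing math.

    Every thread repeats `phases` times: `alu_cycles` of math, then a texture
    fetch that takes `fetch_latency` cycles. All numbers are illustrative
    assumptions, not R520 figures.
    """
    ready_at = [0] * threads_in_flight          # cycle at which each thread's data arrives
    phases_left = [phases] * threads_in_flight  # math phases each thread still has to run
    cycle = 0
    busy = 0
    while any(phases_left):
        # Threads that still have work and are not waiting on a fetch.
        runnable = [t for t in range(threads_in_flight)
                    if phases_left[t] and ready_at[t] <= cycle]
        if not runnable:
            # Nothing is ready: the core idles until the next fetch returns.
            cycle = min(ready_at[t] for t in range(threads_in_flight)
                        if phases_left[t])
            continue
        t = runnable[0]
        cycle += alu_cycles
        busy += alu_cycles
        phases_left[t] -= 1
        ready_at[t] = cycle + fetch_latency     # thread is swapped out until its fetch lands
    return busy / cycle


for n in (1, 2, 6):
    print(n, "threads in flight ->", round(alu_utilization(n), 2))
```

With only one thread the core mostly sits idle waiting on fetches; with enough threads in flight there is always another pixel group ready to take over, which is the basic idea behind keeping many small threads available to the dispatcher.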
Each pixel processor quad is broken down as per the diagram above. There are two scalar and two vector ALUs that can combine to perform any ADD, MUL or MADD in a single clock. The branch prediction unit, shown in purple, is a new addition; it can perform a single flow control instruction every clock, which allows for faster processing of dynamic flow control.
Dynamic flow control is one of the key features of Shader Model 3.0 and has been around since we first saw NVIDIA support SM3.0 with their NV40 architecture. However, ATI is claiming to have ‘done SM3.0 right’ by including additional logic that can accelerate flow control even with a large number of threads, in order to avoid the inherent latency involved with the if/then statements and loops that any programmer will recognize.
This flow control logic, and the very small thread size of 16 pixels, allow the ATI R520 architecture to properly handle the hundreds of simultaneous threads that exist inside the core at any given time. With the shader pipeline we showed you before, we know that each thread can perform up to 6 different shader instructions on 4 pixels per clock with threads of 16 pixels, allowing the R520 to have fine-grain parallelism that is useful in many ways.
In this diagram, the red squares indicate groups of pixels that will have to run through both sides of the sample shader and thus take twice as long to complete their processing. You can see that a smaller thread size gives much finer granularity: far fewer pixels are cycled through both sides of the shader with a 16 pixel thread than with a 256 or 4096 pixel thread.
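The effect in the diagram can be reproduced with a small Python sketch. The assumptions here are mine and not from ATI: threads are modeled as square tiles (4x4, 16x16 and 64x64 for 16, 256 and 4096 pixels), the "shader" is a single if/else whose condition is true inside a circle on a 256x256 image, and any tile whose pixels disagree on the branch pays for both sides. The names `divergent_fraction` and `inside_circle` are hypothetical.

```python
def divergent_fraction(width, height, tile, takes_branch):
    """Fraction of pixels that sit in a tile (thread) whose pixels disagree
    on the branch, and therefore run both sides of the shader."""
    divergent_pixels = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            y_end = min(ty + tile, height)
            x_end = min(tx + tile, width)
            # Collect the branch outcomes seen inside this tile.
            outcomes = {takes_branch(x, y)
                        for y in range(ty, y_end)
                        for x in range(tx, x_end)}
            if len(outcomes) > 1:
                divergent_pixels += (y_end - ty) * (x_end - tx)
    return divergent_pixels / (width * height)


def inside_circle(x, y):
    # Illustrative branch condition: true for pixels inside a circle.
    return (x - 128) ** 2 + (y - 128) ** 2 < 80 ** 2


# Square tiles of 4x4, 16x16 and 64x64 stand in for 16, 256 and 4096 pixel threads.
fractions = {tile * tile: divergent_fraction(256, 256, tile, inside_circle)
             for tile in (4, 16, 64)}

for thread_size, frac in fractions.items():
    # If both sides cost the same, a divergent thread takes twice as long,
    # so the relative shading cost is 1 + (fraction of divergent pixels).
    print(thread_size, "pixel threads:", round(frac, 3),
          "divergent, relative cost", round(1 + frac, 3))
```

Only the tiles that straddle the edge of the circle pay the double cost, so the smaller the tile, the thinner that band of "red squares" becomes.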
Of course, having threads of this size comes with some trade-offs, including the need for the additional logic that dispatches and holds threads for processing. The need to hold threads at a certain state while working on other, more important threads requires the inclusion of a new, larger register array that can store the values and states for all the pixels in a thread. The general purpose register array has been enlarged for this reason.
Also, because you have many more threads cycling through the core, you need to have a much faster branch execution unit in order to keep up. With ATI’s R520 architecture you can see that the separate branch execution unit eliminates the cycles of overhead by figuring out which path the pixels are going to take through the shader code in parallel with the other work (much like texture look ups have been handled for some time). This results in shaders with flow control executing in fewer clock cycles.
Starting with the R520, all shader calculations are done in 128-bit floating point precision at full speed and no longer drop down to partial precision. In fact, the new ATI core does not recognize partial precision and will only work on data in full 128-bit precision. This is yet another reason why the general purpose registers were enlarged.
The vertex engine in the R520 remains mostly unchanged. It still sports 8 full vertex shader pipelines that each can work on 2 shader instructions per clock. These have been upgraded to support SM3.0 with dynamic flow control and up to 1024 instructions in total.
Another improvement in the ATI design this time around is the ability to decouple the rendering components allowing for a much more flexible architecture. This allows the ATI engineers to create a core with fewer pixel or vertex pipes while still maintaining the feature set of the architecture completely. The number of each component can be varied independently and thus ATI can release optimal product lines based on users’ needs. This is why when we discuss the X1800 vs. the X1600 vs. the X1300 you’ll see varying numbers of pipelines and components that we will touch on in detail there. NVIDIA has had this ability in both their G70 and NV40 architectures.
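As a rough way to picture that decoupling, the unit counts can be thought of as independent knobs on one shared design. The Python sketch below uses the R520 counts quoted earlier in this article; the `GpuConfig` class and the cut-down variant are purely hypothetical illustrations and do not describe any real X1600 or X1300 part.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GpuConfig:
    """Unit counts for one chip built from the same architecture (illustrative)."""
    pixel_pipes: int
    vertex_pipes: int
    texture_units: int
    render_backends: int
    shader_model: str = "SM3.0"   # the feature set stays the same across variants


# R520 counts as given in this article.
r520 = GpuConfig(pixel_pipes=16, vertex_pipes=8,
                 texture_units=16, render_backends=16)

# Hypothetical smaller chip: each count is changed independently,
# while the feature set is carried over untouched.
hypothetical_cut_down = replace(r520, pixel_pipes=8, vertex_pipes=4,
                                texture_units=8, render_backends=8)

print(r520)
print(hypothetical_cut_down)
```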