Bulldozer Architecture (continued)

The text below was taken from Bulldozer at ISSCC 2011 – The Future of AMD Processors.

The second topic covered at ISSCC was the “40-Entry unified Out-of-Order Scheduler and Integer Execution Unit for the AMD Bulldozer x86-64 Core”.  Single thread performance is still of great importance for modern processors, and this is an area where AMD has lagged behind the competition.  The first pieces of work aimed at better single thread performance were the fetch/prefetch, branch prediction, and decode units.  AMD has still not covered those portions in depth; all we know is that a lot of work has gone into each individual unit.

Each integer unit has its own scheduler and is composed of two execution units and two address generation units.  The execution units are further divided so that one handles multiplies and the other handles divides.  These are again newly designed units which have very little in common with previous processor architectures.

The schedulers have some very interesting wrinkles to them.  First off is the support for 40-entry, out-of-order scheduling.  They also support up to four 64-bit instructions in flight.  Michael Golden presented the paper, and his description of the clock characteristics of these tightly knit units is as follows:

The out-of-order scheduler must efficiently pick up to four ready instructions for execution and wake up dependent instructions so that they may be picked in the next cycle. The execution units must compute results in a single cycle and forward them to dependent operations in the following cycle. All of this is required so that the module gives high architectural performance, measured in the number of instructions completed per cycle (IPC).
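The pick and wake-up loop described in the quote above can be sketched in a few lines of Python.  This is only a toy model, not AMD's design: the `Op` class, the `schedule` function, and the register names are all hypothetical, every operation is assumed to take a single cycle, and each destination register is assumed to be written once (as if renaming had already happened).  The 40-entry window and the four-wide pick come from the paper title and the quote.

```python
from dataclasses import dataclass

WINDOW_SIZE = 40  # scheduler entries, from the paper title
PICK_WIDTH = 4    # up to four ready instructions picked per cycle


@dataclass(frozen=True)
class Op:
    """Hypothetical instruction record: a name, one destination, some sources."""
    name: str
    dest: str
    srcs: tuple = ()


def schedule(ops, ready_regs=()):
    """Return one list of op names per cycle (toy model, single-cycle ops)."""
    if len(ops) > WINDOW_SIZE:
        raise ValueError("toy model: all ops must fit in the scheduler window")
    ready = set(ready_regs)  # registers whose values are already available
    waiting = list(ops)      # list order doubles as age order (oldest first)
    cycles = []
    while waiting:
        # Pick: the oldest ops whose sources are all ready, up to PICK_WIDTH.
        picked = [op for op in waiting
                  if all(src in ready for src in op.srcs)][:PICK_WIDTH]
        if not picked:
            raise ValueError("no op can ever become ready (missing source)")
        picked_ids = {id(op) for op in picked}
        waiting = [op for op in waiting if id(op) not in picked_ids]
        # Wake-up: results computed this cycle are visible to dependents
        # on the next trip through the loop, i.e. in the following cycle.
        ready.update(op.dest for op in picked)
        cycles.append([op.name for op in picked])
    return cycles


demo = [
    Op("a", "r1"), Op("b", "r2"),
    Op("c", "r3", ("r1", "r2")),  # depends on a and b
    Op("d", "r4", ("r3",)),       # depends on c
    Op("e", "r5"), Op("f", "r6"), Op("g", "r7"),
]
print(schedule(demo))  # [['a', 'b', 'e', 'f'], ['c', 'g'], ['d']]
```

Note how the dependent chain a/b, c, d still takes three cycles no matter how wide the pick is, while the independent operations fill the remaining slots.  That is the sense in which fast wake-up and single-cycle forwarding feed directly into IPC.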


What is perhaps the most interesting aspect of these new designs is the use of standard cells vs. fully custom cells.  Place and route of standard cells can be automated, so it is relatively easy to create complex designs fairly quickly.  Custom cell layout is very complex and time consuming, but it has the advantage of being very efficient in terms of power consumption, and it switches faster than standard cell designs.  Somehow AMD has taken a standard cell design built on the GLOBALFOUNDRIES 32 nm SOI process and made it perform at custom cell levels.  The integer execution units and the scheduler can run at the same 3.5 GHz+ speed as the rest of the chip, even though portions of them are made with standard cells.

The full 8-core / 4-module Bulldozer Architecture found in AMD FX.

This has apparently allowed AMD to prototype these designs rapidly.  That means getting to market faster than with a fully custom part, and it also allows AMD to further test the performance and attributes of the standard cell design and possibly change it without the time and manpower constraints of custom cell work.  How AMD has achieved this is beyond me.  Being able to follow standard cell design rules and still achieve custom cell performance has been the holy grail of CPU/GPU design.  Obviously this has limitations, as the processor is not made up entirely of standard cells.  I believe that Intel also utilizes some standard cell features in their latest series of processors, so AMD is not exactly alone here.

Power

Previous AMD processors were not designed from the ground up to implement complex and efficient power saving schemes.  Since Bulldozer is a new design altogether, the engineers are able to build power saving into the processor more effectively.  Throughout the years we have seen small jumps forward from AMD in power saving techniques, but Bulldozer will be the first desktop/server product to have a fully comprehensive suite of power saving technologies.


The CPU, in typical workloads (obviously not including "Furmark" in SLI/Crossfire situations), takes up the majority of power in a system.  Cutting a significant percentage of the power draw at that one component will decrease the overall system draw to a great degree.


AMD now has fully gated power to the individual cores, which allows them to be completely turned off when not in use.  Sharing functional units (such as fetch and decode) between the cores in a module, rather than replicating them for each core, also cuts down on the complexity, and thereby the power draw, of the overall processor relative to how many logical cores it has.  The clock grid (which provides the clock signals throughout the processor) has also been radically redesigned so as to be less of a power sink while still being efficient at keeping the processor clicking along.

Clock gating, which turns off the clock to individual components such as execution units, has been much more thoroughly implemented.  There are something like 30,000 clock enables throughout the design, and this should allow an unprecedented amount of power savings (and heat reduction) even when the CPU is at high usage rates.  Even though a processor might be at 100% utilization, not all functional units are being used or need to be clocked.  With highly granular control over which units can be gated, overall power draw and heat production can be reduced dramatically even at high utilization rates.
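To make the "100% utilization, but not every unit clocked" point concrete, here is a tiny Python sketch.  The unit names and the relative power numbers are made up purely for illustration (they are not AMD figures), and `clock_power` is a hypothetical helper: it simply adds up the clock power of whichever units still receive a clock.

```python
# Hypothetical functional units with invented relative clock power
# (arbitrary units, NOT measured or published AMD numbers).
UNIT_CLOCK_POWER = {
    "int_alu": 3,
    "agu": 2,
    "multiplier": 1,
    "divider": 1,
    "fpu": 5,
}


def clock_power(active_units, gated=True):
    """Total clock power; with gating, only active units are clocked."""
    active = set(active_units)
    unknown = active - set(UNIT_CLOCK_POWER)
    if unknown:
        raise ValueError(f"unknown units: {sorted(unknown)}")
    if not gated:
        # Without gating every unit toggles every cycle, used or not.
        return sum(UNIT_CLOCK_POWER.values())
    return sum(UNIT_CLOCK_POWER[unit] for unit in active)


# A core that is "100% busy" running a simple integer loop.
busy_integer_loop = {"int_alu", "agu"}
print(clock_power(busy_integer_loop, gated=False))  # 12
print(clock_power(busy_integer_loop, gated=True))   # 5
```

In this invented example the core is fully busy in both cases, yet gating the idle multiplier, divider, and FPU removes more than half of the clock power.  The finer the granularity of the enables, the closer the clocked set can track what the workload actually uses.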

AMD Turbo Core will also receive a great amount of attention.  The current Turbo Core we see in the X6 processors is somewhat underwhelming given the overall complexity of AMD’s implementation.  For example, when three or fewer cores are being utilized on the X6 1090T, those cores will clock up to 3.6 GHz, while the other three go down to 800 MHz.  There is no real fine tuning of performance or TDP here, just an “on/off” switch for clocking half of the cores 400 MHz higher while downclocking the rest.  This is fairly basic compared to Intel’s system.  Now it seems that AMD is implementing a system much like Intel’s.  We should see Turbo frequencies that vary with the number of active cores, much more like what Intel offers with Sandy Bridge.
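The "on/off" nature of the X6 scheme is easy to see when it is written down as a rule.  The sketch below is a simplification built only from the figures in the paragraph above: the 3.6 GHz turbo clock, the three-core threshold, and a 3.2 GHz base clock inferred from the "400 MHz higher" remark.  The function name `active_clock_mhz` is hypothetical, and the real hardware decision involves more than a core count.

```python
TOTAL_CORES = 6           # Phenom II X6
TURBO_MAX_ACTIVE = 3      # turbo engages with three or fewer busy cores
TURBO_MHZ = 3600          # turbo clock quoted for the X6 1090T
BASE_MHZ = TURBO_MHZ - 400  # inferred from "400 MHz higher"


def active_clock_mhz(active_cores):
    """Clock of the busy cores under the simple X6-style on/off rule."""
    if not 1 <= active_cores <= TOTAL_CORES:
        raise ValueError("active_cores must be between 1 and 6")
    # A single threshold: either the full boost or none at all.
    return TURBO_MHZ if active_cores <= TURBO_MAX_ACTIVE else BASE_MHZ


for n in range(1, TOTAL_CORES + 1):
    print(n, active_clock_mhz(n))
```

The output is a single step: one, two, or three busy cores all get the same boost, and four or more get none.  A finer-grained scheme would return a different frequency for each core count instead of switching at one threshold.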


Due to the ground-up design of Bulldozer, and the focus on decreasing power draw and heat production, we should see a nice reduction in power draw across the entire processor.

 

Bulldozer is a comprehensive blank sheet design, very similar to the jump the company took going from the K5/K6 to the original Athlon.  AMD certainly hopes that it will be able to compete more adequately with Intel in terms of overall performance per watt, as well as die size and transistor count.  When the Phenom was originally detailed, many thought that it would prove to be the counter to the Core 2 that AMD needed, but unfortunately that design was not forward thinking enough to compete adequately.  Up through the current generation of parts, Intel has been able to use fewer transistors and a smaller die size to create products that are significantly faster than what AMD has been able to provide.
