Integer Scheduler and Execution Unit
The second topic covered at ISSCC was the paper “40-Entry unified Out-of-Order Scheduler and Integer Execution Unit for the AMD Bulldozer x86-64 Core”.  Single thread performance is still of great importance for modern processors, and this is an area where AMD has lagged behind the competition.  The first work toward better single thread performance went into fetch/prefetch, branch prediction, and decode.  AMD has still not covered those portions in depth; all we know is that a lot of work has been done on each individual unit.

Each integer unit has its own scheduler.  Each integer unit consists of two execution units and two address generation units.  The execution units are further specialized so that one handles multiplies and the other handles divides.  These are again newly designed units which have very little in common with those of previous processor architectures.

The schedulers have some very interesting wrinkles to them.  First off is support for out-of-order scheduling across 40 entries.  The scheduler also supports up to 4 x 64 bit instructions in flight.  Michael Golden presented the paper, and he described the clock characteristics of these tightly knit units as follows:

The out-of-order scheduler must efficiently pick up to four ready instructions for execution and wake up dependent instructions so that they may be picked in the next cycle. The execution units must compute results in a single cycle and forward them to dependent operations in the following cycle. All of this is required so that the module gives high architectural performance, measured in the number of instructions completed per cycle (IPC).
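To make the pick and wakeup loop in that quote a little more concrete, here is a minimal toy model in Python.  It is only a sketch under stated assumptions, not AMD's design: the 40 entries and the pick width of four come from the paper, while the `Op` class, the `schedule` function, the oldest-first pick order, and the idea that every op executes in exactly one cycle are simplifications made up for illustration.

```python
from dataclasses import dataclass

SCHEDULER_ENTRIES = 40  # from the paper title
PICK_WIDTH = 4          # "pick up to four ready instructions"


@dataclass(frozen=True)
class Op:
    """Hypothetical micro-op: a unique name plus the ops it depends on."""
    name: str
    sources: tuple = ()


def schedule(ops):
    """Return one list of op names per cycle, in the order they were picked.

    Assumptions (not from the paper): ops are given oldest first, the
    picker prefers the oldest ready ops, and every op executes in one cycle.
    """
    if len(ops) > SCHEDULER_ENTRIES:
        raise ValueError("toy scheduler holds at most 40 ops")
    waiting = list(ops)
    done = set()   # names of ops whose results have been forwarded
    cycles = []
    while waiting:
        # Wakeup: an op is ready once all of its producers have executed.
        ready = [op for op in waiting if all(s in done for s in op.sources)]
        if not ready:
            raise ValueError("dependency cycle or missing producer")
        # Pick: take at most PICK_WIDTH ready ops, oldest first.
        picked = ready[:PICK_WIDTH]
        picked_names = {op.name for op in picked}
        cycles.append([op.name for op in picked])
        # Single-cycle execute: results wake dependents for the next cycle.
        done |= picked_names
        waiting = [op for op in waiting if op.name not in picked_names]
    return cycles


example_ops = [
    Op("a"), Op("b"), Op("c", ("a", "b")), Op("d", ("c",)),
    Op("e"), Op("f"), Op("g"),
]
example_cycles = schedule(example_ops)
```

With the example ops above, five ops are ready in the first cycle but only four can be picked, so "g" waits a cycle alongside "c", and "d" cannot issue until "c" has produced its result.  The back-to-back timing between a producer and its dependent is exactly what the quote says must fit into a single clock.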

What is perhaps the most interesting aspect of these new designs is the use of standard cells vs. fully custom cells.  Place and route of standard cells can be automated, and it is relatively easy to create complex designs fairly quickly.  Custom cell layout is very complex and time consuming, but it has the advantage of being very efficient in terms of power consumption, and it switches faster than standard cell designs.  Somehow AMD has taken a standard cell design built on GLOBALFOUNDRIES' 32 nm SOI process and made it perform at custom cell levels.  The integer execution units and the scheduler can run at the same 3.5 GHz+ speed as the rest of the chip, even though portions of their design are made with standard cells.

This has apparently allowed AMD to prototype these designs rapidly.  That means AMD can deliver to market faster than it could with a fully custom part, and it also allows AMD to further test the performance and attributes of the standard cell design, and possibly change it, without the time and manpower demands of custom cell work.  How AMD has achieved this is beyond me.  Being able to follow standard cell design rules and still achieve custom cell performance has been the holy grail of CPU/GPU design.  Obviously this has limitations, as the processor is not built entirely from standard cells.  I believe that Intel also utilizes some standard cell features in their latest series of processors, so AMD is not exactly alone here.

Power

Previous AMD processors were not designed from the ground up to implement complex and efficient power saving schemes.  Since Bulldozer is a new design altogether, the engineers are able to build power saving into the processor far more effectively.  Throughout the years we have seen small jumps forward from AMD with power saving techniques, but Bulldozer will be the first desktop/server product with a comprehensive suite of power saving technologies.

Bulldozer at ISSCC 2011 - The Future of AMD Processors - Processors 5

The CPU, in typical workloads (obviously this does not include “Furmark” in SLI/Crossfire situations), takes up the majority of power in a system.  Reducing power draw by a significant percentage at that one component will decrease overall system draw to a great degree.

AMD now has fully gated power to the individual cores, which allows them to be completely turned off when not in use.  Sharing functional units (such as fetch and decode) between the cores of a module, rather than replicating them for each core, also cuts down on the complexity, and thereby the power draw, of the overall processor relative to how many logical cores it has.  The clock grid (which provides the clock signals throughout the processor) has also been radically redesigned so as to be less of a power sink while still keeping the processor ticking along efficiently.

Clock gating, which turns off the clock to individual components such as execution units, has been much more thoroughly implemented.  There are something like 30,000 clock enables throughout the design, and this should allow an unprecedented amount of power savings (and heat reduction) even when the CPU is at high usage rates.  Even though a processor might be at 100% utilization, not all functional units are being used or need to be clocked.  With highly granular control over which units can be gated, overall power draw and heat production can be reduced dramatically even at high utilization rates.
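The arithmetic behind that claim is easy to sketch.  The Python below is purely illustrative: the unit names and relative clock power figures are hypothetical numbers chosen to make the sums easy to follow, not measurements from AMD, and `clocked_power` is a made-up helper.

```python
def clocked_power(units, enables):
    """Sum the clock power of every unit whose clock enable is asserted.

    units   : dict of unit name -> relative clock power (hypothetical figures)
    enables : dict of unit name -> bool; units not listed count as gated off
    """
    return sum(power for name, power in units.items() if enables.get(name, False))


# Hypothetical relative clock power per unit, in arbitrary units.
UNITS = {"int_alu0": 1.0, "int_alu1": 1.0, "agu0": 0.5, "agu1": 0.5, "fpu": 3.0}

# A pure integer workload: the core is "100% utilized", yet the FPU sits idle.
integer_only = {"int_alu0": True, "int_alu1": True, "agu0": True, "agu1": True, "fpu": False}

ungated = sum(UNITS.values())               # every unit clocked all the time
gated = clocked_power(UNITS, integer_only)  # idle FPU has its clock gated
savings = 1 - gated / ungated               # fraction of clock power saved
```

In this made-up case, gating the idle FPU halves the clock power even though the integer side is fully busy.  The real design makes that kind of decision at something like 30,000 separate points rather than five.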

AMD Turbo Core will also receive a great amount of attention.  The current Turbo Core we see in the X6 processors is somewhat underwhelming when we look at the overall complexity of AMD’s implementation.  For example, when three or fewer cores are being utilized on the X6 1090T, those cores will clock up to 3.6 GHz, while the others go down to 800 MHz.  There is no real fine tuning of performance or TDP here, just an “on/off” switch for clocking half of the cores 400 MHz higher while downclocking the rest.  This is fairly basic as compared to Intel’s system.  Now it seems that AMD is implementing a system much like Intel’s.  We should see Turbo frequencies that vary with the number of active cores, much closer to what Intel offers with Sandy Bridge.
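To show just how simple the current “on/off” behavior is, here is a sketch of the X6 1090T rule as described above.  The 3.6 GHz and 800 MHz figures come from the text, and the 3.2 GHz base clock follows from the 400 MHz boost.  The function name is hypothetical, and the rule that all six cores sit at base clock once more than three are busy is an assumption, since the text does not spell that case out.

```python
BASE_MHZ = 3200       # implied by the text: 3.6 GHz turbo is 400 MHz above base
TURBO_MHZ = 3600
IDLE_MHZ = 800
CORES = 6
TURBO_MAX_ACTIVE = 3  # turbo applies with three or fewer busy cores


def x6_turbo_clocks(active):
    """Return per-core clocks in MHz for a list of six busy/idle flags.

    Sketch of the rule described in the article, not AMD's actual firmware.
    """
    if len(active) != CORES:
        raise ValueError("expected one flag per core (6)")
    if sum(active) <= TURBO_MAX_ACTIVE:
        # Busy cores boost, the rest drop to the idle clock.
        return [TURBO_MHZ if busy else IDLE_MHZ for busy in active]
    # Assumption: with more than three busy cores, everything runs at base.
    return [BASE_MHZ] * CORES


two_busy = x6_turbo_clocks([True, True, False, False, False, False])
all_busy = x6_turbo_clocks([True] * 6)
```

There are only two outcomes here, which is the whole point: a Sandy Bridge style scheme would instead need something like a table of frequencies indexed by the number of active cores.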


Due to the ground up design of Bulldozer, and the focus on decreasing power draw and heat production, we should see a nice reduction in power draw across the entire processor.

In Closing

Bulldozer is a comprehensive blank sheet design, a jump very similar to the one the company made going from the K5/K6 to the original Athlon.  AMD certainly hopes that it will be able to compete more adequately with Intel in terms of overall performance per watt, as well as die size and transistor count.  When the Phenom was originally detailed, many thought that it would prove to be the counter to the Core 2 that AMD needed, but unfortunately that design was not forward thinking enough to adequately compete.  Up through the current generation of parts, Intel was able to use fewer transistors and a smaller die size to create products that were significantly faster than what AMD was able to provide.

All indications so far point to Bulldozer being at the very least a competitive design.  I believe that Intel will have an advantage in instructions per clock when handling single threaded and lightly threaded workloads.  But AMD certainly looks to counter that by providing processors which will clock higher than the Intel counterparts, yet still remain in the same thermal envelope as the competition.  AMD has also made a big push to cut down the transistor count yet still retain the necessary performance to compete.  We should see leaner, meaner die sizes from AMD when compared to Intel products in the same performance range.  Consider that the Phenom II X4 processors had almost the same die size as the Core i7 9x0 series from Intel, but simply could not compete at the same level.  Bulldozer looks to change that.

If AMD has designed the front end of each core as we hope they have, then in heavily threaded applications Bulldozer should have a distinct performance advantage as compared to the SMT based Intel parts.  This is not a given though.  Processor design is hard.  This much is obvious, as there are few CPU companies out there.  AMD has an aggressive approach with the Bulldozer design, and I can foresee much more work being done with the fetch and decode units in further generations of products to more adequately feed the integer and floating point execution units.  That being said, it still looks to be a very fast part across a variety of workloads.

Until AVX hits primetime, AMD should again have a performance advantage with their FPU/SIMD design.  Being able to do 4 x 64 bit or 2 x 128 bit FP/SSE instructions per clock will give much higher throughput than the competing Intel unit.  Only when AVX instructions are run will we see Intel take the lead with their current designs.

Bulldozer has a lot of heavy expectations laid upon it.  So far, the people at AMD have seemed very excited about it, probably much more excited than they were about the first Bobcat based Fusion processors (which have already proven to be a hit with OEMs and consumers alike).  We again must temper our expectations though, as we have been let down multiple times in the past by AMD and their new wonderchips.  Then again, the original Athlon and the follow up Athlon 64 proved to be quite successful and gave Intel a serious run for its money.  For the sake of competition, I hope Bulldozer can deliver.

More details can be found at AMD’s Bulldozer Blog.

