Physical Design and Power
AMD worked very closely with GLOBALFOUNDRIES to tailor its design as closely as possible to GF’s 14nm FinFET process. GF licensed its 14nm FF process from Samsung but has done some tweaking of its own. GF helped AMD develop a standard cell library that would maximize performance and minimize area. This allows AMD to rely more heavily on automated place and route of these cells while still achieving area and power improvements compared to more general standard cells and automation. Between hand-tuned areas, the specialized standard cell libraries, and the notched SRAM design, AMD is able to achieve some impressive area improvements, even compared to the competition, which features a majority of hand layout of transistors.
By utilizing a FinFET structure, AMD was able to improve power and switching characteristics compared to the 32 nm PD-SOI process used on Bulldozer/Piledriver and the 28 nm planar HKMG process used on Steamroller and Excavator. This change was one of the larger factors in improving the perf/watt of Zen as compared to previous CPUs.
The largest factor in improving overall efficiency is the power delivery setup that AMD has implemented. Specialized capacitors in the upper metal layers are strategically placed around the design. When portions of the CPU are downclocked or put in a low power state, these capacitors provide near instant power to these areas without the typical voltage droop we see in traditional designs. This allows the CPU to react faster to changes in P states.
AMD has taken power and clock distribution to a new level with Zen. Previously AMD had greatly expanded the power distribution in Excavator, and they have done even more with Zen. Being able to effectively and efficiently route power through a complex design is no simple task. Such changes increase metal layer complexity.
Per-core variation is a big factor in these large, multi-core CPUs. It could very well be that most of the cores are able to run at a given speed at lower power, but one core in particular requires significantly more power to run at that same speed. Previously, all cores were run at the higher power level so that the one weak core could reach the required speed. AMD takes this core variation into account and individually characterizes each core and its power draw at given clocks. This is all done on the chip itself and not at the fab. The more efficient cores can reduce their voltage independently of the slower, less efficient core. This saves power across the chip, allowing for better efficiency or higher speeds depending on the current focus of the chip.
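To get a feel for why per-core voltage matters, here is a minimal sketch in Python. It assumes dynamic power scales roughly with the square of voltage at a fixed clock (the usual CV²f approximation) and ignores leakage entirely. The core names and voltages are made up for illustration and are not AMD figures.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    # Hypothetical minimum voltage this core needs to hold the target clock.
    vmin_at_target: float


def relative_dynamic_power(voltage: float, reference_voltage: float) -> float:
    """Dynamic power relative to running at reference_voltage (P ~ V^2 at fixed clock)."""
    return (voltage / reference_voltage) ** 2


def shared_rail_voltage(cores: list[Core]) -> float:
    """With one shared voltage, every core must run at the weakest core's requirement."""
    return max(core.vmin_at_target for core in cores)


def per_core_savings(cores: list[Core]) -> float:
    """Fraction of dynamic power saved by letting each core run at its own voltage."""
    v_shared = shared_rail_voltage(cores)
    shared_total = float(len(cores))  # each core counts as 1.0 at the shared voltage
    per_core_total = sum(
        relative_dynamic_power(core.vmin_at_target, v_shared) for core in cores
    )
    return 1.0 - per_core_total / shared_total


# Illustrative values only: three efficient cores and one that needs more voltage.
example_cores = [
    Core("core0", 1.00),
    Core("core1", 1.00),
    Core("core2", 1.00),
    Core("core3", 1.25),
]

if __name__ == "__main__":
    print(f"Estimated dynamic power saved: {per_core_savings(example_cores):.0%}")
```

With these invented numbers, a single weak core forces a quarter more voltage on everyone, so the savings from running the other three at their own voltage are substantial. Real silicon will differ, since leakage and regulator losses are left out here.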
To achieve these power results, AMD has implemented a large and complex sensor network across the CPU that monitors multiple characteristics. There are 1300+ critical path monitors, which can report when there are issues with the overall speed of the chip (clocking too fast and not propagating signals correctly). There are 48 high speed power supply monitors, which can report when more power is required in certain parts of the core or when there is little draw. There are 20 thermal diodes throughout the design that report to the controller if portions of the chip are getting too hot or where there is more thermal headroom so the chip can continue to be clocked up. Finally, there are 9 high speed droop detectors that help to minimize voltage droop and inform the control fabric where to increase power.
These sensors all provide information that allows the control systems to maximize performance while keeping power consumption in check. This translates into higher clocks when needed, all the while controlling power consumption and heat production so the chip stays within thermal limits.
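AMD did not describe the actual control algorithm, so the following is only a rough sketch of how readings like these could feed a clock decision. Every name, threshold, and step size here is an assumption for illustration (including the 25 MHz step, the 95 C limit, and the 3600 to 4000 MHz range), not AMD's implementation.

```python
from dataclasses import dataclass


@dataclass
class SensorSnapshot:
    """Hypothetical summary of one sampling interval across the sensor types."""
    timing_margin_ok: bool    # from the critical path monitors
    supply_headroom_ok: bool  # from the power supply monitors
    max_temp_c: float         # hottest thermal diode reading
    droop_detected: bool      # from the droop detectors


def next_clock_mhz(
    current_mhz: int,
    snapshot: SensorSnapshot,
    *,
    temp_limit_c: float = 95.0,   # assumed thermal limit
    temp_margin_c: float = 5.0,   # assumed headroom required before clocking up
    step_mhz: int = 25,           # assumed clock step
    min_mhz: int = 3600,          # assumed lower bound
    max_mhz: int = 4000,          # assumed upper bound
) -> int:
    """Pick the next clock: back off on any warning, step up only with headroom."""
    must_back_off = (
        snapshot.droop_detected
        or not snapshot.timing_margin_ok
        or snapshot.max_temp_c >= temp_limit_c
    )
    if must_back_off:
        return max(min_mhz, current_mhz - step_mhz)

    has_headroom = (
        snapshot.supply_headroom_ok
        and snapshot.max_temp_c < temp_limit_c - temp_margin_c
    )
    if has_headroom:
        return min(max_mhz, current_mhz + step_mhz)

    # Neither a warning nor clear headroom: hold the current clock.
    return current_mhz
```

The design choice worth noting is the asymmetry: any single warning is enough to step down, while stepping up requires every signal to show headroom. That keeps the sketch conservative, which seems to be in the spirit of staying within thermal and timing limits.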
One area that AMD did not cover for us is its Infinity Fabric. This will be the communication basis for many of AMD’s upcoming products, from CPUs to APUs to GPUs. The fabric is not one particular technology, but several wrapped together. It is divided into a Data Fabric and a Control Fabric. The Data Fabric includes a coherent HyperTransport link that allows the CPU cores to communicate with each other very quickly and with low latency. This also allows multiple CCX modules to be combined while each can still access the other CCX modules incredibly quickly. The Control Fabric is the portion that integrates all of the sensors around the chip and communicates with the system management unit. That unit then sends signals to the clock and voltage controls to increase or decrease speeds and voltages. There are many open questions about this technology, and AMD will be revealing more at a later date.
All of these technologies come together to make one of the largest jumps in performance and efficiency in AMD’s time in the industry. The new core has increased IPC a tremendous amount as compared to the last generation Excavator core. The use of the new 14nm FF process has allowed AMD to increase power efficiency and shrink die sizes tremendously as compared to their previous 28 nm HKMG planar designs.
If there was one area that truly astounded me, it is the base and turbo clock speeds that AMD was able to achieve with this product. The 1800X being able to run at a base of 3.6 GHz and a boost speed of 4.0 GHz is very impressive considering that this is a totally new design being developed on a new process. Intel prefers the “tick-tock” pattern: develop a new design on a well known process, then develop a new process to produce a well known design. That approach takes many variables out of the mix, but AMD was able to balance it all out and provide a very impressive part right off the bat. This is a far cry from disappointments like the original Phenom.
All told AMD has delivered an impressive product to consumers despite many challenges and hardships for the company along the way.
Another great analytical article! Thanks Josh, keep ’em coming!
“The L3 cache acts as a victim cache which partially copies what is in L1 and L2 caches.”
It seems like it would be the opposite. The L3 is a victim cache for lines evicted from the L2, so by definition the L2 and L3 should never duplicate anything. This is why they are saying 20 MB for the 8 core chip (4 MB L2 + 16 MB L3).
You could very well be right. AMD was not exactly forthcoming about how it exactly works, but you certainly are correct about the definition of a victim cache (evicted cache lines). I need to ask a few more questions and get it correct. Thanks for reading!
Welp, looks like I’ll stick with my 5 yr old Intel Sandy V jay jay which will still outperform Rypoo @ 1080p as I can’t wait 5 years for developers not to implement so called AMD optimizations.