The Architectural Deep Dive
Josh sits down and gives us his thoughts on the brand new architecture from AMD that powers upcoming Temash and Kabini APUs.
AMD officially unveiled their brand new Bobcat architecture to the world at CES 2011. This was a very important release for AMD in the low power market. Even though netbooks were a dying breed at that time, AMD experienced a good uptick in sales thanks to the new Brazos platform's combination of price, performance, and power consumption. AMD believed that no single CPU design could span the full power consumption spectrum of CPUs at the time, so Bobcat was designed to fill the space from 1 watt to 25 watts. Bobcat was never able to get down to that 1 watt point, but the Z-60 was a 4.5 watt part with two cores and the full 80 Radeon cores.
The Bobcat architecture was produced on TSMC’s 40 nm process. AMD eschewed the 32 nm HKMG/SOI process that was being readied for the upcoming Llano and Bulldozer parts. In hindsight, this was a good idea. Yields took a while to improve on GLOBALFOUNDRIES’ new process, while the existing 40 nm process from TSMC was running at full speed. AMD was able to provide the market in fairly short order with good quantities of Bobcat based APUs. The product more than paid for itself, and while it was not exactly a runaway success that took many points of market share from Intel, it helped to provide AMD with some stability in the market. Furthermore, it gave AMD a very good foundation for low power parts that are feature rich and offer competitive performance.
The originally planned Brazos update did not happen. Instead, AMD introduced Brazos 2.0, a refresh built mostly on process improvements that offered slightly higher speeds while remaining in the same TDP range. The uptake of this product was limited, and it was clearly a minor refresh meant to buoy purchases of the aging product. Competition was coming from low power Ivy Bridge based chips, as well as AMD’s new Trinity products, which could reach TDPs as low as 17 watts. Brazos and Brazos 2.0 did find a home in low powered but full sized notebooks that were very inexpensive. Even manufacturers that lean heavily toward Intel, such as Toshiba, released Brazos based products in the sub-$500 market. The combination of good CPU performance and above average GPU performance made this a strong product in this particular market. It was so power efficient that only small batteries were typically needed, thereby further lowering the cost.
All things must pass, and Brazos is no exception. Intel has a slew of 22 nm parts that are encroaching on the sub-15 watt territory, ARM partners have quite a few products that are getting pretty decent in terms of overall performance, and the graphics on all of these parts are seeing some significant upgrades. The 40 nm based Bobcat products are no longer competitive with what the market has to offer. So at this time we are finally seeing the first Jaguar based products. Jaguar is not a revolutionary product, but it improves on nearly every aspect of performance and power usage as compared to Bobcat.
The Jaguar architecture will be powering both the Temash and Kabini products. This is a native quad core part with 2 MB of shared L2 cache. It features a 64 bit memory controller which supports low voltage DDR3 (DDR3L) memory at speeds up to 1600 MHz. The graphics portion of the APU is also upgraded to the latest GCN architecture. The chips are manufactured on TSMC’s 28 nm HKMG process.
From 10,000 feet the individual cores are very similar to Bobcat. There are dual decode units, dual ALU execution units, and similar L1 cache sizes. Having said that, no single piece of the architecture was left untouched. The 28 nm process allowed for a greater transistor budget as well as speed optimizations that enable higher clockspeeds while still remaining in the same TDP range.
The front end has some major upgrades which reduce power consumption, improve performance, and improve overall clockspeed characteristics. The i-cache is still 32 KB in size, but it is redesigned so that only four banks light up when doing an i-cache read. This results in a 75% decrease in the power consumed compared to the same operation on Bobcat. AMD also reworked the instruction cache (IC) prefetcher to improve overall IPC, grew the instruction buffer for improved fetch/decode decoupling, and added a new decode stage to increase clockspeed. Usually adding stages and increasing clockspeeds is detrimental to power usage, but with the advantages gained from the smaller 28 nm process AMD was able to offset these disadvantages, and the extra stage improved performance far more than it reduced power efficiency. The net result of these improvements is both power savings and increased IPC. This will be a common theme throughout this article.
The integer units are next. Like Bobcat, the Jaguar based cores feature two ALUs, a single LD AGU (load address generation unit), and a single ST AGU (store address generation unit). Jaguar features a new hardware divider that was taken from Llano (higher efficiency and higher performance). AMD added support for a lot of new operations in Jaguar, and on the integer side this comes in the form of new or improved COPs (complex operations) such as CRC32/SSE4.2, BMI1, POPCNT, and LZCNT. Several of these belong to SSE4.2, but they are implemented on the integer side. Interestingly enough, AMD has featured POPCNT and LZCNT since Barcelona (the original Phenom). AMD has increased out-of-order (OOO) resources so the core can more efficiently utilize the integer pipelines by having a larger selection of operations to choose from. Finally, there are larger schedulers and a larger re-order buffer (again for better OOO performance and efficiency).
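To make the new integer operations a little more concrete, here is a minimal Python sketch of what POPCNT, LZCNT, a few BMI1 operations, and the SSE4.2 CRC32 instruction compute. The function names are made up for illustration, the 32 bit operand width is an assumption (the instructions also come in other widths), and this only models the results. It says nothing about how Jaguar implements them in hardware.

```python
# Reference semantics only; on hardware each of these is a single instruction.
# Function names are illustrative and the 32-bit width is an assumption.
MASK32 = 0xFFFFFFFF

def popcnt32(x):
    """POPCNT: count the set bits."""
    return bin(x & MASK32).count("1")

def lzcnt32(x):
    """LZCNT: count leading zero bits; returns 32 for a zero input."""
    return 32 - (x & MASK32).bit_length()

def tzcnt32(x):
    """TZCNT (BMI1): count trailing zero bits; returns 32 for a zero input."""
    x &= MASK32
    if x == 0:
        return 32
    return (x & -x).bit_length() - 1

def andn32(a, b):
    """ANDN (BMI1): (NOT a) AND b."""
    return (~a & b) & MASK32

def blsi32(x):
    """BLSI (BMI1): isolate the lowest set bit."""
    x &= MASK32
    return x & -x

def blsr32(x):
    """BLSR (BMI1): clear the lowest set bit."""
    x &= MASK32
    return x & (x - 1)

def crc32c_step(crc, byte):
    """One byte of CRC accumulation with the Castagnoli polynomial,
    which is the polynomial the SSE4.2 CRC32 instruction uses."""
    crc ^= byte
    for _ in range(8):
        crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc

def crc32c(data):
    """Conventional CRC-32C: software adds the initial and final inversion
    around the accumulation step."""
    crc = MASK32
    for byte in data:
        crc = crc32c_step(crc, byte)
    return crc ^ MASK32
```

Compilers normally emit these instructions through intrinsics or automatically when the target supports them, so code like the loops above collapses to a single operation each.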
The floating point unit also received some major changes. It is a native 128 bit unit as compared to the dual 64 bit units of Bobcat. It can push out four single precision multiplies and four single precision adds per cycle. Alternatively, it can perform one double precision multiply and two double precision adds per cycle. It now supports 256 bit AVX instructions by double pumping the 128 bit unit (AMD did not mention how many cycles this particular operation would take, though). New zero optimizations were added. Finally, AMD added a second FP physical register file stage to increase frequency. This is again a tradeoff in power efficiency that more than pays for itself in performance and clock headroom.
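The per-cycle figures above translate into peak throughput with simple arithmetic, and the double pumping idea can be sketched in a few lines. In this rough Python illustration the 1.5 GHz clock is purely a placeholder assumption (it is not a quoted Jaguar clockspeed), the function names are invented, and the "passes" count only reflects that a 256 bit operation is split into two 128 bit halves. It does not claim any cycle count.

```python
# Per-cycle figures from the text above.
SP_MULS_PER_CYCLE = 4
SP_ADDS_PER_CYCLE = 4
DP_MULS_PER_CYCLE = 1
DP_ADDS_PER_CYCLE = 2

# Hypothetical clock used only to make the arithmetic concrete.
ASSUMED_CLOCK_GHZ = 1.5

def peak_gflops(cores, clock_ghz, muls_per_cycle, adds_per_cycle):
    """Theoretical peak: cores x clock x floating point ops per cycle."""
    return cores * clock_ghz * (muls_per_cycle + adds_per_cycle)

def add_128(a, b):
    """One pass through a 128-bit unit: four single precision lanes."""
    assert len(a) == len(b) == 4
    return [x + y for x, y in zip(a, b)]

def add_256_double_pumped(a, b):
    """A 256-bit AVX add (eight lanes) issued as two 128-bit passes.
    Returns the result and the number of passes, not a cycle count."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])
    high = add_128(a[4:], b[4:])
    return low + high, 2

if __name__ == "__main__":
    sp = peak_gflops(4, ASSUMED_CLOCK_GHZ, SP_MULS_PER_CYCLE, SP_ADDS_PER_CYCLE)
    dp = peak_gflops(4, ASSUMED_CLOCK_GHZ, DP_MULS_PER_CYCLE, DP_ADDS_PER_CYCLE)
    print(f"Peak SP at the assumed clock: {sp} GFLOPS")
    print(f"Peak DP at the assumed clock: {dp} GFLOPS")
```

The takeaway is that AVX support through double pumping adds compatibility rather than width: an eight lane operation still flows through the same four lane hardware.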
The 32 KB data cache and the load/store queues again receive a large amount of attention and are essentially redesigned. The L1D is still 32 KB and 8-way associative, the L2 data TLB covers 512 4 KB pages, and there is an 8-stream data cache prefetcher. AMD included an improved OOO picker, improved STLF (store-to-load forwarding), and reduced store data shuffling. This again improves overall OOO performance and means fewer cycles wasted waiting for data. The data cache also features a true 128 bit path to the FPU. The overall improvements here will greatly affect both integer and floating point/SSE/AVX performance as compared to the original Bobcat.
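A quick back-of-the-envelope calculation shows what those cache and TLB figures mean in practice. This sketch assumes a 64 byte cache line, which the text above does not state, and the helper names are invented for illustration.

```python
KB = 1024

# Assumption: 64-byte cache lines (not stated in the article).
ASSUMED_LINE_BYTES = 64

def tlb_reach_bytes(entries, page_bytes):
    """Memory that can be mapped without a TLB miss."""
    return entries * page_bytes

def cache_sets(size_bytes, ways, line_bytes=ASSUMED_LINE_BYTES):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)

if __name__ == "__main__":
    reach = tlb_reach_bytes(512, 4 * KB)
    print(f"L2 data TLB reach: {reach // KB} KB")
    print(f"L1D sets at the assumed line size: {cache_sets(32 * KB, 8)}")
```

With 512 entries of 4 KB each, the second level data TLB covers 2 MB of memory, which happens to match the size of the shared L2 cache.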
The final “core” improvement is to the portion which communicates with the L2 cache and the northbridge (I/O, memory controller, crossbar). It handles up to eight outstanding data cache misses or prefetches and three outstanding instruction cache misses or prefetches, and it offers improved write combining with four WCBs (write combine buffers).
The L2 cache was redesigned from the ground up, and no aspect of it went untouched. All connections are routed through the L2 interface. This means I/O, memory, and GPU accesses all utilize the L2 cache interface. The changes are nearly too numerous to list, but this diagram captures most of the major ones for the L2 interface. Note that while the L2 interface block runs at core clock, the L2 data (L2D) arrays run at half speed to save power. L2 caches are fairly notorious for their power consumption, and AMD has designed these to still provide very good performance while burning a lot less power in the process.
There is a total of 2 MB of L2 cache that is 16-way set associative, arranged as four banks of 512 KB each. The L2 cache again runs at half speed, and banks can also be turned off when not in use. It is also dynamic in that if a single core requires more cache, more can be allocated to that core. This obviously increases performance in single threaded applications, as the core can access a lot more L2 than when all four cores are equally loaded.
The L2 cache is also inclusive. For many years it appeared as though an exclusive cache was the better implementation, but this was before L2 caches became as large as they are now. With previous AMD architectures, a smaller 256 KB L2 cache would have had to duplicate either 64 KB or 128 KB of L1 in an inclusive design, so AMD chose exclusive to maximize cache space. Now that the entire L2 is 2 MB in size and the L1 caches of each core total 64 KB, there is a lot more wiggle room and performance does not take a hit due to cache pressure. Inclusion helps efficiency and performance because every line in the L1 instruction and data caches is also present in the L2, which lets the L2 act as a probe filter. Essentially, when a core changes a value in its L1 cache and another core requires that information, the request can go directly to the shared L2 cache instead of probing the L1 cache of the other core. This saves a few extra hops from core to core, which improves performance and lowers power consumption.
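The inclusive versus exclusive tradeoff comes down to how much of the L2 is spent duplicating L1 contents, and a few lines of Python make the numbers explicit. The 64 byte line size and the idea that each 512 KB bank is itself 16-way are assumptions made for this sketch, and the helper names are invented.

```python
KB = 1024
MB = 1024 * KB

# Assumptions for illustration: 64-byte lines, each bank 16-way on its own.
ASSUMED_LINE_BYTES = 64

def sets_per_bank(bank_bytes, ways, line_bytes=ASSUMED_LINE_BYTES):
    """Number of sets in one L2 bank."""
    return bank_bytes // (ways * line_bytes)

def unique_capacity(l2_bytes, total_l1_bytes, inclusive):
    """Distinct data the L1 + L2 hierarchy can hold.
    Inclusive: the L1 contents are duplicated inside the L2.
    Exclusive: L1 and L2 hold different lines, so capacities add."""
    return l2_bytes if inclusive else l2_bytes + total_l1_bytes

def duplicated_fraction(l2_bytes, total_l1_bytes):
    """Share of an inclusive L2 that mirrors L1 contents."""
    return total_l1_bytes / l2_bytes

if __name__ == "__main__":
    # Jaguar figures from the text: 2 MB L2, four cores with 64 KB of L1 each.
    jaguar = duplicated_fraction(2 * MB, 4 * 64 * KB)
    # Older design from the text: 256 KB L2 with 64 KB or 128 KB of L1.
    older_small = duplicated_fraction(256 * KB, 64 * KB)
    older_large = duplicated_fraction(256 * KB, 128 * KB)
    print(f"Jaguar L2 spent on L1 copies: {jaguar:.1%}")
    print(f"Older 256 KB L2 spent on L1 copies: {older_small:.0%} to {older_large:.0%}")
    print(f"Sets per 512 KB bank: {sets_per_bank(512 * KB, 16)}")
```

Giving up an eighth of a 2 MB cache to inclusion is a much easier price to pay than giving up a quarter to a half of a 256 KB cache, which is why the choice flips as the L2 grows.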
Each Jaguar core is approximately 3.1 mm² in size, compared to about 4.9 mm² for the 40 nm based Bobcat. At roughly 63% of Bobcat’s area it is not quite half the size, but it is still small enough for AMD to fit four of them in a design rather than the two in Bobcat. SRAM also scales nicely going down to 28 nm, so we see a doubling of the L2 cache.
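The area figures are easy to sanity check. The sketch below uses only the two core sizes quoted above. The "ideal" scaling factor simply squares the ratio of the process node names, which is a textbook simplification and not a claim about how TSMC's processes actually scale.

```python
JAGUAR_CORE_MM2 = 3.1   # 28 nm, from the text
BOBCAT_CORE_MM2 = 4.9   # 40 nm, from the text

def area_ratio(new_mm2, old_mm2):
    """New core area as a fraction of the old core area."""
    return new_mm2 / old_mm2

def ideal_shrink(new_nm, old_nm):
    """Idealized area scaling: square of the linear node ratio (simplification)."""
    return (new_nm / old_nm) ** 2

if __name__ == "__main__":
    ratio = area_ratio(JAGUAR_CORE_MM2, BOBCAT_CORE_MM2)
    ideal = ideal_shrink(28, 40)
    print(f"Jaguar core vs Bobcat core: {ratio:.0%} of the area")
    print(f"Idealized 40 nm to 28 nm shrink: {ideal:.0%} of the area")
    print(f"Four Jaguar cores: {4 * JAGUAR_CORE_MM2:.1f} mm2")
    print(f"Two Bobcat cores: {2 * BOBCAT_CORE_MM2:.1f} mm2")
```

Under that idealized scaling a straight shrink of Bobcat would land near 2.4 mm², so the gap up to 3.1 mm² is a rough indication of how much new logic went into each Jaguar core. Four Jaguar cores still occupy only about a quarter more area than two Bobcat cores did.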