Steamroller in Slightly Better Focus
Steamroller looks to improve upon what Piledriver has brought to market. Unfortunately for those interested in the nuts and bolts of how this is accomplished, Mr. Papermaster keeps this particular overview very general. We do not know the exact details of the changes that will be made to current structures to deliver these performance improvements. The first slide shows that AMD will not radically change the overall design with Steamroller. It will still feature the shared fetch and decode portions, the two integer execution units, and the single shared floating point/MMX/SSE/AVX structure. The goals are to feed the cores faster, improve single threaded IPC, and continue to push performance per watt.
The next slide gives us some decent information. First off, the fetch unit will receive some significant upgrades. Branch prediction will be improved to reduce mispredicts, and the instruction caches will be bigger. One of the big complaints about Bulldozer was the relatively small L1 data and instruction caches. These will be larger, and latency will be improved as well. The next big improvement will be dedicated decode units for each integer execution unit. These decode units will also service the floating point unit, depending on thread priority. AMD estimates these changes will lead to a 30% improvement in operations per cycle.
Single core execution looks to get a boost by essentially getting data to the execution units faster. This includes better scheduling and more register resources without increasing latency. The larger L1 d-cache will not just be bigger, but will handle data cache misses better and will have major improvements in store handling. In the past AMD has talked candidly about where some of the weaknesses of Bulldozer were, and this is certainly an area that is receiving major improvements.
Steamroller will also receive a good performance per watt boost with the new design. One area of distinct concern is floating point efficiency. The current design has a unit that is shared, but is essentially single threaded. Even though it has enough execution units to do 2 x 128 bit SSE4 operations, if the thread does not require two operations then the other unit is left idle. With the dual decode units feeding the floating point unit, it appears as though AMD might be working on better interleaving of threads to improve efficiency and increase utilization of floating point resources. This theoretically should allow the unit to finish work more quickly and then go to sleep sooner. This is speculation on my part, as AMD is not giving us nearly the information that we require (or at least desire). The L2 cache can also be resized depending on workload, so it does appear as though a portion of the L2 cache can be put into a sleep mode if it is not required. This again saves power.
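To make the utilization argument a little more concrete, here is a toy Python model of a floating point unit with two 128 bit pipes shared by two threads. It is only a sketch of the speculation above, not AMD's actual scheduler. The function names, the "bundle" representation (how many independent operations a thread has ready at each step), and the round-robin arbitration are all assumptions made for illustration.

```python
from collections import deque

PIPES = 2  # two 128-bit FP pipes per module, as described in the article


def cycles_exclusive(thread_a, thread_b):
    """One thread owns the FPU each cycle; any pipe it cannot fill sits idle.

    Each list entry is a hypothetical "bundle": the number of independent
    ops (1..PIPES) that thread has ready at that step.
    """
    queues = [deque(thread_a), deque(thread_b)]
    cycles, turn = 0, 0
    while queues[0] or queues[1]:
        if not queues[turn]:
            turn ^= 1            # other thread still has work
        queues[turn].popleft()   # issue this thread's whole bundle
        cycles += 1
        turn ^= 1                # simple round-robin (an assumption)
    return cycles


def cycles_interleaved(thread_a, thread_b):
    """Both threads may issue in the same cycle, filling otherwise idle pipes."""
    queues = [deque(thread_a), deque(thread_b)]
    cycles, turn = 0, 0
    while queues[0] or queues[1]:
        free = PIPES
        for t in (turn, turn ^ 1):       # priority thread first
            if queues[t] and free > 0:
                issued = min(queues[t][0], free)
                queues[t][0] -= issued
                free -= issued
                if queues[t][0] == 0:
                    queues[t].popleft()
        cycles += 1
        turn ^= 1
    return cycles


def utilization(thread_a, thread_b, cycles):
    """Fraction of pipe slots that did useful work."""
    return (sum(thread_a) + sum(thread_b)) / (cycles * PIPES)


# Two threads that each only ever have one FP op ready per step.
a = [1] * 8
b = [1] * 8
print(utilization(a, b, cycles_exclusive(a, b)))    # 0.5
print(utilization(a, b, cycles_interleaved(a, b)))  # 1.0
```

In this deliberately simple case the interleaved unit finishes the same work in half the cycles, which is the "finish sooner, sleep sooner" effect described above. Real workloads would fall somewhere in between, and we do not know how AMD actually arbitrates between threads.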
One new area of interest for us when it comes to next generation architecture is that of the design tools that AMD is implementing. Taking a big page from the GPU guys, AMD is utilizing more and more automated place and route. This approach has advantages and disadvantages for any design. Typically hand placed designs (aka custom cells) can be more dense and, because the power properties are well known, tend to be more efficient and can be clocked to higher speeds. Automated place and route typically depends on “standard cell” designs which are easier to fit together, but typically take up more space, use more power, and clock much lower. The “High-Density” cell library that AMD is using for some of their graphics work is now being applied to the CPU. The example they use is that the HD cell library decreases the size of the functional unit by 30% compared to the hand laid out version. AMD claims that it not only saves on die space, but is more power efficient. AMD did not comment on clocking ability though.
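It is worth keeping in mind that a 30% reduction in one functional unit is not a 30% reduction in the whole die. The short Python sketch below shows the arithmetic. The die area and the fraction of the die occupied by the block are made-up numbers for illustration only; AMD has not provided them.

```python
def die_area_after_shrink(die_area_mm2, block_fraction, block_shrink=0.30):
    """Return total die area after shrinking one block by `block_shrink`.

    die_area_mm2 and block_fraction are hypothetical inputs; only the 30%
    figure comes from AMD's example.
    """
    block = die_area_mm2 * block_fraction
    rest = die_area_mm2 - block
    return rest + block * (1.0 - block_shrink)


# Hypothetical: a 100 mm^2 die where the affected block is 20% of the area.
new_area = die_area_after_shrink(100.0, 0.20)
saving = 100.0 - new_area
print(f"new die area: {new_area:.1f} mm^2, saved: {saving:.1f} mm^2")
```

With these placeholder numbers the die shrinks by about 6%, so the overall benefit depends heavily on how much of the chip AMD moves over to the HD library.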
Finally AMD showcases their latest work with the SeaMicro folks. SeaMicro was purchased by AMD and is well known for their dense and efficient server solutions. These solutions typically utilize low TDP processors, but feature a unique interconnect fabric that is wide, powerful, and energy efficient. AMD will be introducing Opteron based units utilizing this fabric technology. The board, as shown here, is very small yet still very powerful. Eventually AMD will implement Steamroller based APUs in the Opteron environment, and these products will be a primary focus for the SeaMicro fabric technology.
AMD is obviously not standing still, and they are working to improve the architecture that was introduced with Bulldozer. AMD is doing things a bit differently this time. Instead of focusing on large, complex jumps in performance with long time frames between architectures, they are working in smaller steps to improve not just process technology, but overall design and implementation. We do not know if this will help AMD catch up to Intel in overall performance and power consumption, but considering the size of AMD and their R&D budget, this approach may allow them to stay more competitive without the risk of making a huge mistake with a comprehensive blank slate design. This at least allows them to see what works, what can perform better, and what simply drags the design down. With this information in hand, AMD can more quickly turn around and address issues with a particular design.
According to sources, AMD expects to introduce the first Steamroller parts in early 2H 2013.