AMD lines up Llano
Today we get our first official release of the AMD Llano APU in the form of a mobile reference platform. Can this technology finally put AMD on par with Intel?
2006. That was the year the product we are reviewing today was first conceived, and the year that AMD and ATI merged in a $5.4 billion deal that left many readers scratching their heads. At the time, pairing the 2nd place microprocessor company with the 2nd place graphics technology vendor might have seemed like an odd arrangement, even with the immediate benefit of a unified platform of chipset, integrated graphics and processor to offer to mobile and desktop OEMs. In truth though, that was a temporary solution to a longer term problem that we now know as heterogeneous computing: the merging not just of these companies but of all the computing workloads of CPUs and GPUs.
Five years later, and by most accounts more than a couple of years late, the new AMD, now without its own manufacturing facilities, is ready to release the first mainstream APU, or Accelerated Processing Unit. While the APU name is something that the competition hasn’t adopted, the premise of a CPU/GPU combination processing unit is not just the future, it is the present as well. Intel has been shipping Sandy Bridge, the first mainstream silicon with a CPU and GPU truly integrated together on a single die, since January 2011, and AMD no longer has the timing advantage that we thought it would when the merger was announced.
For sanity’s sake, I should mention the Zacate platform, released in November of 2010, which combines an ATI-based GPU with a custom low power x86 core called Bobcat for the netbook and nettop market. As much as we like that technology, it doesn’t have the performance characteristics to address the mainstream market, and that is exactly where Llano comes in.
AMD Llano Architecture
Llano’s architecture has been no secret over the last two years, as AMD has let details and specifications leak at a slow pace in order to build interest and excitement over the pending transition. That information release has actually slowed this year, though, likely to temper expectations for the first generation APU now that the Sandy Bridge processor has proven more potent than perhaps AMD expected. And in truth, while the Llano design as a whole is brand new, all of the components that make it up have been seen before: both the x86 Stars core and the Radeon 5000 series-class GPU have been tested and digested on PC Perspective for many years.
For today’s launch we were given a notebook reference platform for the Llano architecture called "Sabine". While the specifications we are looking at here are specific to this mainstream notebook platform nearly all will apply to the desktop release later in the year (perhaps later in the month actually).
The platform diagram above gives us an overview of what components will make up a system built on the Llano Fusion APU design. The APU itself is made up of 2 or 4 x86 CPU cores that come from the Stars family released with the Phenom / Phenom II processors. They do introduce a new Turbo Core feature, which we will discuss later, that is somewhat analogous to what Intel has done on its processors with Turbo Boost.
A large portion of the chip is of course the "Radeon Core Array", the GPU-based SIMD units that will handle the graphics computing tasks and GPU-based portions of the heterogeneous software. This is a DirectX 11 class GPU, though obviously with fewer stream processors at a lower frequency than we have seen in discrete cards. A new UVD (unified video decoder) is included for improved visual quality and efficiency.
The memory controller on the APU is a dual-channel DDR3 design that has been reworked quite a bit in order to improve performance on the combined CPU/GPU workload. On discrete graphics cards even low-end GPUs will have access to hundreds of GB/s of bandwidth, while on the Llano design the entire chip has less than 30 GB/s for all tasks. We will go over some of the physical and architectural changes a bit later.
The chipset for the Sabine Llano platform is being referred to as the Fusion Controller Hub and will come in two flavors: A70M and A60M. The higher end option will include integrated support for USB 3.0 ports as well as SATA 6G connectivity and some general purpose PCIe ports.
This labeled diagram of the Llano APU shows the die space given to each of these different components. The array of graphics processing units dominates the design, taking up about 50% of the space, a fact that AMD likes to point out in comparison to the ~25% on Intel’s Sandy Bridge. The four x86 CPU cores don’t take up nearly as much physical space if you don’t include the hefty 4MB of L2 cache. The DDR3 memory controller is the other dominant physical feature, followed by the PCIe channels and display connections at the bottom of the image.
I mentioned earlier that the memory controller had gone through some changes with the Llano design in an attempt to make up for the memory bandwidth deficiencies seen when moving from a discrete controller to an integrated one. Mike Goddard of AMD, speaking at the Llano Tech Day in Abu Dhabi, described a "Radeon Memory Bus" that allows the GPU SIMD array to access system memory at a "very high bandwidth" and that is given priority access to system memory. The fact is that memory bandwidth is the single biggest bottleneck for integrated graphics performance on processors found in cell phones, notebooks and desktops. Graphics performance will scale nearly linearly with memory bandwidth increases, and the first company to really figure this problem out will take a dramatic lead. Even with Llano that still hasn’t happened: no matter how much "priority" is given to the GPU for memory access, you are still limited to the 29.8 GB/s that the dual-channel DDR3 memory controller can provide at best.
The "Fusion Compute Link" provides a way for the GPU portion of the APU to access memory shared with the CPU, allowing for improved performance on applications that use coherent memory. OpenCL and other GPGPU applications can benefit quite a bit from hardware that doesn’t need to spend time copying data around the APU, and this internal pathway prevents that in some cases. There is no shared cache between the CPU and GPU portions of the APU though, which is in contrast to the shared L3 cache on the Sandy Bridge processor from Intel.
The x86 CPU cores on the Llano APU are based on the same "Stars" architecture as the current generation of Phenom processors, though with some minor tweaks to improve IPC (instructions per clock) performance by ~6%. These are the first Stars cores built on the 32nm process technology at GlobalFoundries, so there are still some open questions about their performance and efficiency. The target TDPs for the mobile market are 35W and 45W, while the desktop market will see at least 65W and 100W versions later in the year. CPU frequencies will scale from 1.4 GHz to 2.9 GHz, with the lower end finding its way into notebooks.
The memory controller on the Llano APU is likely the most modified portion of the design. Even with a maximum notebook bandwidth of only 25.6 GB/s and a max of 29.8 GB/s on the desktop designs, AMD claims that the GPU on the Llano chip sees a 4x bandwidth increase over previous generations. Considering AMD’s previous generation was a chipset-based integrated graphics solution, this statistic doesn’t sound nearly as impressive, though combined with the reduced latency, lower power and smaller footprint associated with Llano it is still a drastic improvement for mobile system designers. AMD’s claims of "discrete level graphics on a chip" do hold up, but without a doubt the memory bandwidth constraints of standard CPU-class memory controllers are still holding graphics technology back.
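To put those bandwidth figures in context, here is a quick back-of-the-envelope sketch of where a number like 25.6 GB/s comes from. It assumes a 64-bit (8 byte) bus per channel and DDR3-1600 and DDR3-1866 memory speeds for notebook and desktop respectively; those speed grades are my assumption rather than something stated above, and the function name is purely illustrative.

```python
def peak_bandwidth_gbs(mega_transfers_per_sec, channels=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    bytes_per_transfer=8 assumes a 64-bit wide DDR3 channel.
    """
    return mega_transfers_per_sec * bytes_per_transfer * channels / 1000


# Assumed speed grades: DDR3-1600 for notebooks, DDR3-1866 for desktops.
notebook = peak_bandwidth_gbs(1600)
desktop = peak_bandwidth_gbs(1866.67)

print(f"notebook: {notebook:.1f} GB/s")  # notebook: 25.6 GB/s
print(f"desktop:  {desktop:.1f} GB/s")   # desktop:  29.9 GB/s
```

Under those assumptions the notebook figure matches exactly, and the desktop result of roughly 29.87 GB/s lines up with the 29.8 GB/s quoted if the number is truncated rather than rounded. Either way, this is a theoretical ceiling shared by the CPU and GPU, not what the graphics cores will see in practice.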
An interesting question was brought up during the briefing about the idea of sideband memory, dedicated memory for the integrated graphics on the APU similar to what we saw on some previous AMD platform motherboards and the Xbox 360 gaming console. AMD said that there was no option for that in the current APU design as it would require a separate memory controller for the GPU and thus a much larger die, sacrificing many of the benefits of an APU to begin with.
On the platform diagram you might have seen that the Llano APU has 24 lanes of PCI Express 2.0 on-board – well it actually has 32 lanes! The catch is that 8 of those are used for internal general purpose communication leaving 24 for use by the platform. Only two of the sets of 8 are capable of handling discrete graphics solutions though so you can run a single x16 connection for a single graphics card or a pair of x8 connections for multi-GPU configurations. Honestly though, if you are going to try and run CrossFire on the Llano platform you are completely missing the point – just buy a Phenom II system instead.
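The lane accounting is easy to lose track of, so here it is written out as a small sketch. The constants come straight from the numbers above; the function names are hypothetical and only there to make the arithmetic explicit.

```python
TOTAL_LANES = 32      # PCIe 2.0 lanes physically on the APU
INTERNAL_LANES = 8    # used for internal general purpose communication
GRAPHICS_GROUPS = 2   # sets of 8 lanes that can drive discrete graphics
GROUP_WIDTH = 8


def platform_lanes():
    """Lanes left for the platform after the internal ones are taken out."""
    return TOTAL_LANES - INTERNAL_LANES


def graphics_layouts():
    """Discrete graphics link layouts: one x16 card or a pair of x8 cards."""
    single_card = (GRAPHICS_GROUPS * GROUP_WIDTH,)
    dual_card = (GROUP_WIDTH,) * GRAPHICS_GROUPS
    return [single_card, dual_card]


def remaining_lanes():
    """Platform lanes that are not graphics capable."""
    return platform_lanes() - GRAPHICS_GROUPS * GROUP_WIDTH


print(platform_lanes())    # 24
print(graphics_layouts())  # [(16,), (8, 8)]
print(remaining_lanes())   # 8
```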
AMD Turbo Core Technology
After the first generation of Turbo Boost technology on the Intel Nehalem processors, it was obvious that AMD needed to offer a similar implementation on its processors to stay current. Being able to combine the behavior of a multi-core processor at lower frequencies and a single-core processor at higher frequencies within a single TDP has really made the consumer’s life much better.
As we have come to see over the last few years with the changing workloads on processors, power consumption and active core count vary quite a bit based on the task the PC is focused on at the time. The above diagram that AMD created gives us a general view of how web, productivity, 3D creation and video creation workloads affect the active CPU core count. You can see that for the web and productivity scenarios all four cores are in use only a few percent of the time, and even two cores are used at most 20% of the time. When we get into 3D and video production, though, the capability of software to take advantage of multiple cores expands, and 3-4 cores are used nearly 50% of the time during video creation.
With this power consumption and core utilization information it is easy to see then why finding a way to take advantage of the TDP headroom is so essential to designing the most efficient processor.
AMD’s method to monitor and take advantage of this headroom is different from the analog method that Intel has integrated on its processors. AMD Turbo Core digitally measures the activity of the CPU with integrated power monitoring logic to estimate the power consumption / TDP being used on a per core basis, and then passes that information to the APU north bridge. The NB sums all the power and TDP information and passes it to a third block, the P-state manager logic, which dithers between P-states in order to stay within the pre-determined TDP of the APU.
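As a rough mental model of that three stage flow (per core estimate, north bridge sum, P-state selection), here is a minimal sketch. To be clear, this is not AMD’s implementation: the P-state table, the wattage values, the 2.4 GHz boost clock and all of the names are hypothetical, chosen only to show the shape of the logic. Only the 1.5 GHz base clock matches our mobile sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    freq_ghz: float
    full_load_watts: float  # hypothetical per core power at 100% activity


# Hypothetical table, ordered fastest first.
P_STATES = [
    PState("boost", 2.4, 12.0),
    PState("base", 1.5, 6.0),
    PState("low", 0.8, 3.0),
]

IDLE_WATTS = 0.5  # hypothetical per core idle power


def estimate_core_power(activity, pstate):
    """Digital estimate for one core: scale power by an activity level in 0..1."""
    return IDLE_WATTS + activity * (pstate.full_load_watts - IDLE_WATTS)


def northbridge_sum(activities, pstate):
    """The north bridge step: add up the per core estimates."""
    return sum(estimate_core_power(a, pstate) for a in activities)


def select_pstate(activities, cpu_budget_watts):
    """The P-state manager step: fastest state whose estimate fits the budget."""
    for pstate in P_STATES:
        if northbridge_sum(activities, pstate) <= cpu_budget_watts:
            return pstate
    return P_STATES[-1]


# One busy core leaves headroom to boost; four busy cores do not.
print(select_pstate([1.0, 0.0, 0.0, 0.0], 25.0).name)  # boost
print(select_pstate([1.0, 1.0, 1.0, 1.0], 25.0).name)  # base
```

Because the decision is driven by counted activity rather than a measured current or temperature, two chips fed the same workload would make the same choice in this model, which is the consistency argument AMD is making.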
AMD’s version of Turbo differs from Intel’s in that it uses digitally measured activity as its source and then has very specific power steppings. The Turbo mode on Llano will thus be much more reliable and consistent from processor to processor than Intel’s Turbo Boost Technology, which relies on analog measurements and even ambient temperature, both of which will vary from system to system and chip to chip. As a reviewer, the consistency is nice, but there are definitely advantages to Intel’s approach, which allows each piece of silicon to theoretically reach its own peak performance.
The Turbo Core technology is CPU/GPU aware and will adjust based on the state of the x86 cores and the SIMD Radeon Cores. In the above case the CPU will have an increased power budget since the GPU is idle and will be allowed to run at a faster than stock frequency.
When the GPU is running with heavy activity on the system it will be given priority over the CPU, which could actually be limited by the total TDP of the processor and the portion of it being consumed by the GPU component.
With a lightly loaded GPU, the SIMD still gets priority though the CPU will have more than enough TDP headroom to hit its default clock speed or a bit higher. This all depends on how "light" the GPU work load actually is.
In this case, where there are both heavy CPU and GPU loads on the Llano processor, the GPU is still given priority on the TDP. If the states were not adjusted on the processor, then the combined CPU and GPU power request would exceed the total TDP for the chip, which is obviously a problem. Instead, the CPU is artificially lowered in clock speed by the embedded technology in order to keep the chip within the power budget it was built with.
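The GPU-first behavior across these four scenarios can be summarized with a very small sketch. Again, this is only my reading of the diagrams and not AMD’s actual algorithm: the request values are made up, the function name is hypothetical, and the 35W figure is simply one of the mobile TDP targets.

```python
def split_tdp(total_tdp, gpu_request, cpu_request):
    """Serve the GPU first, then give the CPU whatever budget remains."""
    gpu_grant = min(gpu_request, total_tdp)
    cpu_grant = min(cpu_request, total_tdp - gpu_grant)
    return gpu_grant, cpu_grant


TOTAL_TDP = 35  # watts, one of the mobile targets

# (GPU request, CPU request) in watts; all values are illustrative only.
scenarios = {
    "GPU idle, CPU heavy": (2, 30),
    "GPU heavy, CPU light": (25, 5),
    "GPU light, CPU heavy": (10, 30),
    "GPU heavy, CPU heavy": (25, 30),
}

for name, (gpu, cpu) in scenarios.items():
    print(name, split_tdp(TOTAL_TDP, gpu, cpu))
# GPU idle, CPU heavy (2, 30)
# GPU heavy, CPU light (25, 5)
# GPU light, CPU heavy (10, 25)
# GPU heavy, CPU heavy (25, 10)
```

In the last case the CPU asked for 30W and received 10W, which in a real chip would translate into a lower P-state rather than a literal wattage cap.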
What makes this Turbo Core technology both interesting and frustrating is that it is completely independent of the operating system. In fact, our typical frequency monitoring applications like CPU-Z and SiSoft Sandra don’t even show a frequency change from the base clock speed (1.5 GHz in our case on the mobile platform), which makes it hard to see if the technology is even working. The only "proof" we have at this point is performance data that shows how CPU-based applications like CineBench scale from architecture to architecture. More on that later, but note that AMD has promised us a tool soon that will allow consumers to monitor the Turbo Core state of their Llano APU.