Getting Away from the Cores

The northbridge is just as important as ever, and its improvements are designed to help fully utilize the memory controller for both CPU and GPU accesses.  The northbridge contains the SRI (System Request Interface), the crossbar, the link controller (which connects to the GPU portion via the Fusion Control Link, or FCL), and the memory controller.  The memory controller interfaces with the DRAM controller, while the GPU portion also interfaces directly with the DRAM controller through a 256-bit bi-directional unit.  The FCL is 128 bits wide and allows the GPU direct access to coherent memory space, and it also allows the CPU to access the dedicated frame buffer memory.  This is one of the first implementations of hUMA that we have seen.

PCI-E is the primary data path for most I/O functions.  The southbridge is integrated on Kabini/Temash, and it features eight USB 2.0 ports, two USB 3.0 ports, and two SATA 2G/3G ports.  It supports four x1 PCI-E lanes for further connectivity and one x4 PCI-E link for external graphics.  The display portion features eDP (Embedded DisplayPort) support, which is more efficient than the previous LVDS specification.  The display portion can also drive DisplayPort, HDMI, and even the old VGA output.

The SOC is very tightly knit together so that whatever data is needed is quickly accessed. 

The graphics portion is again based on the GCN architecture that powers the HD 7700/7800/7900 series of graphics cards from AMD.  Kabini and Temash feature two GCN compute units, each of which features four 16-wide vector units and a single scalar unit.  AMD considers this to be “128 Radeon Cores”, and it is a significant improvement over the previous 80-unit VLIW5 architecture used in Brazos.  Not only do we see increased performance in 3D applications, but we also see a nearly 75% increase in GPGPU applications with this newer architecture.  There are no significant changes to this unit compared to those powering the video cards released last year.  The GPU can support up to two monitors in Eyefinity at resolutions of up to 4K x 2K.  It also contains the same VCE and UVD units as the larger chips.
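The "128 Radeon Cores" figure falls straight out of the unit counts above, and a few lines of Python make the arithmetic explicit.  This is only a sketch: the lane counts come from the text, but the 500 MHz clock is a hypothetical placeholder rather than a quoted specification, and the peak-throughput helper assumes each lane can retire one fused multiply-add (two floating point operations) per clock.

```python
# Unit counts taken from the article text.
COMPUTE_UNITS = 2      # GCN compute units in Kabini/Temash
SIMDS_PER_CU = 4       # vector units per compute unit
LANES_PER_SIMD = 16    # each vector unit is 16 lanes wide


def shader_lanes(cus=COMPUTE_UNITS, simds=SIMDS_PER_CU, lanes=LANES_PER_SIMD):
    """Total vector lanes, which AMD markets as 'Radeon Cores'."""
    return cus * simds * lanes


def peak_gflops(lanes, clock_mhz, flops_per_lane_per_clock=2):
    """Theoretical single precision peak in GFLOPS.

    Assumes one fused multiply-add (2 FLOPs) per lane per clock.
    """
    return lanes * flops_per_lane_per_clock * clock_mhz / 1000.0


HYPOTHETICAL_CLOCK_MHZ = 500.0  # placeholder for illustration only

radeon_cores = shader_lanes()
example_peak = peak_gflops(radeon_cores, HYPOTHETICAL_CLOCK_MHZ)

print(radeon_cores)   # 128
print(example_peak)   # 128.0 at the placeholder clock
```

Swapping in the actual GPU clock of a given SKU gives that part's theoretical peak, which real workloads will of course fall short of.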

The graphics unit features a single Render Back End, which is composed of four color ROPs or 16 Z/stencil ops.  Each SIMD unit has 64 KB of registers and the scalar unit has 8 KB of registers.  The GPU contains a fairly large 128 KB L2 read/write cache for both 3D functions and GPGPU.  Each CU contains a 16 KB vector data L1 cache, a 16 KB scalar data L1 cache, and a 32 KB instruction L1 cache.  Unlike most pre-DX9 based units, modern GPUs have a significant amount of cache to improve performance and functionality.

Ryan has written extensively about GCN and its ability in both 3D and GPGPU applications.  A full overview here would be another article by itself.  The increased flexibility of the GCN architecture will help deliver some impressive numbers when it comes to GFLOPS.

Power saving is of course the ultimate goal with this particular design.  It needs to be quick, but it also needs to be efficient.  Power gating, throttling, and sleep modes all help to improve the power envelope of these parts without massively impacting overall performance.  AMD mentioned at their Editor’s Day that if all power savings were turned off on these parts, battery life would be around an hour and plastic would start to melt off the casing.  Obviously these are bad things.

First of all, AMD balances the power budget of the individual cores depending on the workload.  If something is heavily single threaded, then one core is given the lion’s share of power while the other three cores are powered down.  There are individual power monitors for the CPU, the GPU, the display interface, and the FCH.  The Turbo Core Manager uses all of this data to calculate temperatures and power draw throughout the design, and to decide which units receive how much power and how high they can be clocked.  So while the entire design might be rated for 15 watts, in some instances the CPU portion might be using four to five watts while the other components share the remaining 10 watts.  In others the GPU might be taking up to half of the available power while everything else is clocked and gated down.
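To make the idea of a shared power budget concrete, here is a minimal sketch of a proportional allocator.  It is not AMD's Turbo Core algorithm, which the article does not detail; the function name, the unit names, and the per-unit demand figures are all hypothetical, and the only number taken from the text is the 15 watt rating.

```python
def allocate_power(budget_w, demand_w):
    """Split a fixed power budget across units (hypothetical helper).

    If the combined demand fits inside the budget, every unit gets what
    it asked for.  Otherwise each request is scaled down by the same
    factor so that the total exactly matches the budget.
    """
    total = sum(demand_w.values())
    if total <= budget_w:
        return dict(demand_w)
    scale = budget_w / total
    return {unit: watts * scale for unit, watts in demand_w.items()}


TDP_W = 15.0  # overall rating mentioned in the text

# Illustrative demands in watts; these are made-up numbers, not measurements.
cpu_heavy = allocate_power(
    TDP_W, {"cpu": 5.0, "gpu": 2.0, "display": 1.0, "fch": 1.0}
)
oversubscribed = allocate_power(
    TDP_W, {"cpu": 10.0, "gpu": 10.0, "display": 5.0, "fch": 5.0}
)

print(cpu_heavy)       # demand fits, so nothing is scaled back
print(oversubscribed)  # every request is halved to stay within 15 W
```

A real power manager would also weigh temperature readings and per-unit clock limits, as described above; the sketch only captures the notion that the units draw from one fixed pool.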

In terms of temperature, AMD has created a design that takes advantage of heat spreading.  While heat in a chip primarily goes “up” to the heatsink, some of it also spreads across the die.  If the GPU is clocked down, it can absorb some heat from the CPU cores if they are all active.  The opposite is true as well.  By using a fine grained approach to power usage and performance, AMD can better control the thermals of the chip and increase overall efficiency and performance.


Wrapping it Up

Jaguar is a massive redesign of the Bobcat architecture that improves IPC in single threaded applications by up to 20% and increases the top clockspeed while retaining the same thermal characteristics as the earlier part.  It also goes from a dual core part to a true quad core product.  Graphics performance has increased dramatically with the GCN architecture in place, and further power and thermal optimizations allow the product to run at faster speeds.

This product is intended for the sub-5 watt market all the way up to 15 watts.  It is arguably the most feature packed product in this category, and the performance is class leading as well.  The core size is around the same as that of the latest ARM Cortex A15, but it should perform better per clock than that particular product.  We should not get overly excited, though, as the Cortex A15 is still more energy efficient due to the low power oriented ARM ISA.  Having said that, Temash at 3.9 to 5 watts should give a better user experience in a tablet than a comparable A15 based SOC.  We personally played DiRT Showdown on a Temash tablet and the experience was impressive.  Intel has not yet released their 22 nm based Silvermont products, so AMD has a nice window of opportunity to ship these advanced, low power units with next generation graphics performance and technology.

The original Bobcat architecture was very forward looking at the time, and Jaguar has built on it, improving performance without sacrificing power consumption.  The addition of GCN based graphics also opens the door to greater performance in workloads that can actually utilize hUMA based architectures and graphics based parallel computing.  Jaguar looks to be an impressive update and should allow AMD to move aggressively into the tablet market, not to mention maintain its presence in ultra thin-and-light and budget notebooks.


I would highly encourage everyone to read over our two complementary articles launching today:

  1. AMD 2013 Mobile Platforms – Temash, Kabini and Richland
  2. Performance Preview – AMD Kabini Reference Notebook Tested

With those stories and this one, you should have a solid understanding of where AMD stands today in these various mobile APU markets.  Despite its continued lagging performance in the x86 space, I truly believe that AMD has responded well, and responded quickly, to the need for an ultra-mobile product with Temash and Kabini.
