Kaveri Tech, Continued
AMD did not talk about a lot of the internal plumbing with Kaveri. We were able to glean a minimal amount of information, but I stress that it is minimal.
The Onion (coherent) bus and Garlic (non-coherent) bus were both improved over the previous generation of products. AMD did not go into great detail other than to say that bandwidth is improved. Low-level changes had to be made to these buses to support HSA, but again details were left out. The memory controller was also massively reworked to support the shared memory architecture and to provide more performance and efficiency when dealing with such loads. It also appears to support memory speeds up to DDR3-2400 right out of the box.
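To put the DDR3-2400 figure in context, here is a rough back-of-the-envelope calculation of theoretical peak memory bandwidth. The 64-bit channel width is standard for DDR3; the dual-channel configuration is my assumption, since AMD's briefing material quoted here does not spell it out, and real-world throughput will be lower than these peaks.

```python
def ddr_peak_bandwidth_gbs(transfers_mt_s, bus_width_bits=64, channels=2):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second).

    bus_width_bits=64 is the standard DDR3 channel width.
    channels=2 is an assumption (dual-channel), not a figure from AMD.
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


# Compare common DDR3 speed grades against the DDR3-2400 ceiling.
for speed in (1600, 1866, 2133, 2400):
    print(f"DDR3-{speed}: {ddr_peak_bandwidth_gbs(speed):.1f} GB/s peak")
```

Since the CPU cores and the GCN units share this one pool of bandwidth, faster memory matters more on an APU than it does on a CPU paired with a discrete card.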
Kaveri now officially supports PCI-E 3.0. This feature was actually designed into Trinity/Richland, but AMD did not spend the time or money to certify those parts for that specification. When I spoke with them last year about this they simply said it was not really worth it considering the marketplace they were focusing on. To a great degree, this is likely true. Trinity/Richland were far less likely to need the high speed interconnects that PCI-E 3.0 offered when it came to RAID controllers, PCI-E SSDs, or graphics cards. Now that it is 2014, AMD has marked off the PCI-E 3.0 checkbox for its OEM partners and has opened the door for future, higher performing FX processors utilizing the FM2+ socket infrastructure.
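For a sense of what the PCI-E 3.0 checkbox is worth, the sketch below works out per-lane and x16 bandwidth from the published signaling rates and line encodings of PCI-E 2.0 (5 GT/s, 8b/10b) and PCI-E 3.0 (8 GT/s, 128b/130b). The function name is mine, and the numbers are per-direction theoretical maximums before protocol overhead.

```python
def pcie_lane_bandwidth_mbs(gt_per_s, payload_bits, encoded_bits):
    """Usable bandwidth of one lane in one direction, in MB/s.

    Each transfer carries one bit on the wire; the line encoding
    decides how many of those bits are actual payload.
    """
    return gt_per_s * 1e9 * payload_bits / encoded_bits / 8 / 1e6


gen2 = pcie_lane_bandwidth_mbs(5.0, 8, 10)      # PCI-E 2.0: 8b/10b encoding
gen3 = pcie_lane_bandwidth_mbs(8.0, 128, 130)   # PCI-E 3.0: 128b/130b encoding

for name, per_lane in (("PCI-E 2.0", gen2), ("PCI-E 3.0", gen3)):
    print(f"{name}: {per_lane:.0f} MB/s per lane, {per_lane * 16 / 1000:.1f} GB/s at x16")
```

The raw signaling rate only rises by 60%, but the leaner encoding is what pushes usable bandwidth to nearly double.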
GCN in Kaveri
Graphics Core Next is the name of the graphics architecture from AMD that was first introduced in early 2012 with the HD 7000 series of parts. It was designed from the outset to be very efficient and highly programmable. It also turned out to be very powerful. The GCN portions included in Kaveri are nearly identical to those in the latest “Hawaii” based R9 290 graphics cards.
Each GCN compute core features 4 x 16 wide vector units, a single scalar unit, plenty of cache, associated texture and texture store units, as well as the scheduler. A total of 128 flops/clock can be achieved with each compute core, so that adds up rather quickly when there are multiple compute cores running at 720 MHz. The big improvement is the addition of the shared, coherent unified memory feature that is the foundation of HSA.
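The 128 flops/clock figure falls straight out of the vector unit layout, and a quick sketch shows how it scales. I am assuming each lane retires one fused multiply-add per clock (counted as two floating point operations), and the count of eight compute cores used in the example is my assumption for a top-end Kaveri part rather than something stated above.

```python
def cu_flops_per_clock(simd_units=4, lanes_per_simd=16, flops_per_lane=2):
    """Single precision flops per clock for one GCN compute core.

    flops_per_lane=2 assumes one fused multiply-add per lane per clock.
    """
    return simd_units * lanes_per_simd * flops_per_lane


def peak_gflops(compute_units, clock_mhz):
    """Theoretical peak GFLOPS for a given number of compute cores."""
    return compute_units * cu_flops_per_clock() * clock_mhz * 1e6 / 1e9


print(cu_flops_per_clock(), "flops/clock per compute core")
# compute_units=8 is an assumed configuration, not a number from this page.
print(f"{peak_gflops(8, 720):.1f} GFLOPS peak at 720 MHz")
```

These are theoretical peaks; sustained throughput depends heavily on memory bandwidth and how well a workload keeps the vector units fed.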
A total of two RBEs are included in Kaveri, which gives it a fairly decent pixel fill rate as compared to previous integrated parts. This gives a total of 8 color ROPs and 32 stencil/Z ROPs. I believe this is double that of previous products from AMD and Intel.
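As a rough illustration of what those ROP counts mean, the sketch below estimates theoretical fill rates. It assumes the RBEs run at the same 720 MHz clock quoted for the compute cores and that each ROP completes one operation per clock; neither assumption was confirmed by AMD.

```python
def fill_rate_giga_per_s(rop_count, clock_mhz):
    """Theoretical fill rate in billions of operations per second.

    Assumes one operation per ROP per clock (an idealized best case).
    """
    return rop_count * clock_mhz * 1e6 / 1e9


RBES = 2
COLOR_ROPS = 4 * RBES    # 8 color ROPs in total, per the figures above
DEPTH_ROPS = 16 * RBES   # 32 stencil/Z ROPs in total

# 720 MHz is assumed to apply to the RBEs as well as the compute cores.
print(f"Color fill: {fill_rate_giga_per_s(COLOR_ROPS, 720):.2f} Gpixels/s")
print(f"Z/stencil:  {fill_rate_giga_per_s(DEPTH_ROPS, 720):.2f} Gsamples/s")
```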
Kaveri also supports Mantle. This should provide a nice boost in overall performance in games that adopt Mantle. While Battlefield 4 will support Mantle “soon”, initial results showed approximately a 45% increase in frames per second from the standard Direct3D version to Mantle. We also saw a few other native Mantle implementations that produced impressive performance results due to the reduced draw call overhead in complex scenes.
APUs are seemingly chock full of accelerators. These are individual units which are aimed at accelerating specific workloads. It is far more efficient in these cases to design and implement an accelerator than to utilize the multi-core CPU or the GCN architecture to handle that workload. This saves on power to a very great degree, all the while minimizing the die footprint of such a unit.
Kaveri includes a significant new accelerator in the TrueAudio unit. This unit contains multiple DSPs to accelerate certain audio features to improve sound quality and 3D immersion. A handful of games will be coming out in 1H 2014 that will natively support TrueAudio. If I were to characterize this part, I would say that it is very similar to what Aureal tried to accomplish with their A3D 2.0 implementation. It is also a step above what Creative has done with their latest EAX 5.0 based specification.
The real kicker here is that even though Aureal won the lawsuit brought against them by Creative, they spent so much in legal fees that it essentially bankrupted the company. Creative then swooped in and bought their IP. Then they sat on it and did absolutely nothing while relying on EAX to push good 3D audio to users. That was an absolute failure.
AMD is trying to make 3D audio relevant again by introducing their TrueAudio unit. Having this native to every Kaveri APU shipped will likely help push the specification and support further than if they released a standalone sound card embracing that functionality.
TrueAudio has uses outside of gaming applications. Nuance is developing a noise reduction addition to their software that will utilize the TrueAudio DSP to accelerate operations for them. This unit apparently is quite programmable and can be used for a variety of applications.
Video decode and encode units are the two primary accelerators that have been included in APUs since day one. The VCE 2 (Video Coding Engine) is a highly upgraded unit as compared to VCE 1. We can see in the slide below the changes between the two.
UVD 4 is the latest iteration of the Unified Video Decoder that was introduced many generations ago with AMD graphics cards. The only change this time around is improved error resiliency. When something is poorly encoded and contains errors, the UVD unit will no longer lock up and keep showing the last good frame while the audio moves forward. Instead, the corrupted frame will be skipped and the video will continue in sync with the audio.
AMD does not have an H.265 decoder yet, but the codec will be supported through the GCN units. This does expend more power than a more focused accelerator, but those hard coded accelerated units do take time to design and implement. The flexibility of the GCN architecture allows it to do work such as H.265 decode without maxing out the CPUs to keep up with the workload.
Kaveri finally fulfills the promise of a true Heterogeneous System Architecture. The shared memory space and addressing (hUMA), the ability for the GCN units to handle and assign threads as needed (hQ), and a growing software and programming ecosystem that can take advantage of the potential horsepower offered by this APU are working together to maximize the potential of this architecture.
Code complexity with HSA will diminish significantly. The use of shared memory and pointers allows the CPU and GPU to access the same memory without having to copy data from CPU memory to GPU memory and vice versa. Programming tools are also either available or are being developed to support HSA so that programmers do not have to veer too far away from what they are comfortable with. Java is the big target for AMD right now due to how many current applications are based on that language. They are working closely with Oracle to make sure that Java supports HSA at a very low level. This past year Oracle joined up with the HSA Foundation.
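To illustrate the difference in programming model, here is a toy sketch in plain Python. It is not HSA code and does not touch a GPU; the function names are hypothetical, and the "device" is just a second buffer in the same process. The point is only to contrast a flow that stages data into a separate buffer and copies it back with one that hands the worker a view of the very same memory.

```python
import array


def scale_with_copies(host_data, factor):
    """Discrete-style flow: copy in, compute on the copy, copy back.

    Returns the number of bytes moved between the two buffers.
    """
    device = array.array("f", host_data)   # "host to device" copy
    for i in range(len(device)):
        device[i] *= factor
    host_data[:] = device                   # "device to host" copy
    return 2 * len(host_data) * host_data.itemsize


def scale_shared(shared_data, factor):
    """Shared-memory flow: the worker gets a view of the caller's buffer.

    Returns the number of bytes copied, which is always zero here.
    """
    with memoryview(shared_data) as view:   # a view, not a copy
        for i in range(len(view)):
            view[i] *= factor
    return 0


a = array.array("f", [1.0, 2.0, 3.0])
b = array.array("f", [1.0, 2.0, 3.0])
print("copy-based flow moved", scale_with_copies(a, 2.0), "bytes")
print("shared flow moved", scale_shared(b, 2.0), "bytes")
```

Both paths produce the same result, but the second never duplicates the data. On real hardware the copy cost grows with the size of the data set, which is why removing it is such a large part of the HSA pitch.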
The flexibility of HSA was also mentioned above. New codecs such as H.265/HEVC are not supported with current accelerators, but can be accelerated through OpenCL. This will be true for other upcoming standards that do not yet have accelerators designed for them, or would run more efficiently on massively parallel units rather than multi-core CPUs.
HSA is supported through software like OpenCL or C++ AMP, but some of the low level OS routines will not catch up to HSA for a while. Linux will be receiving such updates first, but it will still be a couple of years down the road after HSA is officially ratified by the Foundation.
Kaveri: A Leap
We have been learning about Kaveri for years now. Few of the details have been hidden from us, and certainly not for long. Processor Forums, editor’s days, leaks, and investor meetings have taught us a lot about what AMD wants to do with their APUs. Their goals are pretty lofty, but there is a lot of momentum swinging towards heterogeneous systems. ARM is pushing that way, NVIDIA has a big stake in GPGPU, and even Intel is pushing massively parallel computing (though in a different way).
Kaveri is a complex and potentially groundbreaking part. One of the really big strengths of the chip is that a user does not lose the performance or functionality of the graphics portion when using a separate video card. This could potentially have a big impact on applications which can leverage that piece of silicon. Think of games with lots of AI and physics computations being done on the APU while the graphics card handles only the tessellation, geometry, vertex, and pixel shading. Running AI and physics on an APU with shared memory is far more efficient than running them on a standalone card with its own memory. Things like collision and interaction will be faster and more organic because a program can utilize the same memory space for the CPU and GPU portions of the APU.
Ryan now takes over with the hard numbers on this APU and we get to hear his impressions of the architecture after testing it for the past few days.