heterogeneous Uniform Memory Access

AMD shows a little more detail on their upcoming hUMA based products

 

Several years back we first heard AMD’s plans on creating a uniform memory architecture which will allow the CPU to share address spaces with the GPU.  The promise here is to create a very efficient architecture that will provide excellent performance in a mixed environment of serial and parallel programming loads.  When GPU computing came on the scene it was full of great promise.  The idea of a heavily parallel processing unit that will accelerate both integer and floating point workloads could be a potential gold mine in wide variety of applications.  Alas, the promise of the technology did not meet expectations when we have viewed the results so far.  There are many problems with combining serial and parallel workloads between CPUs and GPUs, and a lot of this has to do with very basic programming and the communication of data between two separate memory pools.

CPUs and GPUs do not share common memory pools.  Instead of using pointers in programming to tell each individual unit where data is stored in memory, the current implementation of GPU computing requires the CPU to write the contents of that address to the standalone memory pool of the GPU.  This is time consuming and wastes cycles.  It also increases programming complexity to be able to adjust to such situations.  Typically only very advanced programmers with a lot of expertise in this subject could program effective operations to take these limitations into consideration.  The lack of unified memory between CPU and GPU has hindered the adoption of the technology for a lot of applications which could potentially use the massively parallel processing capabilities of a GPU.

The idea for GPU compute has been around for a long time (comparatively).  I still remember getting very excited about the idea of using a high end video card along with a card like the old GeForce 6600 GT to be a coprocessor which would handle heavy math operations and PhysX.  That particular plan never quite came to fruition, but the idea was planted years before the actual introduction of modern DX9/10/11 hardware.  It seems as if this step with hUMA could actually provide a great amount of impetus to implement a wide range of applications which can actively utilize the GPU portion of an APU.

Click here to continue reading about AMD's hUMA architecture.

hUMA, Slightly More Detailed

The idea behind hUMA is quite simple; the CPU and GPU share memory resources, they are able to use pointers to access data that has been processed by either one or the other, and the GPU can take page faults and not rely only on page locked memory.  Memory in this case is bi-directionally coherent, so coherency issues with data in caches which are later written to main memory will not cause excessive waits for either the CPU or GPU to utilize data that has been changed in cache, but not yet written to main memory.

Current APUs work by partitioning off a chunk of main memory and holding onto it for dear life.  Some memory can be dynamically allocated, depending on the architecture we are dealing with.  Typically upon boot the integrated graphics will partition off a section of memory and keep it for its own.  The CPU cannot address that memory, and in fact it appears gone for all intents and purposes.  hUMA will change this.  The entire memory space will be available to both the CPU and GPU, and they end up sharing this resource just as another CPU with full coherency would with the primary CPU.  This not only applies to the physical memory, but also to the virtual memory space.

Standalone GPUs can benefit from HSA, but not to the extent of an APU.  A dedicated GPU will have its own attached memory as well as shared memory with the main CPU.  Due to the communication latency issues of writing from main memory to the video card’s memory, it is not nearly as seamless as what an APU can accomplish.  It makes sense that such a setup would benefit a solution with a shared memory pool as well as a shared memory controller.  Everything else involves more latency and differing amounts and types of memory.

As seen in the slides, AMD has covered the very high level design features of hUMA.  The first product that will feature this architecture will be the Kaveri based APUs, which will be introduced in 2H 2013.  These are Steamroller based parts with a GCN based graphics portion.  AMD is not giving more specific guidance about when this product will be released, but from all indications it will be more of a Q4 product in terms of availability.  Something of note is that the recently released Kabini processor does not fully utilize hUMA.  Though the APU features the latest Jaguar low power CPU core and the latest generation GCN based graphics portion, it is not a fully hUMA enabled part.  It appears to have the same basic limitations as the previous Llano and current Trinity APUs when it comes to memory allocation and sharing.

The greatest advantage of hUMA is likely that of the ease of programming as compared to current OpenCL and CUDA based solutions.  Often functions have to be programmed twice, once for the GPU and once for the CPU, and then results have to be copied over from the individual memory pools so the other unit can read the results attained by the other.  This is not only a lot of extra work, but the knowledge needed to adequately do this was typically reserved for elite level programmers with a keen understanding of the two different programming models.

Here we can see exactly how performance with CPUs and GPUs have compared in terms of pure GFLOPs.  Parallel computing, while not perfect for every workload, has a lot of potential if programming is implemented effectively.

Serial and parallel workloads can be much more adequately assigned to the hardware units that can address them best.  In heavily parallel loads, a GPU can see a 75% reduction in power usage when compared to a traditional CPU doing the same amount of work.  On the other hand, heavily serial work will utilize a CPU to a much greater extent, and therefore take less time and power to achieve the same result as a GPU trying to compute the same load.  By implementing this very key piece of technology, AMD and its HSA partners are hoping to further heterogeneous computing.  This technology is being shared with members of the HSA group with the hope that it will become the standard for heterogeneous systems, much like AMD-64 became the standard for x86 computing.