Dynamic Parallelism, Hyper-Q, GPUDirect

Kepler Memory Subsystem

The L2 cache on Kepler has been doubled to 1,536 KB and is the primary point of unification between the SMX units and their processing cores. Along with the increase in capacity, the available bandwidth has also doubled, as has concurrency, allowing twice as many cache hits to be processed on every clock.

The memory system on Kepler supports ECC memory in much the same way as Fermi, though there are some changes in how verifications are performed that reduce the performance penalty for enabling error correction.

Atomic memory operations are improved on Kepler and can result in as much as a 10x performance increase for compare-and-swap operations. Atomic memory operations are important for parallel computing because they allow concurrently running threads to read and modify shared data structures without halting other threads. The atomic units themselves have been moved on the die to minimize data transfer during these operations, resulting in a system in which atomic ops are nearly as fast as simple loads.
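To make the compare-and-swap case concrete, here is a minimal CUDA sketch, not anything taken from NVIDIA's materials. It builds a floating-point "atomic max" out of the standard `atomicCAS` intrinsic. The helper name `atomicMaxFloat`, the kernel name `maxKernel` and the test data are our own illustrative choices. The sketch assumes the input contains no NaN values and uses managed memory purely to keep the host code short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: atomically raise *addr to at least `value`.
// atomicCAS operates on integer words, so the float is reinterpreted bit-for-bit.
__device__ float atomicMaxFloat(float *addr, float value) {
    int *addrAsInt = reinterpret_cast<int *>(addr);
    int old = *addrAsInt;
    int assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= value) break;  // stored value is already large enough
        // Write the new value only if nobody changed *addr since we last read it.
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
    } while (old != assumed);                         // another thread won the race: retry
    return __int_as_float(old);
}

// Every thread competes to update the same shared location.
__global__ void maxKernel(const float *data, int n, float *result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMaxFloat(result, data[i]);
}

int main() {
    const int n = 1 << 16;
    float *data = nullptr, *result = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));
    cudaMallocManaged(&result, sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = static_cast<float>((i * 37) % 1000);
    *result = -1.0f;

    maxKernel<<<(n + 255) / 256, 256>>>(data, n, result);
    cudaError_t err = cudaDeviceSynchronize();

    // The largest value in the test data is 999.
    bool ok = (err == cudaSuccess) && (*result == 999.0f);
    printf("max = %.1f (%s)\n", *result, ok ? "ok" : "unexpected");
    cudaFree(data);
    cudaFree(result);
    return ok ? 0 : 1;
}
```

All 65,536 threads hammer a single address here, which is exactly the contended pattern where faster atomic units should matter most. Real code would normally reduce within a warp or block first and issue far fewer atomics.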

A new shuffle instruction has been introduced with Kepler that allows threads running in the same warp to share data by direct read rather than having to perform a store and load operation.

Several shuffle modes exist to allow for different patterns: broadcast, where all threads read one thread’s value; shift, where each thread reads from a thread at a fixed offset; and butterfly (XOR) exchange. Programmers will have to optimize for the shuffle instruction explicitly, though NVIDIA claims a 6% performance increase because warp threads often no longer need to place data in shared memory.
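The three patterns can be sketched in a few lines of CUDA. This is an illustration under assumptions, not NVIDIA's sample code: it uses the `__shfl_sync`, `__shfl_down_sync` and `__shfl_xor_sync` intrinsics from current CUDA toolkits (Kepler-era toolkits exposed the same operations without the `_sync` suffix and mask argument), and the kernel and buffer names are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu  // all 32 lanes of the warp take part

// Butterfly (XOR) reduction: after 5 exchange steps every lane holds the
// sum of all 32 lanes, without touching shared memory.
__device__ int warpSumButterfly(int v) {
    for (int laneMask = 16; laneMask > 0; laneMask >>= 1)
        v += __shfl_xor_sync(FULL_MASK, v, laneMask);
    return v;
}

__global__ void shuffleDemo(int *sumOut, int *bcastOut, int *shiftOut) {
    int lane = threadIdx.x & 31;
    int v = lane + 1;  // lanes hold 1..32

    sumOut[lane]   = warpSumButterfly(v);                   // butterfly XOR
    bcastOut[lane] = __shfl_sync(FULL_MASK, v, 0);          // broadcast lane 0's value
    shiftOut[lane] = __shfl_down_sync(FULL_MASK, v, 1);     // read from lane + 1
}

int main() {
    int *sum = nullptr, *bcast = nullptr, *shift = nullptr;
    cudaMallocManaged(&sum, 32 * sizeof(int));
    cudaMallocManaged(&bcast, 32 * sizeof(int));
    cudaMallocManaged(&shift, 32 * sizeof(int));

    shuffleDemo<<<1, 32>>>(sum, bcast, shift);  // exactly one warp
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int lane = 0; lane < 32; ++lane) {
        // The last lane has no neighbor to read from and keeps its own value.
        int expectShift = (lane < 31) ? lane + 2 : 32;
        ok = ok && sum[lane] == 528        // 1 + 2 + ... + 32
                && bcast[lane] == 1
                && shift[lane] == expectShift;
    }
    printf("shuffle patterns %s\n", ok ? "ok" : "unexpected");
    cudaFree(sum);
    cudaFree(bcast);
    cudaFree(shift);
    return ok ? 0 : 1;
}
```

Before shuffle, the reduction above would have needed a shared-memory buffer and a store, a synchronization and a load for every step.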

While many of you might consider the use of texture units a legacy case for GPGPU computing, there are still many situations in which textures can greatly ease development and improve performance. You might remember the introduction of “bindless” textures with Kepler on GK104; at the time there wasn’t much use for them, though we were promised a basically infinite number of textures for graphics. The same feature is used on GK110 to allow for thousands of texture IDs and for texture data to finally reside in arrays.

Dynamic Parallelism

Dynamic parallelism is a new feature on Kepler that allows the GPU to generate and schedule new workloads for itself, based on programmer parameters, without intervention from the CPU. Fermi and all previous GPUs for HPC scaled very well with large workloads when all the details and information were provided up front, but providing them was always the job of the host computer’s CPU. The GPU portion of the program would run and report results back to the CPU to be handled by the application, and if more work was needed, additional workloads were sent to the GPU with all required parameters.

With Kepler, any kernel can launch another workload and set up the required streams, events and dependencies without any intervention from the CPU. While this might sound simple, it in fact pushes the GPU in the direction of being a “central” processor, with logic in place for queue management, thread reconstruction, prioritization, etc. The immediate impact is that developers can now create recursive and data-dependent algorithms that run solely on the GPU, more efficiently and more quickly. This either frees up the server’s CPU for other tasks or possibly even lets developers select a less powerful (and less expensive) CPU for the job.

This feature should allow a larger variety of parallel workloads to be converted to GPU processing, including those with nested loops or basic serial control tasks. NVIDIA hasn’t really talked yet about the performance of these operations, and the general consensus has been that GPUs are poor replacements for general purpose CPUs when it comes to serial tasks. At what point the power of a 7.1 billion transistor chip can emulate basic CPU tasks is still up in the air.

NVIDIA did offer up one specific scenario that would see benefits:

One example would be dynamically setting up a grid for a numerical simulation – typically grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or “over-spending” compute resources on regions of less interest.

With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can “zoom in” on areas of interest while avoiding unnecessary calculation in areas with little change.

Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. Too coarse and too fine grids are compared to the multi-resolution grid on the right showing the advantage of allowing the GPU to dynamically assign work based on the areas with greatest variation.
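For readers who want to see what this looks like in code, here is a heavily simplified CUDA sketch of the idea, assuming a one-dimensional "grid" and a made-up per-cell `variation` value. It is not NVIDIA's simulation code, and the kernel names, sizes and threshold are all hypothetical. Device-side launches need a GPU of compute capability 3.5 or later and relocatable device code (nvcc's `-rdc=true` option).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kCoarseCells = 64;  // illustrative sizes, not from the article
constexpr int kFinePerCell = 16;

// Child grid: does the "fine" work for one coarse cell.
__global__ void refineCell(int cell, int *fineWork) {
    int sub = threadIdx.x;
    if (sub < kFinePerCell)
        fineWork[cell * kFinePerCell + sub] = 1;  // stand-in for real fine-grid math
}

// Parent grid: one thread per coarse cell decides, from the data itself,
// whether to launch extra work. There is no round trip to the CPU.
__global__ void coarsePass(const float *variation, float threshold, int *fineWork) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < kCoarseCells && variation[cell] > threshold)
        refineCell<<<1, kFinePerCell>>>(cell, fineWork);
}

int main() {
    const int total = kCoarseCells * kFinePerCell;
    float *variation = nullptr;
    int *fineWork = nullptr;
    cudaMallocManaged(&variation, kCoarseCells * sizeof(float));
    cudaMallocManaged(&fineWork, total * sizeof(int));

    // Only cells 24..31 are "interesting" in this made-up data set.
    for (int c = 0; c < kCoarseCells; ++c)
        variation[c] = (c >= 24 && c < 32) ? 1.0f : 0.1f;
    for (int i = 0; i < total; ++i) fineWork[i] = 0;

    // A single host-side launch; the parent is not complete until its children are.
    coarsePass<<<1, kCoarseCells>>>(variation, 0.5f, fineWork);
    cudaError_t err = cudaDeviceSynchronize();

    int refined = 0;
    for (int i = 0; i < total; ++i) refined += fineWork[i];
    bool ok = (err == cudaSuccess) && (refined == 8 * kFinePerCell);
    printf("fine sub-cells processed: %d of %d (%s)\n",
           refined, total, ok ? "ok" : "unexpected");
    cudaFree(variation);
    cudaFree(fineWork);
    return ok ? 0 : 1;
}
```

The host issues one launch and one synchronize. Which cells get refined is decided entirely on the GPU, and only 8 of the 64 coarse cells end up paying for fine-grained work.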

To support Kepler’s ability to launch additional workloads, NVIDIA built a new Grid Management Unit that handles the incoming workloads from the host system as well as the algorithmically created workloads and balances them accordingly.

The CUDA Work Distributor (CWD) actively communicates with the new Grid Management Unit (GMU) so that the GMU can pause the creation of new grids and hold suspended workloads until needed. Each SMX can then communicate back to the GMU for new work to be placed into the queue and re-ordered as required.


Hyper-Q

While in a gaming scenario it is pretty simple to make sure the GPU always has enough data to work on and stay busy, previous generations of GPUs had problems achieving this on the GPGPU side. While the Fermi architecture could support 16 concurrent kernel launches, all of them ended up in the same hardware work queue, essentially putting them in a semi-serial state. The new Kepler GK110 improves this dramatically by offering a hardware-managed CUDA Work Distributor (CWD) with 32 connections (32 work queues). The CWD can run kernels from different CUDA streams (applications or sub-applications) or even from multiple threads within a process, and because the queues are independent, operations in one stream will not delay the execution of another stream.

The Hyper-Q feature will allow NVIDIA GPUs to be better utilized and thus improve efficiency at the application level. It also means that high performance CPUs on the server side can be better utilized instead of being bottlenecked by the single hardware work queue.
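From the programmer's side nothing new is required to benefit: independent work is simply submitted on separate CUDA streams. The sketch below is our own illustration, with a made-up `scale` kernel and arbitrary sizes. It only expresses that the launches are independent; whether they actually overlap on a given GPU depends on the hardware and on how many resources each kernel needs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in for an independent piece of work.
__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int kStreams = 8;   // illustrative; GK110 exposes 32 hardware queues
    const int n = 1 << 12;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    // Set up every stream and buffer before any kernel is launched.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocManaged(&buf[s], n * sizeof(float));
        for (int i = 0; i < n; ++i) buf[s][i] = 1.0f;
    }

    // Each launch goes to its own stream, so no launch has to wait on another.
    for (int s = 0; s < kStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n, float(s + 1));

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int s = 0; s < kStreams; ++s) {
        ok = ok && buf[s][0] == float(s + 1) && buf[s][n - 1] == float(s + 1);
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    printf("multi-stream launch %s\n", ok ? "ok" : "unexpected");
    return ok ? 0 : 1;
}
```

On a single-queue design these eight launches can pick up false dependencies on one another simply because they share a queue. With separate hardware queues, the stream boundaries the programmer wrote are the ones the GPU sees.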


GPUDirect

First introduced last year, NVIDIA GPUDirect gives third-party devices, including NICs and SSDs, access to GPU memory without CPU-side data buffering. The goal is to lower latency for both intra-system and extra-system communications, including server to server, while reducing the load on host memory so that it remains available for other purposes.

We have already seen a few vendors on hand at the GPU Technology Conference to discuss the potential benefits of the technology, including SSD giant Fusion-io.

Closing Thoughts

There is still much for us to learn about the new GK110 chip and how it will perform in the HPC market. With a stated performance level of 1 TFlop of double precision computing power, nearly twice that of the GF100 and GF110 based professional GPUs, there is little doubt that NVIDIA has constructed a performance powerhouse. At what cost is the question, both in terms of die size and yield and in terms of the almighty dollar. We have seen GK104 turn PC gaming on its side with market-leading power efficiency while still being the fastest, and it is possible that in this space NVIDIA’s GK110 will prove to be just as efficient and powerful.

I know many of our readers are wondering what GK110 means for gaming and the potential for a GeForce derivative. As of today NVIDIA is strictly holding GK110 for the Tesla brand and HPC market and in fact hasn’t even really touched on the Quadro product line. The chances of seeing a GK110 enthusiast class GPU seem pretty small with all the commotion about yields and availability of wafer starts at TSMC; NVIDIA would much rather sell a GK110 as a $6,000-10,000 Tesla card than as a $500-1,000 consumer card where reviewers and enthusiasts might revert to the “big and hot” sentiment from Fermi’s first release. However, should AMD suddenly find a new incredibly powerful GPU in a file folder in Austin, TX that they want to release to best GK104, NVIDIA at least has an emergency backup plan with GK110.
