As I mentioned before, the Kepler implementation on Tegra K1 is surprisingly close to the design you will find in a GeForce GTX 780 Ti. The SMX unit includes 192 CUDA cores, a unified memory cache, and dedicated acceleration for tessellation, Z culling, and color ROPs. The primary differences between the Tegra and GeForce units are a move from 16 texture units to 8 and from 8 color ROPs to 4.
Communication on the SMX has changed quite a bit, though, so a new on-chip network was needed for a power efficient implementation. The communication routes used on the desktop / discrete Kepler GPUs just wouldn’t work in the SoC, as the complexity of what exists on the desktop can cost a lot of power.
With a Kepler GPU in Tegra K1 you get all the benefits of Kepler automatically. A feature like hardware tessellation doesn’t surprise PC gamers, but for mobile users and developers the feature is new and it is unique to NVIDIA. Tessellation allows you to dynamically generate geometry based on a level of detail variable, usually set by screen position. This can save nearly 50x on triangle generation when compared to OpenGL ES 2.0 software tessellation, and significant performance improvements will be seen in the specific scenarios that take advantage of it. NVIDIA showcased a couple of demos running on the Tegra K1 reference platform, including a terrain map and NVIDIA’s classic Stone Giant demo; both were impressive and ran well.
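To make the level of detail idea concrete, here is a minimal sketch of how a tessellation level might be chosen from camera distance. Everything in it is an assumption for illustration: the function names, the `near`/`far`/`max_level` parameters, and the simple uniform subdivision rule (level n splits a triangular patch into n² triangles) are mine, not NVIDIA's hardware tessellation pattern or API.

```python
def tess_level(distance, near=1.0, far=100.0, max_level=16):
    """Map camera distance to an integer subdivision level (closer = finer).

    near, far and max_level are hypothetical tuning values.
    """
    # Normalize distance to [0, 1] and clamp anything outside the range.
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)
    # Linear falloff; never drop below one triangle per patch.
    return max(1, round(max_level * (1.0 - t)))


def triangles_per_patch(level):
    """Uniformly subdividing a triangle at level n yields n * n triangles."""
    return level * level


if __name__ == "__main__":
    for d in (1.0, 50.5, 100.0):
        lvl = tess_level(d)
        print(f"distance {d:6.1f} -> level {lvl:2d}, "
              f"{triangles_per_patch(lvl)} triangles per patch")
```

The point is that geometry close to the camera gets hundreds of triangles per patch while distant geometry stays coarse, so the application only has to submit the low-detail base mesh.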
Geometry shading is also included with Kepler and can be utilized for cube maps, voxel rendering, and shadow volumes. Bindless textures are supported on the Tegra K1 to allow developers to access textures directly from memory. All of these features are expected on discrete GPUs, but they are impressive additions to mobile graphics.
Tegra K1 supports GPU accelerated path rendering for improved text clarity and fast zooming. This feature has been a part of browsers and Android for some time, but it is good for NVIDIA to be keeping up in these key user experience areas with GPU compute.
In an architecture with a pretty limited memory bus width and low bandwidth, compression of textures and color can mean a lot. Compression isn’t only useful for gaming, either: the Tegra K1 can use it through many stages of the pipeline to improve both the performance and the power efficiency of the platform.
The examples above show how much bandwidth NVIDIA is able to save with the GPU compression of Tegra K1. For mobile devices, saving memory bandwidth translates directly into power savings and longer battery life. Performance benefits won’t likely be seen until K1 is integrated into devices with higher resolution displays, where memory bandwidth could become a bottleneck.
Taking a GPU that currently resides in 200+ watt graphics processors and paring it down to fit into mobile form factors that require a maximum power draw of 2 watts might seem like an impossible task, but NVIDIA was able to accomplish it with Kepler and a long bullet list of features. Rail gating, clock gating, power gating, GPU L2 cache and compression, early Z culling, and optimized interconnects are all at work in Tegra K1 to bring power down.
During briefings NVIDIA gave a specific example of how efficient Kepler can be. Take the GeForce 740M graphics card, which utilizes two SMX units at 19 watts. First, remove 3 watts for IO and memory and 6 watts for leakage from higher voltages, and you are down to 10 watts, or 5 watts per SMX. If you run the voltage at 0.9V rather than the 1.1V implemented and clock down from 1.0 GHz to 500 MHz, then you reach the 2 watt level that K1 needed.
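The arithmetic in that example can be checked with a quick sketch. NVIDIA did not state which scaling model it used, so the code below assumes the textbook CMOS approximation that dynamic power scales with voltage squared times frequency; the function and variable names are mine.

```python
def scaled_dynamic_power(p_watts, v_old, v_new, f_old_hz, f_new_hz):
    """Scale dynamic power with P ~ V^2 * f (an assumed textbook model)."""
    return p_watts * (v_new / v_old) ** 2 * (f_new_hz / f_old_hz)


# Figures quoted in the briefing for the GeForce 740M.
board_w = 19.0     # total power for two SMX units
io_mem_w = 3.0     # removed for IO and memory
leakage_w = 6.0    # removed for leakage from higher voltages
smx_count = 2

# (19 - 3 - 6) / 2 = 5 W per SMX at 1.1 V and 1.0 GHz.
per_smx_w = (board_w - io_mem_w - leakage_w) / smx_count

# Drop to 0.9 V and 500 MHz under the assumed V^2 * f model.
k1_estimate_w = scaled_dynamic_power(per_smx_w, 1.1, 0.9, 1.0e9, 0.5e9)

print(f"per SMX at 1.1 V / 1.0 GHz: {per_smx_w:.2f} W")
print(f"per SMX at 0.9 V / 500 MHz: {k1_estimate_w:.2f} W")
```

Under that assumption the single SMX lands at roughly 1.7 watts, which fits inside the 2 watt envelope quoted for K1.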
NVIDIA demonstrated the GPU efficiency of Kepler in the K1 by comparing the reference platform to the iPhone 5s with the Apple A7 SoC, and to the Sony Xperia Z Ultra with the Qualcomm Snapdragon 800 and its Adreno 330 GPU, in the new GFXBench 3.0. As the name suggests, this graphics test uses OpenGL ES 3.0, and we are looking at the Manhattan 1080p offscreen result below.
In both of the direct comparisons being made, the Tegra K1 is roughly 1.5x more power efficient than the other SoCs at work. At the high performance level the K1 at 1.75W (SoC and memory) runs at the same performance as the iPhone 5s at 2.5W. At 1.5W the K1 performs the same as the Xperia at nearly 2.20W. The obvious issue with these results, other than the fact that they were run and presented by NVIDIA, is that we are only looking at single data points rather than a performance per watt curve. It is easy for a vendor to pick specific use cases where its silicon outperforms the competition, but the ability to do that for all (or most) of the device’s voltage range is much more important.
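The efficiency claim can be reproduced from the quoted wattages. Since each pairing is at matched performance, the efficiency ratio reduces to the ratio of power draws. This is a small sketch using only the figures above; the dictionary layout and function name are mine, and the Xperia figure is treated as exactly 2.20W even though it was described as "nearly" that.

```python
def efficiency_ratio(k1_watts, rival_watts):
    """At equal performance, perf-per-watt advantage = rival power / K1 power."""
    return rival_watts / k1_watts


# (K1 power, rival power) in watts at matched GFXBench 3.0 Manhattan performance.
comparisons = {
    "iPhone 5s (Apple A7)": (1.75, 2.5),
    "Xperia Z Ultra (Adreno 330)": (1.5, 2.2),
}

ratios = {name: efficiency_ratio(k1, rival)
          for name, (k1, rival) in comparisons.items()}

for name, ratio in ratios.items():
    print(f"K1 vs {name}: {ratio:.2f}x")
```

Both ratios come out a little under 1.5x, and they remain two isolated points: nothing here says how the comparison looks at other power levels.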