A new architecture with GP104
We have a review of the GeForce GTX 1080 Founders Edition for you. It’s the new king. Get in here and read it. Now.
Table of Contents
- Asynchronous compute discussion
- Is only 2-Way SLI supported?
- Overclocking over 2.0 GHz
- Dissecting the Founders Edition
- Benchmarks begin
- VR Testing
- Impressive power efficiency
- Performance per dollar discussion
- Ansel screenshot tool
The summer of change for GPUs has begun with today’s review of the GeForce GTX 1080. NVIDIA has endured leaks, speculation and criticism for months now, with enthusiasts calling out NVIDIA for not including HBM technology or for not having asynchronous compute capability. Last week NVIDIA’s CEO Jen-Hsun Huang went on stage and officially announced the GTX 1080 and GTX 1070 graphics cards with a healthy amount of information about their supposed performance and price points. Issues around cost and what exactly a Founders Edition is aside, the event was well received and clearly showed a performance and efficiency improvement that we were not expecting.
The question is, does the actual product live up to the hype? Can NVIDIA overcome some users’ negative view of the Founders Edition and deliver a product message that convinces the wide range of PC gamers looking for an upgrade path to take this option?
I’ll let you know through the course of this review, but what I can tell you definitively is that the GeForce GTX 1080 clearly sits alone at the top of the GPU world.
GeForce GTX 1080 Specifications
Much of the information surrounding the specifications of the GTX 1080 was revealed last week with NVIDIA’s “Order of 10” live stream event. There are some more details we can now add about clock speeds that should paint a very interesting picture of where NVIDIA has gone with the GTX 1080 and GP104 GPU.
There are two direct comparisons worth looking at with the GeForce GTX 1080. Both the GTX 980 and the GTX 980 Ti are competitors to the GTX 1080 – the GTX 980 in terms of GPU-specific placement and the GTX 980 Ti in terms of “king of the hill” single GPU consumer graphics card performance leadership. (The Titan X is obviously faster than the 980 Ti, but not by much, and its price tag puts it in a different class.)
With 2560 CUDA cores, the GTX 1080 has 10% fewer than the GTX 980 Ti but 25% more than the GTX 980. Those same ratios apply to the texture units on the cards as well, though the GTX 980 and GTX 1080 both are configured with 64 raster operators (ROPs). The GTX 980 Ti has 96 ROPs, an increase of 50%. Despite the modest advances the new GTX 1080 has over the GTX 980, and the supposed deficit it has when compared to the GTX 980 Ti, this new card has something else on its side.
The GTX 1080 will have a base clock speed of 1607 MHz and a rated Boost clock of 1733 MHz! The base clock is 60% higher than the GTX 980 Ti and 42% higher than the GTX 980 and that is clearly where the new GP104 GPU gets so much of its performance.
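The percentages quoted above are easy to check with a few lines of arithmetic. The sketch below assumes the public reference specifications for the older cards (2816 CUDA cores and a 1000 MHz base clock for the GTX 980 Ti, 2048 cores and 1126 MHz for the GTX 980). Those figures are not spelled out in the text here, so treat them as assumptions.

```python
# Reference specs. The GTX 980 Ti and GTX 980 numbers are assumed from
# public spec sheets; only the GTX 1080 numbers come from the text.
SPECS = {
    "GTX 1080": {"cuda_cores": 2560, "base_mhz": 1607},
    "GTX 980 Ti": {"cuda_cores": 2816, "base_mhz": 1000},
    "GTX 980": {"cuda_cores": 2048, "base_mhz": 1126},
}


def percent_change(new: float, old: float) -> float:
    """Return how much larger (positive) or smaller (negative) new is than old, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    new = SPECS["GTX 1080"]
    for name in ("GTX 980 Ti", "GTX 980"):
        old = SPECS[name]
        cores = percent_change(new["cuda_cores"], old["cuda_cores"])
        clock = percent_change(new["base_mhz"], old["base_mhz"])
        print(f"vs {name}: cores {cores:+.1f}%, base clock {clock:+.1f}%")
```

Under those assumed specs, the results land close to the rounded figures used in the text: a small core deficit and a large clock advantage against the GTX 980 Ti, and gains on both counts against the GTX 980.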
A quick glance at the memory specifications indicates that the move to GDDR5X (G5X) has helped NVIDIA increase performance here as well. With just a 256-bit memory bus the GTX 1080 produces 320 GB/s of bandwidth via a 10 Gbps / 5.0 GHz speed, outpacing the GTX 980 by 42% yet again. The GTX 980 Ti and Titan X do have higher total memory throughput though, with 384-bit buses rated at 336 GB/s, but NVIDIA has made improvements in the compression algorithms with Pascal that should increase effective bandwidth even above that.
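For readers who want to see where those bandwidth numbers come from, peak memory bandwidth is simply the bus width multiplied by the per-pin data rate, divided by eight to convert bits to bytes. This is a minimal sketch; the function name is made up for illustration, and the 7 Gbps GDDR5 rate used for the older cards is an assumption based on their public specs.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: every pin on the bus moves data_rate_gbps gigabits per second."""
    return bus_width_bits * data_rate_gbps / 8.0  # 8 bits per byte


if __name__ == "__main__":
    # GTX 1080 figures come from the text; the 7 Gbps rate for the
    # GTX 980 and GTX 980 Ti is assumed from public specifications.
    cards = {
        "GTX 1080 (256-bit, 10 Gbps)": (256, 10.0),
        "GTX 980 (256-bit, 7 Gbps)": (256, 7.0),
        "GTX 980 Ti (384-bit, 7 Gbps)": (384, 7.0),
    }
    for name, (width, rate) in cards.items():
        print(f"{name}: {memory_bandwidth_gbs(width, rate):.0f} GB/s")
```

The same formula explains why a narrower 256-bit bus can get so close to a 384-bit one: the faster per-pin rate makes up most of the difference.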
The first consumer GPU we have seen built on the 16nm (or 14nm) FinFET process consists of 7.2 billion transistors but only has a rated TDP of 180 watts. That is slightly higher than the GTX 980 (165 watts) but significantly lower than the GTX 980 Ti (250 watts). After looking at performance results I think you’ll be impressed with the performance/watt efficiency improvements that NVIDIA has made with Pascal, despite the increased transistor count and clock speeds.
Pascal and GP104 Architecture – How we got the GTX 1080
How does the GTX 1080 get this level of clock speed improvement and performance uptick over Maxwell? Pascal combines a brand new process technology and a couple of interesting architecture changes to achieve the level of efficiency we see today.
One interesting change visible in the block diagram above is a shift to embedding five SMs (streaming multiprocessors) into a single GPC (Graphics Processing Cluster). This changes the processing ratios inside the GPU when compared to Maxwell, which had four SMs for each GPC. Essentially, this modification puts more shading horsepower behind each of the raster engines of the GPC, a balance that NVIDIA found to be an improvement for the shifting workloads of games.
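To make the SM-to-GPC ratio concrete, here is a rough sketch that derives the SM and GPC counts from the CUDA core totals. It assumes 128 CUDA cores per SM on both GM204 and GP104, a figure that is not stated in the text, and the helper name is hypothetical.

```python
from typing import Tuple

CORES_PER_SM = 128  # assumed for both GM204 and GP104; not stated in the text


def gpu_layout(cuda_cores: int, sms_per_gpc: int) -> Tuple[int, int]:
    """Return (SM count, GPC count) for a fully enabled GPU."""
    sms, remainder = divmod(cuda_cores, CORES_PER_SM)
    if remainder:
        raise ValueError("core count is not a whole number of SMs")
    gpcs, remainder = divmod(sms, sms_per_gpc)
    if remainder:
        raise ValueError("SM count is not a whole number of GPCs")
    return sms, gpcs


if __name__ == "__main__":
    # GTX 1080 (GP104): 2560 cores, five SMs per GPC.
    print("GP104:", gpu_layout(2560, sms_per_gpc=5))
    # GTX 980 (GM204): 2048 cores, four SMs per GPC.
    print("GM204:", gpu_layout(2048, sms_per_gpc=4))
```

If the per-SM assumption holds, both chips end up with the same number of GPCs, and therefore raster engines, while GP104 hangs more SMs off each one.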
16nm FinFET Improvements and Challenges
The first and easily most important change to Pascal is the move away from the 28nm process technology that has been in use for consumer graphics cards since the introduction of the GeForce GTX 680 back in March of 2012. Pascal and GP104 are built around the 16nm FinFET process from TSMC and with it come impressive improvements in power consumption and performance scaling.
A comment on YouTube properly summed up this migration in a way that I think is worth noting here.
Using Intel parlance, Pascal is a tick and tock in the same refresh (making up for Kepler>Maxwell being no tick and half a tock), so it's understandable that it's blowing the doors off the 980 that it replaces. -Jeremiah aka Critical Hit
Who knew such interesting commentary could come from YouTube, right? But it is very much the case that the GPU industry had some “pent up” ability to scale that was being held back by the lack of a step between 28nm and 16/14nm process nodes. (20nm just didn’t work out for all parties involved.) Because of it, I think most of us expected Pascal (and in theory AMD’s upcoming Polaris architecture) to show accelerated performance and efficiency with this generation.
Migrating from 28nm to 16nm FinFET is not a simple copy and paste operation. As NVIDIA’s SVP of GPU Engineering, Jonah Alben, stated at the editor’s day earlier this month, “some fixes that helped with 28nm node integration might actually degrade and hurt performance or scaling at 16nm.” NVIDIA’s team of engineers and silicon designers worked for years to dissect and perfect each and every path through the GPU in an attempt to improve clock speed. Alben told us that when Pascal engineering began optimization, the Boost clock was in the 1325 MHz range, limited by the slowest critical path through the architecture. With a lot of work, NVIDIA increased the speed of the slowest path to enable the 1733 MHz Boost clock rating they have on the GTX 1080 today.
Optimizing to this degree allows NVIDIA to increase clock speeds, increase CUDA core counts and increase efficiency on GP104 (when compared to GM204), all while shrinking the die size from 398 mm² to 314 mm².
Simultaneous Multi-Projection - A new part of the PolyMorph Engine
The only true addition to the GPU architecture itself is the inclusion of a new section to the PolyMorph Engine, now branded as version 4.0. The Simultaneous Multi-Projection block is at the end of the geometry portion of the pipeline but before the rasterization step. This block creates multiple projection schemes from a single geometry stream, up to 16 of them, that share a single viewpoint. I will detail the advantages that this feature will offer for gamers in both traditional and VR scenarios, but from a hardware perspective, this unit provides impressive functionality.
Software will be able to tell Pascal GPUs to replicate geometry in the stream up to 32 times (16 projections x 2 projection centers) without overhead affecting the software as that geometry flows through the rest of the GPU. All of this data stays on chip and is hardware accelerated, and any additional workload that would go into setup, OS handling or geometry shading is saved. Obviously all of the rasterized pixels that are created by the multiple projections will have to be shaded, so that compute workload won’t change, but in geometry heavy situations the performance improvements are substantial.
Displays ranging from VR headsets to multiple-monitor Surround configurations will benefit from this architectural addition.
Updated Memory - GDDR5X and New Compression
If you thought the 28nm process on the GTX 980 and GM204 was outdated, remember that GDDR5 memory was first introduced in 2009. That is what made AMD’s move to HBM (high bandwidth memory) with the Fiji XT GPU so impressive! And while NVIDIA is using HBM2 for the GP100 GPU used in high performance computing applications, the consumer-level GP104 part doesn’t follow that path. Instead, the GTX 1080 will be the first graphics card on the market to integrate GDDR5X (G5X).
GDDR5X was standardized just this past January by JEDEC so it’s impressive to see an implementation this quickly with GP104. Even though the implementation on this GPU runs at 5.0 GHz, quite a bit slower than the 7.0 GHz the GTX 980 runs at with GDDR5 (G5), it runs at double the data rate, hitting 10 Gbps of transfer. The result is a total bandwidth rate of 320 GB/s with a 256-bit bus.
NVIDIA talked quite a bit about the design work that went into getting a GDDR5X memory bus to operate at these speeds, throwing impressive comparisons around. Did you know that NVIDIA’s new memory controller has only about 50 ps (picoseconds) to sample data coming at this speed, a time interval that is so small, light can only travel about half an inch in its span? Well now you do.
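That light-travel comparison is easy to sanity check. The sketch below uses the defined speed of light in a vacuum and also works out the time a single bit occupies on a 10 Gbps pin. The function names are made up for illustration, and the sketch makes no claim about how NVIDIA arrived at its 50 ps figure.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by definition, in a vacuum
METERS_PER_INCH = 0.0254


def light_travel_inches(seconds: float) -> float:
    """Distance light covers in a vacuum during the given time, in inches."""
    return SPEED_OF_LIGHT_M_PER_S * seconds / METERS_PER_INCH


def bit_time_ps(data_rate_gbps: float) -> float:
    """Time one bit occupies on a single pin, in picoseconds."""
    return 1000.0 / data_rate_gbps


if __name__ == "__main__":
    print(f"Light in 50 ps: {light_travel_inches(50e-12):.2f} in")
    print(f"Bit time at 10 Gbps: {bit_time_ps(10.0):.0f} ps")
```

Light covers a little over half an inch in 50 ps, so the comparison holds up, and the sampling window is only a fraction of the time each bit spends on the wire.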
I am not underselling the amount of work the memory engineers at NVIDIA went through to implement G5X at these speeds, including the board and channel design necessary to meet the new tolerances. Even better, NVIDIA tells us that the work they put into the G5X integration on the GTX 1080 will actually improve performance for the GTX 1070 with G5 memory.
NVIDIA has also improved on the memory compression algorithms implemented on the GPU to improve effective memory bandwidth through the product. Compressing data with a lossless algorithm as it flows inside the GPU and in and out of GPU memory reduces the amount of bandwidth required for functionality across the board. It’s an idea that has been around for a very long time, though as algorithms improve, we see it as an additive change to GPU memory interface performance.
Maxwell introduced a 2:1 delta color compression design that looked at pixel color values in a block and stored them in as few fixed values as possible, using offsets from those fixed values to lower the size of the data to be stored. Pascal improves on the 2:1 ratio algorithm to enable it to be utilized in more situations, but also adds support for 4:1 and 8:1 options. The 4:1 algorithm looks for blocks where the pixel changes are much smaller and the offsets can be represented with even less data. And if you are lucky enough to utilize the 8:1 algorithm, it combines the 4:1 option with the 2:1 to look for blocks that share enough data that they can be compressed against each other.
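The general idea behind delta compression can be shown with a toy example. To be clear, this is not NVIDIA's algorithm: the block size, the single 8-bit channel, the choice of the first pixel as the fixed value, and every function name here are assumptions made purely for illustration, and the cross-block 8:1 mode is not modeled at all.

```python
from typing import List, Tuple


def delta_encode(block: List[int]) -> Tuple[int, List[int]]:
    """Split a block of 8-bit values into an anchor and per-pixel offsets."""
    anchor = block[0]
    return anchor, [value - anchor for value in block[1:]]


def delta_decode(anchor: int, offsets: List[int]) -> List[int]:
    """Rebuild the original block exactly, since the scheme is lossless."""
    return [anchor] + [anchor + offset for offset in offsets]


def fits_signed(value: int, bits: int) -> bool:
    """True if value fits in a two's-complement field of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def bits_per_offset(offsets: List[int]) -> int:
    """Smallest signed field width that can hold every offset."""
    bits = 1
    while not all(fits_signed(offset, bits) for offset in offsets):
        bits += 1
    return bits


def achievable_ratio(block: List[int], pixel_bits: int = 8) -> int:
    """Pick the best fixed ratio (4, 2 or 1) that the encoded block fits into."""
    _, offsets = delta_encode(block)
    encoded_bits = pixel_bits + bits_per_offset(offsets) * len(offsets)
    raw_bits = pixel_bits * len(block)
    for ratio in (4, 2):
        if encoded_bits * ratio <= raw_bits:
            return ratio
    return 1  # store the block uncompressed


if __name__ == "__main__":
    flat = [100] * 16                            # identical pixels
    gentle = [100 + (i % 4) for i in range(16)]  # small local variation
    noisy = [(i * 37) % 256 for i in range(16)]  # large swings
    for name, block in (("flat", flat), ("gentle", gentle), ("noisy", noisy)):
        print(f"{name}: {achievable_ratio(block)}:1")
```

The pattern matches the description above: the smaller the pixel-to-pixel changes in a block, the fewer bits each offset needs, and the higher the fixed ratio the block can qualify for. Blocks with large swings simply fall back to uncompressed storage.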
The images above show a screenshot from Project CARS and compare memory compression on Maxwell and Pascal. Every pixel that is compressed with at least the 2:1 delta color algorithm is colored pink; Pascal has definitely improved.
In general, compression algorithm changes over Maxwell give GP104 an effective 20% increase in memory bandwidth over the GTX 980. Combine that with the 40% improvement in rated bandwidth between the two cards and you have a total improvement of about 1.7x in effective memory performance between the GTX 1080 and the GTX 980.
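The 1.7x figure comes from multiplying the two gains rather than adding them. A quick sketch, using a hypothetical helper name and treating the two improvements as independent (which is a simplification):

```python
def effective_bandwidth_gain(raw_gain: float, compression_gain: float) -> float:
    """Independent multiplicative gains compound rather than add."""
    return raw_gain * compression_gain


if __name__ == "__main__":
    # Rounded figures from the text: 40% more rated bandwidth, 20% from compression.
    rounded = effective_bandwidth_gain(1.4, 1.2)
    # Same calculation using the rated 320 GB/s vs. an assumed 224 GB/s for the GTX 980.
    from_rated = effective_bandwidth_gain(320 / 224, 1.2)
    print(f"rounded inputs: {rounded:.2f}x, rated inputs: {from_rated:.2f}x")
```

Either way, the result rounds to roughly 1.7x.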