Our first DX12 Performance Results
We got a couple of days with the new Futuremark 3DMark API Overhead Feature Test. How do NVIDIA and AMD fare?
Late last week, Microsoft approached me to see if I would be interested in working with them and with Futuremark on the release of the new 3DMark API Overhead Feature Test. Of course I jumped at the chance, with DirectX 12 being one of the hottest discussion topics among gamers, PC enthusiasts and developers in recent history. Microsoft set us up with the latest iteration of 3DMark and the latest DX12-ready drivers from AMD, NVIDIA and Intel. From there, off we went.
First we need to discuss exactly what the 3DMark API Overhead Feature Test is (and also what it is not). The feature test will be part of the next revision of 3DMark, which will likely ship alongside the full Windows 10 release. Futuremark claims that it is the "world's first independent" test that allows you to compare the performance of three different APIs: DX12, DX11 and even Mantle.
It was almost one year ago that Microsoft officially unveiled the plans for DirectX 12: a move to a more efficient API that can better utilize the CPU and platform capabilities of future, and most importantly current, systems. Josh wrote up a solid editorial on what we believe DX12 means for the future of gaming, and in particular for PC gaming, that you should check out if you want more background on the direction DX12 has set.
One of the keys to DX12's improved efficiency is the ability for developers to get "closer to the metal," a phrase indicating that game and engine programmers can access more of the system's power (CPU and GPU) without having their hands held by the API itself. The most direct benefit of this, as we saw with AMD's Mantle implementation over the past couple of years, is an increase in the number of draw calls that a given hardware system can handle in a game engine.
A draw call is, put concisely, a request from the CPU (and the game engine running on it) to draw and render an object. A modern game typically issues thousands of draw calls every frame, and each of those requests adds a level of overhead to the system, limiting performance in some extreme cases. As the draw call count rises, game engines can become limited by that API overhead. New APIs like Mantle and DX12 reduce the overhead by giving developers more control. The effect is one clearly shown by Stardock and the Oxide engine: a game without draw call overhead limits can immediately, and drastically, change how a game functions and how a developer can create new and exciting experiences.
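To make that concrete, here is a minimal sketch in C++ of a per-object submission loop. The `Api` struct is a hypothetical stand-in for a D3D11-style immediate context, not the real DirectX interfaces; the point is simply that every object costs at least one draw call, and all of that submission work lands on the CPU and the driver.

```cpp
// Minimal sketch (not Futuremark's code): why per-object draw calls cost CPU time.
// "Api" is a hypothetical stand-in for a D3D11-style immediate context.
#include <cstdio>
#include <vector>

struct Mesh   { int indexCount; };
struct Object { Mesh mesh; float transform[16]; };

struct Api {
    // Each of these calls ends up in the driver; the validation and state
    // tracking they trigger is the "overhead" the feature test measures.
    void setTransform(const float*) { /* update a constant buffer */ }
    void bindMesh(const Mesh&)      { /* bind vertex/index buffers */ }
    void drawIndexed(int /*count*/) { /* submit one draw call */ }
};

void renderFrame(Api& api, const std::vector<Object>& scene) {
    for (const Object& obj : scene) {          // thousands of iterations per frame
        api.setTransform(obj.transform);
        api.bindMesh(obj.mesh);
        api.drawIndexed(obj.mesh.indexCount);  // one draw call per object
    }
}

int main() {
    Api api;
    std::vector<Object> scene(10000, Object{Mesh{120}, {}});
    renderFrame(api, scene);                   // 10,000 draw calls in a single frame
    std::printf("submitted %zu draw calls\n", scene.size());
}
```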
This new feature test from Futuremark, which will be integrated into an upcoming 3DMark release, measures API performance by looking at the balance between frame rates and draw calls. The goal: find out how many draw calls a PC can handle with each API before the frame rate drops below 30 FPS.
At a high level, here is how the test works: starting with a small number of draw calls per frame, the test increases the call count in steps every 20 frames until the frame rate drops below 30 FPS. Once that occurs, it holds that draw call count and measures frame rates for three seconds. It then computes draw calls per second (frame rate multiplied by draw calls per frame) and displays the result for the user.
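As a rough illustration of that ramp, here is a small self-contained C++ sketch. This is not Futuremark's implementation; the starting load, the step size and the fake `measureFps()` model are all assumptions made just so the loop runs end to end.

```cpp
// Rough sketch of the ramp described above -- not Futuremark's actual code.
// measureFps() is a fake model that assumes the CPU can issue a fixed number
// of draw calls per second, purely to make the example runnable.
#include <cstdio>

double measureFps(int drawCallsPerFrame) {
    const double cpuDrawCallsPerSecond = 15.0e6;   // assumed CPU budget
    return cpuDrawCallsPerSecond / drawCallsPerFrame;
}

int main() {
    int drawCallsPerFrame = 1000;                  // assumed starting load

    // Increase the per-frame draw call count in steps (every 20 frames in the
    // real test) until the frame rate drops below 30 FPS.
    while (measureFps(drawCallsPerFrame) >= 30.0)
        drawCallsPerFrame += drawCallsPerFrame / 10;   // assumed step size

    // Hold that load, measure for ~3 seconds, then report the score:
    // draw calls per second = frame rate * draw calls per frame.
    double fps = measureFps(drawCallsPerFrame);
    std::printf("%d calls/frame at %.1f FPS -> %.0f draw calls/s\n",
                drawCallsPerFrame, fps, fps * drawCallsPerFrame);
}
```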
In order to ensure that the API is the bottleneck in this test, the scene is built procedurally with unique geometries that have an indexed mesh of 112-127 triangles. There is no post-processing and the shaders are very simple to make sure the GPU is not a primary bottleneck.
There are three primary tests the application runs through on all hardware, and a fourth if you have Mantle-capable AMD hardware. First, a DirectX 11 pass is done in a single-threaded mode where all draw calls are made from a single thread. A second DX11 pass is run in a multi-threaded mode where the draw calls are divided evenly among a number of threads equal to one less than the number of addressable cores; that balance leaves one core dedicated to the display driver.
The DX12 and Mantle paths in the feature test are, of course, multi-threaded and utilize all available cores, dividing the draw calls evenly among the full thread count.
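For a sense of how that split works mechanically, here is a hedged sketch using `std::thread`, with a stubbed `recordDrawCalls()` standing in for real command list recording. It is not the test's code; it simply partitions a frame's draw calls among worker threads the way the DX11 MT and DX12/Mantle paths are described above.

```cpp
// Hypothetical sketch of the thread split described above.
// In the DX12/Mantle paths each worker would record its own command list;
// recordDrawCalls() is just a stub standing in for that work.
#include <cstdio>
#include <thread>
#include <vector>

void recordDrawCalls(int first, int count) {
    (void)first; (void)count;  // record 'count' draw calls into this thread's command list (stubbed)
}

void submitFrame(int totalDrawCalls, unsigned workerThreads) {
    std::vector<std::thread> workers;
    int perThread = totalDrawCalls / static_cast<int>(workerThreads);
    for (unsigned t = 0; t < workerThreads; ++t) {
        int first = static_cast<int>(t) * perThread;
        int count = (t == workerThreads - 1) ? totalDrawCalls - first : perThread;
        workers.emplace_back(recordDrawCalls, first, count);
    }
    for (auto& w : workers) w.join();
    // The finished command lists would then be submitted to the GPU queue.
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    submitFrame(100000, cores > 1 ? cores - 1 : 1);  // DX11 MT: one core left for the driver
    submitFrame(100000, cores);                      // DX12/Mantle: all cores record draw calls
    std::printf("done\n");
}
```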
First 3DMark API Overhead Feature Test Results
Our test system was built around the following hardware:
- Intel Core i7-5960X
- ASUS X99-Deluxe
- 16GB Corsair DDR4-2400
- ADATA SP910 120GB SSD
The GPUs we used for this short feature test are the reference NVIDIA GeForce GTX 980, an ASUS R9 290X DirectCU II, the MSI GeForce GTX 960 100ME and a Sapphire R9 285 Tri-X. Driver revision for NVIDIA hardware was 349.90 and for AMD we used 15.200.1012.2.
For our GTX 980 and R9 290X results, you'll see a number of scores. The Haswell-E processor was run in its stock state (8 cores, HyperThreading on) to get baseline numbers, but we also started disabling cores on the CPU to get some idea of the drop-off as we reduce the amount of processor horsepower available to DirectX 12. As you'll no doubt see, six cores appear to be plenty to maximize draw call capability.
Let's digest our results.
First on the bench is the GeForce GTX 980, and the results are immediately impressive. Even using the best case for DirectX 11 multi-threading, our system can only handle 2.62 million draw calls per second, just over 2x the score from the single-threaded DX11 result. DX12, however, brings a substantial increase in efficiency, reaching as high as 15.67M draw calls per second, an increase of nearly 6x! While you should definitely not expect 6x improvements in gaming performance when DX12 titles begin to ship late this year, the additional CPU headroom the new API offers means that developers can begin planning next-generation game engines accordingly.
For our core count reduction, we see that 8 cores with HyperThreading, 8 cores without HT and 6 cores without HT all result in basically the same maximum draw call throughput. Once we drop to 4 cores, the peak draw call rate falls by nearly 24%. A dual-core configuration drops to 7.22M draw calls per second, less than half the peak rate, and at a single core the test manages only 4.23M draw calls per second. We still need to test other CPU platforms to see how they handle both core count and clock speed scaling, but it appears that even high-end quad-core rigs will have more than enough headroom to stretch DX12's legs.
Our results with the Radeon R9 290X in the same platform look similar. We see a peak draw call rate of 19.12M per second on DX12 and an even better result under Mantle, at 20.88M draw calls per second. That shouldn't surprise us: Mantle was written specifically for AMD's GPU architecture and drivers, while DX12 has to be more agnostic so it can function on AMD, Intel and NVIDIA hardware. Clearly AMD's current driver implementation is doing quite well, besting the GTX 980's maximum draw call rate by roughly 3.5M per second. That said, comparisons across GPU platforms at this point are less relevant than you might think. More on that later.
DX12 draw call performance remains basically the same across the 8C with HT, 8C and 6C tests, but it drops by about 33% with the move to a quad-core configuration. Mantle shows a small but measurable 11% drop going from 8 cores to 6, and it is also the only result that scales up when given the full 8 cores of the Core i7-5960X.
Interestingly, AMD shows little to no scaling between the DX11 single-threaded and DX11 multi-threaded scores in the API Overhead Feature Test, which lends credence to the idea that AMD's current driver stack is not as well optimized for DX11 gaming as it should be. The DX12 results are definitely forward-looking and things could shift in that area, but the DX11 results matter to gamers and enthusiasts today, so they are worth keeping in mind.
I also did some testing with a couple of more mainstream GPUs: the GTX 960 and the R9 285. The results here are more than a bit surprising:
The green bar is the stock performance of our platform with the GTX 980, the blue bar is the stock GTX 960, and the yellow bar in the middle shows the results with a reasonably overclocked GTX 960 card. (We hit a 1590 MHz peak GPU clock and a 2000 MHz memory clock.) At stock settings, the GTX 960 shows a 60% drop from the GTX 980 in peak draw calls; that's not totally unexpected. However, with a modest overclock on the mainstream card, we were able to record a DX12 draw call rate of 15.36M, only 2% slower than the GTX 980!
Now, clearly we do not and never will expect the in-game performance of the GTX 980 and GTX 960 to be within 2% of each other, even with the latter heavily overclocked. No game available today shows that kind of difference; in fact, we would expect the GTX 960 to be about 60-70% slower than the GTX 980 in average frame rates. Exactly why the overclocked GPU scales so high is still an unknown; we have asked Microsoft and Futuremark for some insight. What it does prove is that the API Overhead Feature Test should not be used to compare the performance of GeForce and Radeon GPUs to any degree; if the differences in performance inside NVIDIA's own GPU stack can't match up with real-world performance, it is very unlikely that competing architectures will fare any better.
Of course we ran the Radeon R9 285 through the same kind of comparison, stock and then overclocked. In this case we did not see the drastic increase in draw call rate with the overclocked R9 285, but we do see the R9 290X and R9 285 producing scores within 5% of one another. Again, these two GPUs have real-world performance deltas far larger than 5%, proving the above point once again.
And how could we let a test like this pass us by without testing out an AMD APU?
The DX11 MT results refused to complete in our testing, but we are working with pre-release drivers, a pre-release operating system and an unfinished API, so just one hiccup is actually a positive outcome. Moving from the DX11 single-threaded result to what you get with both DX12 and Mantle, the A10-7850K APU benefits from a 7.8x increase in draw call handling capability. That should improve game performance tremendously for properly written DX12 applications, and do so on a platform that desperately needs it.
Initial Thoughts
Though minimal in quantity compared to everything we eventually want to test, the results we are showing here today paint a very positive picture of the future of DirectX 12. Since AMD's announcement of Mantle and its subsequent release in a couple of key titles, enthusiasts, developers and even hardware vendors have clamored for an API with less overhead and higher efficiency. Microsoft stepped up to the plate, willing to sacrifice much of what made DirectX a success in the past to pave a new trail with DirectX 12.
Futuremark's new 3DMark API Overhead Feature Test proves that something as fundamental as draw calls can be drastically improved upon with forward thinking and a large dose of effort. We saw improvements in API efficiency as high as 18-19x with the Radeon R9 290X when comparing DX12 and DX11 results, and while we definitely won't see that same kind of outright gaming performance gain with the new API, it gives developers a completely new outlook on engine development and integration. Processor bottlenecks that users didn't even know existed can now be pushed aside to stretch the bounds of what games can accomplish. It might not turn the world on its head on day one, but I truly think that APIs like DX12 and Vulkan (what Mantle has become under Khronos) will alter gaming more than anyone previously thought.
As for the AMD versus NVIDIA debate, both Futuremark and Microsoft continue to stress to us that this feature test is not a reasonable test of GPU performance. Based on our overclocked GTX 960 results in particular, that is definitely the case. I'm sure you will soon see stories claiming that one party is ahead of the other in DX12 driver development, or that one GPU brand is going to be faster in DX12 than the other, but that is simply not a conclusion you can draw from the data this test provides. Just keep calm and wait; you'll see more DX12 gaming tests in the near future that will paint a better picture of what the gaming landscape will look like in 2016. For now, let's wait for Windows 10 to roll out so we can get final DX12 feature and comparison information, and give Intel, NVIDIA and AMD a little time to tweak drivers.
It's going to be a great year for PC gamers. There is simply no doubt.
Ryan, there’s an uproar on techreport cause you didn’t do 4C/HT enabled. Everyone is sitting with their 47x0k wondering how it compares to the 5960x
Brought to you by the nVidia advanced commercial force :
wherever nvidia loses, it’s not something that matters
wherever they win, it’s what’s important for today
Ryan Shrout, please rename the site.
It’s nothing more than commercials to make ppl buy low end at 200€, middle at 550 and high at 1250.
Time to feed the troll: RTFA
Not a fanboi either way but…
Obvious win for AMD drivers…
Who would have guessed…
Hey, even the R9 285 beats the 980…
Kudos to DICE for wanting it and AMD for building it! Mantle for Windows and Linux… 🙂
Credit where credit is due!
AMD for the win!
This is making me very happy 🙂
This is making my investment in building my 5960x system more and more validated as the right choice.
I’m delighted with mine. I had not expected that it would make that much difference, but it does. I think the future is bright for DDR4 and 2011- and it’s only just begun.
Very interesting results. It’s a shame that there is no DX9 test, since so many games are still using that API.
If anything, this seems to highlight that AMD’s cards have been held back by CPU overheads for quite some time now – particularly when it comes to multi-threaded performance.
The main thing that seems to kill performance in current games is draw calls (primarily the view distance setting) regardless of the GPU that you’re using. Even a relatively “basic” looking game like Borderlands 2 suffered from severe performance issues in certain areas as a result of this.
While I was not enamored with the game, I went back and tested it when I upgraded from a 570 to a 970 (but kept my 4.5GHz 2500K) and though the maximum/average framerates shot up, the minimum framerate in these areas was unchanged – and well below 60 FPS.
So I wonder what this means for AMD and DX9/DX11 games. I have been considering a switch once the 390X is released, but if AMD are only going to focus on moving forward with DX12/Vulkan, it seems as though I might be better off sticking with NVIDIA cards and getting a Titan instead.
I don’t care about whether one card reaches 300 FPS vs another at 250 at the upper limit, I am more concerned about my minimum never dropping below 60.
Adding more GPU power doesn’t seem to help that when the API/drivers are the bottleneck.
And this has me hoping that Intel actually push the single-threaded performance forward soon, rather than being more efficient and adding more cores, because that still seems like it’s going to be a limiting factor until everything is running on DX12/Vulkan. And even then, DX12’s usefulness seems limited to only 6 cores.
“Late last week, Microsoft approached me to see if I would be interested in working with them..”
I never did care for how Ryan phrases some of his articles. Makes it seem like it was a pcper exclusive. Which it is not.
It pretty much comes across exactly as it happened, which was that MS approached Ryan to see if he would be interested in working with them. We had the software early so we could prepare a review in conjunction with the launch. It's how these things work.
This test is certainly not a way to compare specific graphics cards against one another.
BUT it is certainly a benchmark metric to compare software or hardware architectures. Maybe both.
Looks like AMD wins this round in DX12.
CPU: FX-8320
GPU: R9 290X 1090/1550
FSB: 200Mhz
CPU-NB: 2600Mhz
HT: 2600Mhz
PCI-E: 16x150Mhz
RAM: 4x2GB 1333Mhz 9-9-9-18-1T
1500Mhz(200×7.5) 6 753 922
2000Mhz(200×10) 8 596 821
2500Mhz(200×12.5) 10 050 496
3000Mhz(200×15) 11 292 631
3500Mhz(200×17.5) 11 914 490
4000Mhz(200×20) 12 197 349
4500Mhz(200×22.5) 13 238 318
4750Mhz(256×18.5) 14 422 377 (RAM 1364mhz, CPU-NB 2560Mhz, HT 3072Mhz)
core scaling
4 (FX-43xx) 9 190 216
6 (FX-63xx) 12 126 372
8 (FX-83xx) 14 422 377
forgot to add mantle results with 15.3 on win 10041
The fact that jumps out the most to me is that a single core in DX12 beats, by a significant margin, anything in DX11.
Excellent in depth review Ryan. We get an early and interesting preview of what’s possible and achievable with DX 12 and a nice mix of GPUs which should make for another fascinating set of numbers once windows 10 RTM.
Why does a GPU need an interrupt (IRQ or MSI/X)? Me, I found the bottleneck was the default timer used by today's systems (LAPIC for MSI/X). Did MS default the GPU's MSI/X interrupt timer to the invariant time stamp counter instead of LAPIC, or is the GPU doing it internally now? Last I checked, the GPU used an external interrupt (via MSI/X, which uses LAPIC as its default timer)?