Through the looking glass
The new 3DMark Time Spy lets us compare asynchronous compute performance across GPUs. What interesting stuff did we find?
Futuremark has long been the most consistent and most widely used benchmark developer for the PC. While other companies have faltered and faded, Futuremark continues to push forward with new benchmarks and capabilities, maintaining a modern way to compare performance across platforms with standardized tests.
Back in March of 2015, 3DMark added support for an API Overhead test to help gamers and editors understand the performance advantages of Mantle and DirectX 12 compared to existing APIs. Though the results were purely “peak theoretical” numbers, the data helped showcase to consumers and developers what low-level APIs brought to the table.
Today Futuremark is releasing a new benchmark that focuses on DX12 gaming. No longer just a feature test, Time Spy is a fully baked benchmark with its own rendering engine and scenarios for evaluating the performance of graphics cards and platforms. It requires Windows 10 and a DX12-capable graphics card, and includes two different graphics tests and a CPU test. Oh, and of course, there is a stunningly gorgeous demo mode to go along with it.
I’m not going to spend much time here dissecting the benchmark itself, but it does make sense to have an idea of what kind of technologies are built into the game engine and tests. The engine is based purely on DX12, and integrates technologies like asynchronous compute, explicit multi-adapter and multi-threaded workloads. These are highly topical ideas and will be the focus of my testing today.
Futuremark provides an interesting diagram to demonstrate the advantages DX12 has over DX11. Below you will find a listing of the average number of vertices, triangles, patches and shader calls in 3DMark Fire Strike compared with 3DMark Time Spy.
It’s not even close – the new Time Spy engine issues more than 10x the processing calls of Fire Strike for some of these items. As Futuremark states, however, this kind of capability isn’t free.
With DirectX 12, developers can significantly improve the multi-thread scaling and hardware utilization of their titles. But it requires a considerable amount of graphics expertise and memory-level programming skill. The programming investment is significant and must be considered from the start of a project.
3DMark Time Spy is Beautiful
If you haven’t seen the 3DMark Time Spy demo yet, it’s worth checking out the embedded video I created below. It is running on a pair of GTX 1080 cards in SLI at 2560×1440 and was captured externally, not with any on-system tools.
I also put together a compilation of the benchmark tests that run separately from the demo itself. They were run on the same GTX 1080 SLI setup, though I enabled vertical sync on the system to reduce the on-screen tearing in the capture. (This does mean the frame rates you see on screen are not indicative of any kind of Time Spy score.)
Performance Results – Testing Asynchronous Compute
One of the more interesting aspects for me with Time Spy was the ability to do a custom run of the benchmark with asynchronous compute disabled in the game engine. By using this toggle we should be able to get our first verified data on the impact of asynchronous compute on AMD and NVIDIA architectures.
Here is how Futuremark details the integration of asynchronous compute in Time Spy.
With DirectX 11, all rendering work is executed in one queue with the driver deciding the order of the tasks.
With DirectX 12, GPUs that support asynchronous compute can process work from multiple queues in parallel.
There are three types of queue: 3D, compute, and copy. A 3D queue executes rendering commands and can also handle other work types. A compute queue can handle compute and copy work. A copy queue only accepts copy operations.
The queues all race for the same resources so the overall benefit depends on the workload.
In Time Spy, asynchronous compute is used heavily to overlap rendering passes to maximize GPU utilization. The asynchronous compute workload per frame varies between 10-20%. To observe the benefit on your own hardware, you can optionally choose to disable async compute using the Custom run settings in 3DMark Advanced and Professional Editions.
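To make the queue model above concrete, here is a minimal D3D12 sketch of how an engine can create a separate 3D ("direct") queue and compute queue and fence them only where a real dependency exists; on hardware that supports asynchronous compute, work submitted to the two queues is free to overlap. This is my own illustration of the API, not code from the Time Spy engine.

```cpp
// Minimal sketch: one "direct" (3D) queue plus one compute queue on the same
// device. Error handling and command-list recording are omitted for brevity.
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")

using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // 3D ("direct") queue: accepts rendering, compute and copy commands.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // Compute queue: accepts compute and copy commands only. Work submitted
    // here may execute concurrently with the direct queue on GPUs that
    // support asynchronous compute; otherwise the driver serializes it.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // A fence expresses the only ordering the app actually needs between the
    // two queues; everything before the signaled value is free to overlap.
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Per frame (command lists recorded elsewhere):
    //   computeQueue->ExecuteCommandLists(1, asyncComputeLists); // e.g. lighting, post-processing
    //   gfxQueue->ExecuteCommandLists(1, renderLists);           // rendering passes
    //   computeQueue->Signal(fence.Get(), frameIndex);
    //   gfxQueue->Wait(fence.Get(), frameIndex); // only where a pass consumes the compute output

    return 0;
}
```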
I gathered data using our normal GPU test bed and the latest beta drivers from both NVIDIA and AMD.
| PC Perspective GPU Testbed | |
|---|---|
| Processor | Intel Core i7-5960X Haswell-E |
| Motherboard | ASUS Rampage V Extreme X99 |
| Memory | G.Skill Ripjaws 16GB DDR4-3200 |
| Storage | OCZ Agility 4 256GB (OS), Adata SP610 500GB (games) |
| Power Supply | Corsair AX1500i 1500 watt |
| OS | Windows 10 x64 |
| Drivers | AMD: Crimson 16.7.1, NVIDIA: 368.39 |
I ran 8 different GPU configurations through Time Spy with and without asynchronous compute enabled to see what kind of performance differences resulted.
Let’s start with our basic 3DMark Time Spy results. These show clearly that the GeForce GTX 1080 is the fastest single GPU card on the market, followed by the GeForce GTX 1070. The GTX 980 and 970 have decent showings, though they are definitely on their way out of the market. The AMD Fury X competes somewhere between the GTX 980 and the GTX 1070, falling 10% behind the current lowest priced Pascal part. The R9 Nano does very well against the GTX 980, beating it by 11%.
AMD’s Radeon RX 480 based on Polaris does well against the GTX 970 and nearly matches the performance of the GTX 980! This is a good sign for the company’s new $199-239 graphics card.
This next graph is more complex – it combines the results above with our scores with asynchronous compute disabled.
First, an explanation of the data: the blue bar is the graphics score in Time Spy with asynchronous compute enabled, the red bar is the graphics score with asynchronous compute disabled, and the green text shows how much scaling each GPU configuration sees going from async disabled to enabled. The higher the scaling shown on the green line, the more advantage asynchronous compute offers for that graphics card and platform.
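For reference, the green number works out to a simple relative gain (my shorthand, not an official Futuremark formula): scaling % = (graphics score with async enabled ÷ graphics score with async disabled − 1) × 100.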
Let’s start with the positive results here, in particular with AMD hardware. Both Fiji and Polaris see sizeable gains with the inclusion of asynchronous compute. The Fury X is 12.89% faster with async enabled, the R9 Nano is 11.06% faster and the RX 480 is 8.51% faster. This backs up AMD’s claims that the fundamental architecture design of GCN was built for asynchronous compute, with dedicated hardware schedulers (two in the case of Polaris) included specifically for this purpose.
NVIDIA’s Pascal-based graphics cards are able to take advantage of asynchronous compute, despite some AMD fans continuing to insist otherwise. The GeForce GTX 1080 sees a 6.84% jump in performance with asynchronous compute enabled versus having it turned off; the gap for the GTX 1070 is 5.42%. The scaling is less than we saw on both Fiji and Polaris from AMD, which again indicates that AMD has engineered GCN around asynchronous compute more than NVIDIA has with Pascal.
I did add GTX 1080 SLI scores to this graph to show that asynchronous compute scaling drops dramatically with multiple GPUs, while also demonstrating the multi-GPU scaling capability of Time Spy. Adding the second GTX 1080 results in a score that is 80% better than a single card; that’s excellent scaling and much better than I expected for early DX12 results.
Now, let’s talk about the bad news: Maxwell. Performance in 3DMark Time Spy with the GTX 980 and GTX 970 is basically unchanged with asynchronous compute enabled or disabled, telling us that the technology isn’t being utilized. In my discussions with NVIDIA on this topic, I was told that async compute support isn’t enabled at the driver level for Maxwell hardware, and that it would require both the driver and the game engine to be coded for that capability specifically.
This is just a guess based on history and my talks with NVIDIA, but I think there is some ability to run work asynchronously in Maxwell that will likely never see the light of day. If NVIDIA were going to enable it, they would have done so for the first wave of DX12 titles that used it (Ashes of the Singularity, Hitman) or at the very least for 3DMark Time Spy – an application the company knows will be adopted by nearly every reviewer immediately and used for years.
Why this is the case is something we may never know. Is the hardware support actually non-existent? Is it implemented in a way that coding for it is significantly more work for developers compared to GCN or even Pascal? Does NVIDIA actually have a forced obsolescence policy at work to push gamers toward new GTX 10-series cards? I think that final option is untrue – NVIDIA surely sees the negative reactions to a lack of asynchronous compute capability as a drain on the brand and would do just about anything to clean it up.
Regardless of why, the answer is pretty clear: NVIDIA’s Maxwell architecture does not currently have any ability to scale with asynchronous compute enabled DX12 games.
Closing Thoughts
Frequent PC Perspective readers will know that we do not put much weight on 3DMark scores when making our final recommendations for graphics cards, but the data the benchmark provides can be very useful for differentiating specific features and for giving users a “spot check” to compare their own hardware against the results in our reviews. Futuremark’s new 3DMark Time Spy is a great addition to this suite of tools, and as one of the first dedicated DX12 tests it will have thousands of users looking forward to running it on their own hardware this week. If nothing else, it provides a gorgeous new demo to show off your new GPU hardware to family and friends!
You can pick up the free demo or the full version over on Steam or at Futuremark’s website.
This doesn’t match the standing from the Doom benchmarks at all. Here we see the RX 480 equal to a GTX 980; in contrast, Doom shows it more in the GTX 980 Ti range, a HUGE delta vs. a plain GTX 980. Could it be that games don’t use the same workload balance, and so this Futuremark benchmark is very “synthetic”?
It implies that depending on the engine you are going to see varying results.
Erhm, not sure what you mean, but the Doom Vulkan reviews I have seen put the RX 480 equal to the GTX 980, not the GTX 980 Ti. I.e.:
http://www.eurogamer.net/articles/digitalfoundry-2016-doom-vulkan-patch-shows-game-changing-performance-gains
at 2560x1440p : rx480 79fps vs GTX 980ti 85.1fps
The GTX 980ti gets you an extra 8%, so I can see how this can be considered not equal.
https://www.computerbase.de/2016-07/doom-vulkan-benchmarks-amd-nvidia/
BTW, I’m curious why the same game with the same card shows an 8% delta in one test, but your link shows a 28% delta.
something is fishy
Doom doesn’t use async shaders on Nvidia cards yet. Waiting on a driver update.
Just like the async driver for Hitman, and Ashes, right? (;
Clearly Pascal uses async.
Amd fanbois wrong again.
No driver update is going to use async shaders that are simply not there fully in the GPU’s hardware on Nvidia’s GPUs. Nvidia is going to have to use software/middleware/drivers to manage Pascal’s shaders in an async-compute emulated in software fashion.
Nvidia’s shaders are managed by software that can not respond as quickly as AMD’s fully in hardware async shaders can to changing asynchronous workloads.
So wait for Nvidia’s software/driver solution but do not expect that any software/driver based attempt at managing A GPU’s shaders is ever going to best the full async shader hardware implementation that AMD is using. AMD’s asynchronous shaders and ACE units and hardware schedulers are fully implemented in the GPUs hardware, and do not need any slower software/driver emulation layers to do their job.
You would not expect that Intel would want to implement its version of SMT, HyperThrading(TM) in software, as that would be a disaster for its CPUs performance and IPC metrics. And SMT(Simultaneous Multi-Threading) is the very definition of an async-compute type of functionality on a CPU/processor, the same goes for any GPU/Processor and a GPU’s need to manage its GPU/processor threads with hardware based schedulers to achieve the most optimal results. Software is simply not fast and responsive enough to be used to manage any processor’s core execution resources, especially where processor management of multiple processor threads is concerned. GPU’s are very multi-processor-threaded in nature and can run thousands of processor threads so any software/driver based shader management is never going to be as responsive to any rapidly changing asynchronous events as having the management done by dedicated asynchronous management hardware units in the GPU’s hardware.
Enjoy your emulated async shader experience (with all its added latency toppings), because it is not managed by the GPU’s hardware with Nvidia’s current GPU offerings.
That isn’t a valid comparison. SMT is necessarily a very fine grained process. It would be nearly impossible to do in software. The EPIC/IA-64 mess was kind of an attempt to implement software scheduling for CPUs. It did not work well. GPUs are completely different level. An asynchronous compute job is generally very large chunk of processing. This is also why it is worthwhile to leave shader code in a higher level representation. A shader will generally be streaming SIMD code with a very small amount of code for a very large amount of data. It makes sense to compile it for the specific architecture at runtime, rather than setting a specific ISA that all architectures must implement or emulate like what Intel has done with Xeon Phi. Requiring a complete recompile is okay in the HPC market, but not for a consumer Windows application.
Implementing the scheduling for such a coarse grained processing model in software is not necessarily a bad idea. The hardware still needs to support allocating resources for these jobs among other things like quick task switching and preemption. The support for these features in Pascal seem to be workable, but probably not optimal. From the little I have read about it, it sounds like it has a task switching penalty that will reduce efficiency. Nvidia can afford to throw a lot of software developement resources at the problem though, even if hardware features are lacking. Nvidia’s previous architectures do not seem to have workable hardware features, so asynchronous compute will probably never be a usable feature on Maxwell. Previous AMD architectures should be able to handle it fine though. All GCN based cards have ACEs; 2 in the first generation, 8 in later GPUs until they seemed to switch to 4 in Polaris 10. The reduction to 4 in Polaris is one of the things that make me wonder if AMD will be pushing multi-gpu heavily, but that is another discussion. The AMD 290/390 cards are still good performers and should get good software optimization for a while, they are just going to cost a bit more on the electric bill. The Maxwell based cards are not in that great of a support situation though. A lot of the low end performance targeting will be for consoles which all support asynchronous compute. The 970 has a large enough installed base for it to get targeted for a while, but that support may dry up quickly, leaving them with code paths not well optimized.
Edit: This answer is from Ryan Smith from AnandTech to the question of whether Pascal supports true asynchronous compute.
Ryan Smith – Thursday, July 14, 2016 – link
“Wait, isn’t Nvidia doing async, just via pre-emption? ”
No. They are doing async – or rather, concurrency – just as AMD does. Work from multiple tasks is being executed on GTX 1070’s various SMs at the same time.
Pre-emption, though a function of async compute, is not concurrency, and is best not discussed in the same context. It’s not what you use to get concurrency.
Processor-thread concurrency managed by software (Nvidia) versus processor-thread concurrency managed in hardware (AMD): which is going to be more responsive? Look for an example at Intel’s in-hardware concurrent management of processor threads in its version of SMT (Simultaneous Multi-Threading), HyperThreading(TM), and see that no software/driver is ever going to be able to manage an execution unit’s FP/Int/other pipelines in a responsive enough manner. It’s about managing execution pipelines that are changing states faster than even the system clock, so how would any software/driver management be effective for managing a processor’s execution units?
The execution unit’s pipelines have to be managed with hardware that can preempt one processor thread’s running state and save that state and load another thread’s state and begin processing that thread’s workload with very little or no pipeline bubbles(wasted pipeline slots). And there are many events that will trigger the hardware scheduler/dispatcher attention like a stalled thread waiting for a data/computation dependency to complete, or a request for a higher priority thread to preempt a lower priority thread’s execution to get a very low latency time dependent task completed. And Any software based processor-thread management scheme written in code itself that has to be fetched, decoded, dispatched, and executed is going to waste many pipeline slots that will be needed to be managed at a faster rate than even one op-code can be executed in any software managed solution.
Even the best attempts at hiding latency with drivers and optimized kernels/threads is not going to be able to manage any asynchronous event that may obsolete the currently running processor threads and force them to have to be suspended and new processor threads loaded and worked on. The management of any asynchronous GPU processor thread events is going to have to be done by hardware schedulers/dispatch units on all the different types of GPU shaders, and any other hardware units that need to respond to changes with the lowest latency possible or execution pipeline cycles will be wasted and GPU execution resources will go underutilized even with work waiting in the queues needing to be done.
P.S. “true asynchronous compute.” You are missing one important qualifying phrase in that quoted material, and that phrase is Hardware Based!
True [Hardware Based] asynchronous compute, is what is being discussed and argued. Also your statement: “Pre-emption, though a function of async compute, is not concurrency, and is best not discussed in the same context. It’s not what you use to get concurrency.”
Pre-emption, the hardware kind of preemption on a processor’s core, is what happens in all processors that are using SMT/SMT-like hardware and scheduling processor threads on a processor’s core/GPU’s shaders/etc. How else is a processor going to be able to manage its processor threads without an instruction scheduler/dispatcher able to suspend/preempt one thread running on a core, change the context to another thread, and work on that thread’s code?
Concurrency can not be managed without preemption. Concurrency on a processor’s shared execution resources is managed by the scheduler/dispatcher, and a thread of instructions waiting on a data/code dependency has to be preempted by the scheduler/dispatcher so the running thread can be context switched out and a new thread’s code context switched in and its code executed.
I think you are confusing the software kind of “thread” with the hardware kind of processor thread/hardware thread management that any code does not see if the processors thread management is implemented in hardware. That hardware processor thread management happens at below the instruction level on CPUs/and AMD’s GCN GPU ACE units and hardware schedulers/etc. It’s not about software premption or management of an application’s software “Threads” that is done by the OS and application software etc. It’s about processor thread management fully in the processor’s hardware to manage the processor’s Processor-Thread asynchoronous events, which needs to be managed by fully in the GPU’s hardware by hardware based types of scehduler/dispatch(hardware scheduler) units, ACE units etc.
Sure there can be outside events that are passed down into ring 0 of the OS kernel to trigger an OS level of preemption and cause the CPU to be tasked with other work, and Windows, Linux, and OSX/MacOS are preemptive multi-tasking OSs that manage 100’s of running services and applications/software threads. And even GPUs are running basic OS(Embedded OS) that the user never sees to manage the GPU’s on PCIe card many assets! GPUs are processors too with their own memory and firmware and Embedded OS to manage the GPU/GPU’s command buffer/command queue while it works on the CPU/Game issued kernels. GPUs are running an embedded OS that looks for and services its command buffers/queus and that Embedded OS and firmware manages GPUs resources memory/etc.
There is concurrency and parallel execution happening on a GPU’s core at the same time that is driven by preemption in the software/OS and by the hardware instruction kind of processor thread preemption where AMD has an advantage with its hardware asynchronous shaders, ACE units and hardware schedulers.
You are going to have to rely on other sources besides Ryan Smith and other reviewers from other websites, because they have review manuals with strict restrictions about what can and can not be said by a reviewer or website. And most websites get their bread and butter from the makers of the devices that they review. That includes future review samples, and advertising. It’s just the nature of the beast and the industry-wide conflict of interests!
There is the software based async-compute for software async-events and concurrency management and there is the hardware based/Processor’s Processor-Thread async-events/compute management that the software never sees. And the hardware Processor-Thread management part can never be as efficiently done with software/driver code, it has to be managed on the processor’s hardware by the specialized on core/on GPU shader units and ACE units and hardware schedulers. It’s best to read what the games developers/VR games developers are saying about fully in the hardware async-compute and processor-thread management units. The gaming engine developers are the ones with the most information about the hardware that they are developing for, so they have the Real manuals and can to a point talk more freely about the GPUs hardware, even the developers are under limiting NDA with some of the manuals that they have access to, so they can not be as specific with regards to some of the GPU’s hardware facilities.
I disagree. There is pre-emption in Nvidia because it operates in serial, whereas AMD operates similar to Intel’s hyper-threading, or in parallel. Nvidia’s pre-emption works better along a single flow of data, suiting DX11, whereas AMD has many lanes or threads to earmark for whatever the developer wants to carry on that thread (DX12, Vulkan). Maxwell cards are the first to have pre-emption, and my guess is Nvidia might expand on this by juggling where on the single thread it wants to chop and change tasks, but in the end this technology is the only thing standing in the way of an AMD tsunami and Nvidia knows it. It might not feel like it now, but it’s starting to happen.
Oh yes they do. Works on pascal at least. But only in places where the gpu was not fully utilized before.
There was a new driver out today. To use async you are supposed to use TSSAA, which, when I used it, gave a good 30% boost in fps turning it on, but Vulkan was still 25-30% behind compared to OpenGL, which will likely have to be an update. I tested Vulkan vs OpenGL on the last driver, not the new one just out.
Completely wrong information, and you have no clue what async compute is.
First of all, Async Compute is a concurrent execution model, not parallel; it ensures IDLE CYCLES of a GPU are completely utilized, which results in less time for more unit work done on the GPU.
Pascal has hardware async support: dynamic load balancing in Pascal is implemented in hardware. Unlike async shaders, which are individual units in GCN, the async compute in Pascal is each individual core itself.
So PASCAL SUPPORTS ASYNC COMPUTE IN HARDWARE, and ASYNC COMPUTE is not the same as ASYNC SHADERS.
ASYNC SHADERS in GCN are just one implementation of ASYNC COMPUTE.
NVIDIA has implemented the same thing differently in PASCAL HARDWARE in the form of its dynamic load balancer, aided by improved pre-emption (even though pre-emption is not async as such).
Kindly don’t come and dump AMD marketing and fanboyism here
For AMD’s async shaders the async compute is fully implemented in the hardware, and on Pascal CUDA code is required to get at any instruction level scheduling granularity! AMD’s hardware solution is faster and more responsive. AMD still can interleave graphics and compute threads to a better degree on its hardware.
Yes Async Compute is a concurrent execution model but AMD GCN GPUs have ACE units and hardware schedulers that manage the concurrent execution of processor threads in parallel across thousands of available CUs. So AMD’s hardware managed thread context switching is more responsive to changing workloads. So it’s AMD’s full hardware enabled async shaders, and ACE units and hardware schedulers working to manage many thousands of concurrently executing GPU processor threads that makes for the low latency hardware managed response on AMD GPUs that is getting that extra boost from Vulkan/DX12 optimized games. AMD’s thread context switching and thread priority management features on Polaris are even more efficient in getting any stalled threads context switched out and other threads started up to keep the GPU’s execution resources more fully utilized.
Polaris even has instruction pre-fetch to help keep the execution units loaded with very little in the way of pipeline bubbles having to be introduced to address any single thread induced execution latencies. With each GCN generation AMDs ACE units and hardware schedulers and other units are becoming more CPU like in their abilities to handle all types of asynchronous workloads, both graphics and compute workloads.
Pascal is an improvement for Nvidia, but there needs to be more done by Nvidia to fully integrate the entire hardware managed process into its GPUs and still Nvidia has less resources for FP/int compute. Look at the RX480s total FP compute/flops ratings relative to the GTX 1060’s, and the GTX 1060 is clocked much higher.
Again with digressing and argumentation of the term async-compute, when the argument is not about async-compute it is about having the async-compute management fully implemented and managed in the GPU’s hardware like AMD has been doing as opposed to trying to manage the async-compute events in software like Nvidia is doing.
Try again; you keep making the same mistakes and inaccuracies in fully shallow fashion.
It’s not about fanboyism for AMD it’s about the hardware technology that even the VR games makers want all GPUs to have, so Nvidia needs to do more. Nvidia needs to be forced by competition to lower its prices, and to devote more time to engineering into its GPUs more compute resources and full hardware management of any async-compute assets on its GPUs. Software management of the concurrent execution of many thousands GPU processor threads is not going to be as efficient and as responsive as full hardware management is! Just Ask Intel about how they manage their version of SMT, HyperThreading(TM) on their CPU/processors for Intel’s concurrent execution of processor threads!
How many times are you going to post the same garbage? Your way of approaching such a complex matter with such shallowness is overwhelmingly impressive.
How much in funds does Nvidia have to hire folks like you to attack other posters? If you want to attack, do it point by point. Maybe Nvidia needs to get their marketing department out of the process of interfering with their white-paper engineers/writers. But software/driver management of hardware functions such as any processor-thread management is never going to be as good as hardware-managed processor-thread scheduling and dispatch; just go and tell Intel to move their SMT/HyperThreading hardware management units into software, and see the response you will get. GPUs are processors too, and some GPUs are getting more of that CPU-style functionality for hardware-based async-compute/GPU processor-thread management implemented fully in the GPU’s hardware.
Nvidia has improved its GP100 thread scheduling granularity but it’s still managed in software and software is not as responsive to asynchronous events as hardware is, especially for processor-thread management. Oh yes if you want instruction level granularity control on GP100 you are going to need CUDA to do the job, we would not want to ruin Nvidia’s vendor lock-in would we.
I can not wait for AMD to get back into the server/GPU accelerator business with its Zen based server SKUs and its Zen/Vega/Greenland HPC server APUs on an interposer module with HBM2. Let’s See if NVLink can compete with AMD’s HPC/Server/Workstation APU on an Interposer module SKUs for Raw CPU cores to GPU die effective bandwidth. It’s going to be good to see some more competition in the GPU accelerator market.
Just because something happens in one test doesn’t mean that it has to happen in another. Engines behave differently, things are coded differently.
Pretty sure the next Crimson driver release notes will say “performance improvements of ~15% for 3DMark Time Spy”.
Synthetic indeed. This is supposed to test DX12 performance. Doom cannot be compared, as it uses the DX12 alternatives, OpenGL or Vulkan.
This benchmark is going to favor AMD highly. The compute shader invocations went up from 1.5 million to 70 million, almost a 4,667% (46.7x) increase, and from 8.1 million to 70 million, almost an 864% (8.6x) increase.
The tessellation, on the other hand, went up from 500k to 800k for a 60% (0.6x) increase, and from 240k to 2.4 million for a 1,000% (10x) increase.
Whose graphic hardware is heavy on compute? Well before Pascal anyways.
Maybe we should accuse them of biasing toward AMD. A little piddling 6.1% average worth of async help for Pascal on Nvidia’s side isn’t going to do much against such gimping.
Oh hold on AMD still ended up with more improvement of 12% average for Fury cards and 8.5% for Polaris Rx 480. WTF are AMD fanboys on. You still get double benefit with older Fury cards and you’re still not happy. LMAO
……….I am starting to think Nvidia forsook driver-based async compute for Maxwell and enabled it only for Pascal so Nvidia can market Pascal instead.
They weren’t really showing off Pascal’s ability to handle async compute, dude. Usually they would go “Our new card rocks! Look what it can do!” But this time, they were quite cryptic and unenthused about async compute in particular.
Also (and this is just a pure guess), I think the gains from Maxwell getting a proper driver to handle async compute would just not be worth the time and effort for them. They probably looked at it and went… “well, our cards are still good performers, so we can afford to leave them as they are”.
Hell, I’ve heard some higher ups in Nvidia say that they’re not happy with even how pascal handles async compute. And that they can do much better.
I think maxwell just doesn’t have the hardware to do asynchronous compute efficiently; no driver update is going to make it workable. This will leave developers supporting multiple paths. One for consoles and AMD graphics cards. Possibly a tweaked version for modern Nvidia cards; it is unclear whether they are using exactly the same path for pascal graphics cards since their asynchronous support is different. They then need another version for old Nvidia cards, anything Maxwell and before. AMD has had asynchronous compute engines for a long time with all GCN cards, so they should be able to use the same or very similar code paths for almost all AMD hardware (consoles and all GCN based PC hardware). Nvidia Maxwell cards do seem to be set up for premature obsolescence, while AMD cards are almost too forward looking. The 290/390 probably burns a lot of power on ACEs even though they came out years before asynchronous compute became an important feature. They are still a good card now though, except for the power consumption.
It does sound like running asynchronous compute will take some extra developer work, but I am wondering if some of that can be done in the game engine, without needing developers to do a lot of very low level work. I would also like to know how much different the game engine will need to be for consoles vs. AMD graphics cards. If asynchronous compute can get more performance on the Xbox One or PS4, then it is probably going to be worth it for developers to make use of it. Although, the memory system is still different unless you have an APU based system. I don’t know how far we are from having HSA with dedicated GPUs, if we ever get there. Asynchronous compute sounds like a great idea for making use of extra compute resources though. Asynchronous compute jobs shouldn’t care where they run, so it seems like they could set it up such that asynchronous jobs could run on a second GPU without any explicit multi-gpu support. It seems like they should be able to run such asynchronous jobs on an integrated GPU, where running explicit multi-gpu with such unequal GPUs would not be worth it.
You should also add GPU % utilization before and after AS.
You are right, Ryan. There is a lot of misinformation being spread about Pascal not having ASync support. And I’m sure AMD fanboys will now change from “Pascal doesn’t support ASync” to “Pascal isn’t as good in ASync as AMD is”.
nVidia has implied they are still not done optimizing DX12 and Vulkan. And nobody should be counting anyone out this early.
NVidia Pascal does NOT have hardware asynchronous compute support. Period.
What they did, compared to Maxwell, is implement more accurate preemption with faster load balancing, which gives Pascal certain async-like capabilities, but it is still not proper support. Where GCN can switch its shaders between Graphics and Compute mode with no overhead, NVidia’s CUDA cores still have to do a context switch, which is quite expensive time-wise. New technologies allow them to reduce the number of such switches that have to happen as well as reduce the overhead, but nothing more.
Wrong, Pascal has hardware async support in the form of its new dynamic load balancer (in hardware), which basically ensures concurrent execution and that no cycles are wasted. This is exactly what async compute is about: not parallel execution but concurrent execution.
The hardware implementation in GCN uses separate units (async shaders) for this, whereas in Pascal this is part of each core/SM.
They achieve the same async compute via different hardware implementations; that is what it is.
PS: async shaders are not the same as async compute.
“I did add in GTX 1080 SLI scores to this graph to show that asynchronous compute scaling drops dramatically, while also demonstrating the scaling capability of Time Spy. Adding in the second GTX 1080 results in a score that is 45% better than a single card; that’s a far cry from the 80-90% scaling rates we often saw for multi-GPU configurations on DX11, but it is at least a start for DX12.”
those numbers seem off, 12412/6886 = 1.802, i.e. that’s 80% scaling in my book not 45%. I’d say it is scaling as expected.
They measured 45% from the top score of 12412 – 45% of that is 5585, giving 6827 for the single card; the difference must be slight rounding.
Not sure which way it normally is calculated though.
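For what it’s worth, both readings appear to fall out of the same two scores quoted above: (12412 − 6886) / 6886 ≈ 0.80, i.e. the second card adds roughly 80% on top of a single GTX 1080, while (12412 − 6886) / 12412 ≈ 0.45, i.e. the same delta expressed as a share of the SLI total, which is where the 45% figure seems to come from.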
Let’s see some AMD fanboy post in here that Pascal lacks async now LOL.
I would like to see performance boosts from older GCN cards as well. My guess is maybe they didn’t age gracefully. Having weaker async hardware much like Maxwell did.
Also would like to see power consumption numbers for Time Spy, with and without async enabled. Maybe this could replace FurMark as a stressor of video cards for DirectX 12.
This is NOT async compute, GTX cannot switch between shader modes without overhead. GCN has hardware support for that, NVidia uses preemption with load balancing to reduce the costs.
The dynamic load balancer essentially ensures concurrent execution / async compute, and where did you hear the overhead rumor? Do you have proof or some AMD marketing slide?
The overhead is basically in Pascal’s tier 2 resource binding capability. That is clear proof that Pascal is a revisited Maxwell.
“I would like to see performance boosts from older GCN cards as well. My guess is maybe they didn’t age gracefully. Having weaker async hardware much like Maxwell did.”
I too would like to see older GCN cards in this benchmark, but just because Maxwell shows no gains with DX12 or Vulkan doesn’t support throwing out a guess that GCN cards might also show low performance in this benchmark; because it ignores numerous other DX12 and Vulkan benchmarks that have been published. It also ignores the physical specs of the hardware. A R9 290X from 2013 (GCN 1.1) has 8 Asynchronous Compute Engines, to guess a card like this would not perform better in DX12 is completely unsupported and sounds like Maxwell sentimentalism.
Not sure why it’s so hard for people to give AMD a nod that maybe they did a good job by including this feature in their silicon. I own cards from both brands, and think NVIDIA did a better job with their schedulers in the GTX 1080 than in Maxwell, so there- a compliment even though I personally don’t like NVIDIA’s business ethics.
And I have no delusions that my R9 290X would be in the same league as the 1080, but for $290 two years ago, being able to get 30% better performance (edit: vs my same card in DX11) while playing a game like Doom is a perk, that should be recognized and not envied.
As you can clearly see from the CPU scores that are higher for the RX 480, the Nvidia driver has to work hard to get a performance benefit from enabling async.
The videocardz test showed that AMD doesn’t suffer at all in the CPU department when async is enabled while Nvidia’s Pascal does.
interestingly the rx 480 wins the cpu scores both with async on and off.
Ryan biased as always.
The % increase when asynchronous compute is turned on is a result of the improved compute preemption found in Pascal over Maxwell
Maxwell = Preemption
Pascal = Maxwell Preemption with added Compute Preemption
To Nvidia preemption = Asynchronous
During Maxwell
http://international.download.nvidia.com/geforce-com/international/images/nvidia-geforce-gtx-980-ti/nvidia-geforce-gtx-980-ti-directx-12-advanced-api-support.png
Now with Pascal compute preemption = asynchronous
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
Pascal does more than pre-emption (which is not necessarily async); Pascal has a hardware dynamic load balancer, which is its implementation of async compute in hardware.
The reason for the improvement is dynamic load balancing, not pre-emption (though improved pre-emption does add to async, it is not really async).
Marketing for an improved scheduler.
Ryan Smith’s answer
Ryan Smith – Thursday, July 14, 2016 – link
“Wait, isn’t Nvidia doing async, just via pre-emption? ”
No. They are doing async – or rather, concurrency – just as AMD does. Work from multiple tasks is being executed on GTX 1070’s various SMs at the same time.
Pre-emption, though a function of async compute, is not concurrency, and is best not discussed in the same context. It’s not what you use to get concurrency.
link: http://www.anandtech.com/comments/10486/futuremark-releases-3dmark-time-spy-directx12-benchmark/508014
Ran it and got a nice 5694 score with my OC GTX 980 Ti
If games continue coming out showing massive improvements on AMD cards and none for Nvidia, I wonder what Futuremark will have to say about it.
For now this will cheer up Nvidia users. They finally have a ….game that makes their GPUs look a little better under DX12.
Those Nvidia “async drivers” for Doom, Hitman, Ashes, and likely every AAA title released this year are surely on their way.. any day now… Right? (;
My understanding is that AMD needs a little help from the developers and a low-level API to get maximum performance. Nvidia doesn’t.
That is why Nvidia is pretty much plug and play. Whilst AMD is plug, hope it doesn’t melt motherboard, wait for Vulkan support, then lick a giant warted asshole, then play.
Nvidia needs people like you. Keep buying Founders Editions.
With Nvidia drivers you get near 100% performance outta their cards, whereas with the AMD driver being so unoptimized you lose a ton. Look at DX11 vs DX12 as proof. Yes, AMD gains a ton cause their DX11 driver is junk, whereas on an Nvidia card the DX11 performance is pretty much already as close as it gets to what DX12 gives.
That’s one of the most absurd arguments I’ve ever heard; it’s a little more complicated than that.
DX12 brings new technologies like low level hardware abstraction, reduction in draw call overhead, Explicit Multi-Adapter, combined video memory, Multi-GPU Resource Pooling etc.
The AMD architecture is well suited to take advantage of some of these new techs, whereas the Nvidia Maxwell generation depended on raw speed and won’t do so well going forward.
Yes, the AMD chips ran a little hotter and took more energy; part of that was hardware based, in that they had 2, 4, then 8 ACE units that Maxwell left out. And as for the DX11 drivers, huge generalization. The same 3-year-old card that is doing so well with DX12 and Vulkan, the R9 290X, was shown in its review on this site running neck and neck with the Titan and besting the GTX 780.
It’s like comparing the Pentium 4 Netburst architecture against the Athlon 64 dual-core CPUs; the differences didn’t have to do with driver optimization, they were very different hardware designs.
The GTX 1080 is definitely a better design and has the performance crown, but you have to consider the design of the chips. I doubt all the drivers in the world will fix Maxwell so that it gains performance in DX12.
https://pcper.com/reviews/Graphics-Cards/AMD-Radeon-R9-290X-Hawaii-Review-Taking-TITANs/Battlefield-3
Yes, it’s complicated, but your argument is actually worse. I don’t know where you see Maxwell having more “raw speed” (as you call it) than its GCN counterparts; it’s actually the opposite, with Radeon having more shader resources but not being able to fully use them.
That’s why some asked for GPU utilization numbers.
And another big thing you are missing is that unlike the rest of the DX12 and Vulkan titles, which are console ports – meaning designed from the ground up for GCN – 3DMark is developed specifically for the PC.
By raw speed I simply meant Maxwell performed well and had high frame rates in DX11 games, sometimes besting the available AMD offering of the same period, but it won’t be able to take advantage of DX12 due to the design and architecture of the hardware, whether DX12 is ground up or not.
Ew, no thanks. I mean I do love nvidia, but I have standards.
Why in God’s green earth would I want to “lick an NVidia fan”
Here is a good pic from TomsHardware regarding Pre-Emption and Asynchronous Shaders to help the conversation along:
http://media.bestofmicro.com/E/O/592800/original/pastedImage_5.png
http://www.tomshardware.com/reviews/amd-radeon-rx-480-polaris-10,4616.html
Really nice visualization. Thanks, mate
Anedia boys, the game is over. There will be no driver from Anedia to support async compute.
Anedia has controlled millions of fans by false promises.
Say what now?
I think “Anedia” = Nvidia
I have no idea what the correlation is, but it’s the only thing that makes sense to me.
“In numerology the name Anedia has the birth path 7 and it’s meaning is connected to faith, knowledge and openness to others.”
Nawww, he loves nVidia!
bhuahahaha
Nice, that one made me laugh.
It doesn’t matter, the brainwashed followers of the church of Anedia (the nvidiots) will still continue to buy flounders edition cards and keep Anedia’s bank manager happy.
And what games actually are performance-bound by Async compute? That’s right – not a single one. It’s only when you set up ridiculous test scenarios using settings no one would ever use that you run into this.
Nvidia aren’t stupid. They talked to devs, found out the feature was “meh”, and passed on it.
Devs publicly praised async compute. Who on earth did Nvidia talk to? They can’t talk them out of using it on consoles and bringing the same code to PC, so that’s a bust.
Like when they talked to the AotS devs last year and told them that Maxwell had Async compute capability?
And then when the AotS devs tried to use it and it absolutely wrecked Maxwell, Nvidia talked to the AotS devs and tried to get them to abandon Async compute entirely?
And when the AotS devs refused, and instead just disabled it on the Nvidia code path, Nvidia stopped talking to the devs and talked to the media and whined about how the AotS devs were just doing it wrong?
Face it, the only time Nvidia talks to game devs is to try to push ShameWorks libraries onto them (only to tie them into a legal requirement to use the latest libraries available, and then push them an update a week before release which – mysteriously – wrecks performance on anything non-Nvidia) or to try to stop them from using anything that makes AMD look good.
The Day the Earth Stood Still.
Would be nice if you guys went into more technical detail on this. Especially what this asynchronous compute implementation is. Sounds like they are just using it for shader efficiency. I can only guess but it doesn’t really seem like what most games are going to be doing, at least not all they will do.
Since you guys should know more it would be nice to dig deeper rather than assume its all the same. What I would like to know is if they are submitting compute tasks concurrently or if “overlapping render passes” does not do that. Async also involves other components of the GPU beyond just the shaders apparently.
That overlapping sounds like filling gaps in the GPU with work from another render pass. Sounds like preemption to insert that in those gaps.
Just guessing, but the point is you should really go deeper into this.
Again it is up to game developer to code in asynchronous compute support for Nvidia in dx12. There’s a snowball’s chance in hell an AMD allied developer is going to put it in their game for Nvidia. Sorry that’s just the way I see it.
Assuming your premise is accurate, my response is simply, suck it up buttercup. Nvidia’s been doing that to AMD for YEARS. Now you get to deal with it for a while.
Been doing exactly what? Kicking their butt in the AIB market? Oh, I think you mean rigging games with unnecessary tessellation or Gameworks, right? It’s nothing like the absurd amounts of compute AMD games need, even if there is a more efficient way of doing it. Most Gameworks features are optional and you have a tessellation cheat in CCC that can limit it. So what exactly are you referring to?
In fact AMD was in the lead at the start of DX11 and touted tessellation performance as well. They had at least a 6-month lead due to their relationship with Microsh*t. Same as any recent DirectX launch. But who’s cheating? Nvidia owned DX11 over time so much that AMD had a new one created where they could dominate. But sadly they will end up losing here too. It’s only a matter of how much time. Nvidia is already leading after almost one year. What a sad adoption rate for DX12 (probably poorer than DX10). Yet it is still around solely as a Win 10 selling point.
http://www.trustedreviews.com/news/AMD-Launches-First-DX11-Graphics-Cards
https://www.engadget.com/2009/06/03/amd-shows-off-worlds-first-directx-11-gpu/
https://blogs.windows.com/windowsexperience/2010/03/26/nvidias-first-directx-11-capable-gpus-coming-to-market/
Here is the DirectX 10 games list for this short-lived DirectX:
https://en.m.wikipedia.org/wiki/List_of_games_with_DirectX_10_support
Adoption rate of dx12 looks like it may be on track to match them. We’ll see in two more years if it’s still around. LOL
Soooo… TimeSPY is just an nvidia context switching benchmark program, not Async?!
That’s false advertising, and very disappointing. 🙁
http://www.overclock.net/t/1605674/computerbase-de-doom-vulkan-benchmarked/220#post_25351958
“3D Mark does not use the same type of Asynchronous compute found in all of the recent game titles. Instead.. 3D Mark appears to be specifically tailored so as to show nVIDIA GPUs in the best light possible. It makes use of Context Switches (good because Pascal has that improved pre-emption) as well as the Dynamic Load Balancing on Maxwell through the use of concurrent rather than parallel Asynchronous compute tasks. If parallelism was used then we would see Maxwell taking a performance hit under Time Fly as admitted by nVIDIA in their GTX 1080 white paper and as we have seen from AotS.
GCN can handle these tasks but performs even better when Parallelism is thrown in as seen in the Doom Vulkan results. How? By reducing the per Frame latency through the parallel executions of Graphics and Compute Tasks. A reduction in the per-frame latency means that each frame takes less time to execute and process. The net result is a higher frame rate. 3DMark lacks this. AotS makes use of both parallelism and concurrency… as does Doom with the new Vulkan patch.
If 3D Mark Time Fly had implemented a separate path and enabled both concurrency and parallelism for the FuryX… it would have caught up to the GTX 1070. No joke.
If both AMD and nVIDIA are running the same code then Pascal would either gain a tiny bit or even lose performance. This is why Bethesda did not enable the Asynchronous Compute + Graphics from the AMD path for Pascal. Instead… Pascal will get its own optimized path. They will also call it Asynchronous Compute… people will think it is the same thing when in reality… two completely different things are happening behind the scene.
See why understanding what is actually happening behind the scenes is important rather than just looking at numbers? Not all Asynchronous Compute implementations are equal. You would do well to take note of this.
Where are the tech journalists these days?”
Copied straight from Mahigan, I see. Yes, it doesn’t use the same way of doing async because Nvidia hardware is different from AMD’s and thus needed to be coded differently. Both versions are designed to run optimally on each vendor’s hardware, if I understood everything correctly from the entire Steam and Reddit posts by 3DMark. Nvidia does in fact do async compute, but not async compute + graphics, which is what the hardware async shaders in AMD cards handle.