Yes, We’re Writing About a Forum Post
What is asynchronous compute, and how is it interpreted?
Update – July 19th @ 7:15pm EDT: Well that was fast. Futuremark published their statement today. I haven't read it through yet, but there's no reason to hold off on linking it until I have.
Update 2 – July 20th @ 6:50pm EDT: We interviewed Jani Joki, Futuremark's Director of Engineering, on our YouTube page. The interview is embedded just below this update.
Original post below
The comments of a previous post notified us of an Overclock.net thread, whose author claims that 3DMark's implementation of asynchronous compute is designed to show NVIDIA in the best possible light. At the end of the linked post, they note that asynchronous compute is a general blanket term, and that we should better understand what is actually going on.
So, before we address the controversy, let's actually explain what asynchronous compute is. The main problem is that it really is a broad term. Asynchronous compute could describe any optimization that allows tasks to execute when it is most convenient, rather than blindly running them one after another.
I will use JavaScript as a metaphor. In this language, you can assign tasks to be executed asynchronously by passing functions as parameters. This allows events to execute code when it is convenient. JavaScript, however, is still only single-threaded (without Web Workers and newer technologies). It cannot run callbacks from multiple events simultaneously, even if you have an available core on your CPU. What it does, however, is allow the browser to manage its time better. Many events can be delayed until the browser has rendered the page, performed other high-priority tasks, or until the asynchronous code has everything it needs, such as assets loaded from the internet.
This is asynchronous computing.
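Sketched below is that single-threaded flavor of asynchrony as a toy example of my own, written in C++ rather than JavaScript so it matches the other code illustrations in this post: callbacks are queued and deferred, then run one at a time when it is convenient. Asynchronous, but never parallel.

#include <functional>
#include <iostream>
#include <queue>

// A toy single-threaded "event loop": work arrives as callbacks, sits in a
// queue, and runs only when the loop decides it is convenient. Nothing ever
// executes concurrently, which is the distinction the JavaScript analogy makes.
int main() {
    std::queue<std::function<void()>> pending;

    // Events register callbacks to run later.
    pending.push([] { std::cout << "asset finished loading\n"; });
    pending.push([] { std::cout << "timer fired\n"; });

    // High-priority work (think: rendering the page) happens first...
    std::cout << "rendering the page\n";

    // ...then the deferred callbacks are drained, one at a time.
    while (!pending.empty()) {
        pending.front()();
        pending.pop();
    }
    return 0;
}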
However, if JavaScript were designed differently, it could have run callbacks on any available thread, not just on the main thread when it becomes available. Again, JavaScript is not designed in this way, but this is where I pull the analogy back to AMD's Asynchronous Compute Engines. In an ideal situation, a graphics driver will be able to see all the functionality that a task will require and shove it down an at-work GPU, provided the specific resources that the task requires are not fully utilized by the existing work.
Read on to see how this is being implemented, and what the controversy is.
A simple example of this is performing memory transfers from the Direct Memory Access (DMA) queues while a shader or compute kernel is running. This is a trivial example, because I believe every Vulkan- or DirectX 12-supporting GPU can do it, even the mobile ones. NVIDIA, for instance, added this feature with CUDA 1.1 and the Tesla-based GeForce 9000 cards. It's discussed alongside other forms of asynchronous compute in DX12 and Vulkan programming talks, though.
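To make the DMA example a bit more concrete, here is a minimal Vulkan-flavored sketch of how an engine might locate a dedicated transfer queue family (the queues that map onto the copy/DMA engines) so uploads can be submitted alongside graphics or compute work sitting on another queue. This is my own illustration, not code from Futuremark or either GPU vendor, and the helper name pickTransferQueueFamily is hypothetical.

#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Returns the index of a transfer-only queue family (the DMA engines), or
// UINT32_MAX if the device does not expose one.
uint32_t pickTransferQueueFamily(VkPhysicalDevice gpu) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags flags = families[i].queueFlags;
        // A family that supports transfers but not graphics or compute maps
        // onto the copy hardware; submissions here can overlap with shader or
        // compute kernels running on the other queues.
        if ((flags & VK_QUEUE_TRANSFER_BIT) &&
            !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT))) {
            return i;
        }
    }
    return UINT32_MAX; // no dedicated copy queue; fall back to the graphics queue
}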
What AMD has been pushing, however, is the ability to cram compute and graphics workloads together. When a task uses the graphics ASICs of a GPU, along with maybe a little bit of the shader capacity, the graphics driver could increase overall performance by cramming a compute task into the rest of the shader cores. This has the potential to be very useful. When I talked with a console engineer at Epic Games last year, he gave me a rough, off-the-cuff estimate (made before bed at midnight on a weekday) that ~10-25% of the Xbox One's GPU is idling. This doesn't mean that asynchronous compute will give a 10-25% increase in performance on that console, just that, again as a ballpark figure, that much performance is left on the table.
I've been asking around to see how this figure will scale, be it with clock rate, shader count, or whatever else. No-one I've asked seems to know. It might be an increasing benefit going forward… or not. Today? All we have to go on are a few benchmarks and test cases.
The 3DMark Time Spy Issue
The accusation made on the forum post is that 3DMark's usage of asynchronous compute more closely fits NVIDIA's architecture than it does AMD's. Under DOOM and Ashes of the Singularity, the AMD Fury X performs better than the GTX 1070. Under 3DMark Time Spy, however, it performs worse than the GTX 1070. They also claim that Maxwell does not take a performance hit where it should if it were running code designed for AMD's use cases.
First, it is interesting that AMD's Fury X doesn't perform as well as the GTX 1070 in Time Spy. There could be many reasons for it. Futuremark might not have optimized for AMD as well as they could have, AMD could be in the process of updating their drivers, or NVIDIA could be in the process of updating their drivers for the other two games. We don't know. That said, if 3DMark could be better optimized for AMD, then they should obviously do it. I would be interested to see whether AMD brought up the issue with 3DMark pre-launch, and what their take is on the performance issue.
As for Maxwell not receiving a performance hit? I find that completely reasonable. A game developer will tend to avoid a performance-reducing feature on certain GPUs. It is not 3DMark's responsibility to intentionally enable a code path that would produce identical results, just with a performance impact. To be clear, the post didn't suggest that they should, but I want to underscore how benchmarks are made. All vendors submit their requests during the designated period, then the benchmark is worked on until it is finalized.
At the moment, 3DMark seems to contradict the other two examples that we have of asynchronous compute, with AMD showing lower performance than expected relative to NVIDIA. I would be curious to see what both graphics vendors, especially AMD as mentioned above, have to say about this issue.
As for which interpretation is better? Who knows. It seems like AMD's ability to increase the load on a GPU will be useful going forward, especially as GPUs get more complex, because the logic required for asynchronous compute doesn't seem like it would scale much in complexity along with them.
For today's GPUs? We'll need to keep watching and see how software evolves. Bulldozer was a clever architecture, too. Software didn't evolve in the way that AMD expected, making the redundancies they eliminated not as redundant as they had hoped. Unlike Bulldozer, asynchronous compute is being adopted, both on the PC and on the consoles. Again, we'll need to see statements from AMD, NVIDIA, and Futuremark before we can predict how current hardware will perform in future software, though.
Update @ 7:15pm: As stated at the top of the post, Futuremark released a statement right around the time I was publishing.
“Yes, We’re Writing About a Forum Post”
+1 for smile.
Keep up the good fight Scott! 🙂
The whole story about the GTX 970, which ended up showing that it had different specs from those advertised, started with a forum post and a simple CUDA program.
Yup. Also, before I got a job at PC Perspective, I would post on forums, too. Many good comments come out of them. That said, if you cannot find much information outside of forums, then it's a good idea to preface it. It could be wrong or too new.
Want to see where the software goes for PC? Look no further than Xbox one for your cues…
Basically Nvidia pipelines are too long to effectively handle parallel async because the latency is too high so 3dmark made a tailored version of async that only benefits nvidia. Am I getting this right?
no its just AMD fans looking for something to cry foul with. Not everything works perfectly on their cards, so they cry foul and claim bias.
lol…. right.
there…there… now. let them cried! Just go back to your dream world. 🙂
You guys in the tech press keep on making the same mistakes and fail to explain Async Compute properly. In DX12/Vulkan, it’s referred to as Multi-Engine.
An easy to understand video: https://www.youtube.com/watch?v=XOGIDMJThto
That entire video actually aligns with what I said, except the last slide that claims pre-emption is not asynchronous compute.
It's definitely not what AMD defines it as, and you could make an argument that it could be misleading to define pre-emption as asynchronous compute, but definitions can't inherently be wrong. You just need to be careful when explaining why you define something in the way you do. This is basically the point of the first 2/3rds of the post.
Why do the 970 and Fury X have 2 device contexts while the 1080 has 4, supposedly with async turned on?
On the async-off screenshots, both the 970 and Fury X have 1 while the 1080 has 2.
?
Asynchronous compute in hardware is achieved through preemption at the hardware level, context switching between processor threads on SMT (hardware-based) enabled processor cores. One processor thread that stalls waiting on a dependency is preempted by the hardware scheduler, and the other processor thread is context switched in after the stalled thread is context switched out. The processor’s execution pipelines are managed by the hardware to allow more than one processor thread to operate on a single physical core.
Then there is the software kind of preemption where the OS needs to preempt one software task/software thread/context and perform a higher priority task. All the modern OSs are preemptive multi-tasking OSs. Even on the application level there can be multiple software threads spawned by the parent task, and the individual software threads can have software/hardware Interrupt events delegated to the spawned threads. Windows uses an event driven OS/Application model as do other OSs. All the hardware drivers on a modern PC operate in an event driven mode by hardware and software interrupts.
The big argument is not about defining asynchronous compute itself (hardware kind/software kind), it’s about having that asynchronous compute (hardware kind) fully implemented in the GPU’s hardware down to the processor-core level of hardware management of the processors on a GPU. GPUs are big networks of parallel processors that are grouped into units that, in AMD’s case, have their core execution resources/pipelines managed by hardware scheduler/dispatch units, and are also managed as a group by the hardware ACE units and hardware schedulers. This requires more hardware to manage, but it makes for better utilization of core execution resources and lower-latency response to asynchronous events. On Polaris there is even instruction pre-fetch to further enable more efficient usage of the CUs’ execution resources, to try and stay ahead of things.
What is lacking in most GPU reviews is the same level of per-core evaluation of a GPU’s cores, and of the differences between the different GPU makers, as there is for the relatively few cores that the CPU makers have and the differences between the different CPU makers’ cores. Both AMD’s and Intel’s x86 cores are analyzed down to the smallest details, but for GPUs, reviewers are not going as deep into the GPU makers’ new cores as they continuously do for the CPU makers’ cores. Maybe there should be more single-GPU-core specialized benchmarks to sniff out any single-core deficiencies between the GPU makers’ cores, and between single units of many cores like CUs and SMPs. Hell, on CPU cores they count instruction decoders per core, reorder buffer sizes per core, ALUs, FP units, INT units, etc.
Most of the time, if a GPU’s core gets a deep-dive analysis, it’s the top-tier accelerator GPU/core that is analyzed, not the derived consumer version that may or may not have the exact same feature sets enabled. One thing that is certain this time around is that there is insufficient benchmarking software, and the entire gaming ecosystem is just beginning to switch over to the revolutionary changes that have just happened with the new graphics APIs. And the GPU makers are just now using some newer/smaller chip fabrication nodes.
For sure in time there will be plenty of graduate and post-graduate academic research papers on the new GPU micro-architectures for both AMD’s and Nvidia’s new GPU accelerator products and AMD will be reentering the GPU accelerator market with its Vega line of accelerators, and its new Zen/Greenland/Vega APU’s on an interposer module so there will be plenty of white papers and other research material to put the asynchronous compute argument to the test for both the GPU makers.
“It’s definitely not what AMD defines it as, and you could make an argument that it could be misleading”
What are the chances that what they define it as changes based on what works for THEIR hardware alone. I bet it wouldn’t be same if it works good anywhere else.
I like how you twist reality.
Let’s make it more clear for the others. You would prefer to break your keyboard before writing something not positive about Nvidia.
Nvidia’s optimizations: GameWorks, PhysX.
Results: Better performance in latest Nvidia series cards.
Much worse performance on competing cards. Questionable performance on older Nvidia series cards.
Example. Project Cars. Full of PhysX code. Dreadful performance on AMD cards. Good performance on Maxwell cards. Questionable on older Nvidia cards.
AMD’s optimizations: Better performance on GCN cards. No effect on performance on competing Nvidia cards.
Example. Doom. Excellent performance on AMD cards. Same excellent performance on Nvidia cards. NO performance loss on Nvidia cards, compared to OpenGL.
——————————————————–
One more thing to take in consideration. How companies use their exclusive techs or hardware advantages?
Nvidia: As tools to create a closed ecosystem keeping competition out. PhysX and GameWorks are closed and proprietary. PhysX and CUDA are DISABLED in case an AMD card is in the system. Nvidia forces its own customers to NOT use a combination of cards.
AMD: As tools to promote performance. Open, not closed. Vulkan – that is Mantle – runs on Nvidia cards and it is as optimized as Nvidia’s DX11. Before Vulkan, Mantle wasn’t disabled by the driver if an Nvidia card was present in the system. AMD doesn’t punish customers that are not absolutely loyal to the company.
One more example.
GSync. More expensive. Only compatible with Nvidia.
FreeSync. Open. Anyone can use it. Much cheaper.
Twisting reality isn’t really a twist when it’s the truth. Welcome to Real Life.
One more empty post that says in fact nothing.
John, please, stop spreading FUD. This post is full of bullshit.
Just to name a few: you refer to GameWorks as “optimizations.” Really? I thought they were added effects and simulations…
You mention unspecified “AMD’s optimizations” (but I think you are talking about asynchronous compute) as if they would not negatively impact non-GCN architectures, which is not true. Benchmarks speak for themselves, and I’m sure I can find a post where you say Maxwell or even Pascal suffers performance degradation once asynchronous compute is enabled.
Project Cars: Full of PhysX? None of the PhysX simulations being done are run on the GPU, or can you prove otherwise?
You argued that PhysX and GameWorks are closed while the source code is actually available on GitHub.
And no, CUDA is not disabled if there is an AMD GPU in the system; only PhysX acceleration (not the whole engine as you said, which would be a no-brainer) was, and I’m not sure if it still is.
If AMD is all that open and friendly, why did they not design Mantle inside the Khronos Group? And why did its vice president say in a tweet that with Mantle and consoles they tried to play a gambit against NVIDIA?
And you accuse others of twisting?
About GameWorks. Is it or not libraries for effects and simulations as you are saying? So if they are, do you believe they are just a pile of code that isn’t optimized for a specific architecture? You are full of BS if you say yes.
About AMD optimizations. Name ONE where Nvidia’s owners will have to suffer lower performance or lower visual quality. You will find NONE. Nvidia cards can run a game with or without async. They can run a game at DX12 or DX11 without any change in performance or visual quality. On the other hand, for example with PhysX, you had only one choice. Either lose visual quality, by choosing lower settings for physics effects or seeing ridiculous performance drops by choosing a high setting. Before you say anything, until GeForce 320 drivers you could enable PhysX alongside AMD GPUs with patch. Nvidia LOCKS PhysX.
About Project Cars. Just use google.
Nvidia opened some GameWorks libraries only recently. They don’t say if they will open any newer versions of GW libraries.
PhysX is locked on Nvidia GPUs. Installing an AMD GPU, or a USB monitor, will disable PhysX. USB monitors’ drivers are treated by Nvidia’s locking system as a competitor’s GPU in the system. PhysX could be selling millions of Nvidia graphics cards as PhysX co-processors. Nvidia instead chose to use PhysX as a way to create a closed ecosystem where its main competitor, AMD, would be locked out. On the other hand, Nvidia had the opportunity to optimize TressFX in less than a week. When TressFX came out Nvidia GPUs were having problems, but thanks to AMD’s open nature, Nvidia was able to produce a driver in less than a week, a driver that brought Nvidia GPUs up to the same level as AMD GPUs. Can you spot the difference?
As for Mantle, it was given to the Khronos Group. What the hell do you think Vulkan is? And Nvidia cards running Vulkan see NO performance or quality drops.
Try not to object to my avatar. Try to read my text without looking at my avatar at the same time. It seems that you look at my avatar and, instead of trying to understand what I am writing, try to find a reason to say “you are wrong” to that avatar.
A video made by whom? Does he have a pertinent technical background? You can find videos where it is said that man has never been on the moon, that an alien reptile race is governing us, and even some which confused a game’s ad campaign as proof of the existence of angels…
Mahigan is a well known AMD shill troll asskisser & a liar, amazing that AMD morons take his lies as gospel truth.
Even FM has said it’s not complete DX12.
If you have non-compliant hardware there is no other choice than to revert to the lowest common denominator, and in this case that happens to be Nvidia. You can implement the CPU advantages of DX12 but not the GPU ones, since you’re trying to be fair to both vendors.
That would be known as gimping the games/benchmarks down to placate Nvidia and holding gaming/hardware innovation hostage to Nvidia’s business model. This is a prime example of why there needs to be enough unhindered competition in the consumer GPU and CPU market.
True fairness would require that any benchmarking software be above reproach and if there are any indications of favoritism regarding any benchmarking maker’s product then things need to be looked at from a regulatory perspective, including any violations of rules/regulations/laws already on the books FTC/other agencies. Benchmarking software should properly test all of a GPUs hardware and make no deference to accommodation for any GPU maker’s lack of innovation or lack of innovative features.
Hell, I do hope that SoftBank takes the ARM Mali/Bifrost GPU micro-architecture and makes some laptop grade discrete mobile GPU offerings, and that Imagination Technology can find a backer(Apple/other) that could do the same for its PowerVR/PowerVR Wizard(Ray Tracing hardware Units) GPU designs.
Nvidia is just one big monopolistic product-segmenting interest that needs even more competitors. AMD is pouring all of its profits into R&D and innovation for its line of GPUs, including years of HBM R&D investments! And AMD still has more concern for retaining other features in its GPUs that make them good at other uses besides only gaming usage. Even Nvidia will get a net benefit from HBM/HBM2, which is the result of AMD and AMD’s HBM memory partner’s R&D efforts. VR gaming and AMD’s investment in asynchronous compute enabled GPU hardware is going to benefit that entire GPU market place, just look at what the mobile GPU makers are doing along with AMD at that HSA foundation, and with the Khronos Group for Vulkan’s development.
Oh look, someone whining that someone else in the internet is a shill and liar without any evidence whatsoever.
Surely you’re not doing the same thing as they are.
Everyone will just magically believe you now.
“Don’t toggle between compute and graphics on the same command queue more than absolutely necessary
This is still a heavyweight switch to make”
That tells us about NVIDIA's implementation of asynchronous compute: https://developer.nvidia.com/dx12-dos-and-donts
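For readers who have not run into that guidance before, here is a rough sketch of my own (not code from the linked NVIDIA document) of what avoiding that "heavyweight switch" looks like in D3D12: keep compute on its own queue and hand results over with a fence, rather than bouncing between compute and graphics on one direct queue.

#include <d3d12.h>

// Submit compute and graphics on separate queues; a fence hands the compute
// results over to the graphics queue. Command-list recording, pipeline state,
// and error handling are omitted from this sketch.
void SubmitSplit(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList*  graphicsWork,
                 ID3D12CommandList*  computeWork,
                 ID3D12Fence*        fence,
                 UINT64              fenceValue) {
    computeQueue->ExecuteCommandLists(1, &computeWork);
    computeQueue->Signal(fence, fenceValue);   // mark the compute work as done
    graphicsQueue->Wait(fence, fenceValue);    // graphics consumes the results
    graphicsQueue->ExecuteCommandLists(1, &graphicsWork);
}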
This comment is definitely not a character attack to cover up a complete lack of ability to disprove a claim.
This comment is definitely not an ad hominem fallacy.
This comment is definitely not a pile of hypocritical garbage written by someone too cowardly to stand by his accusations.
Anonymous (not verified) is definitely not an Nvidia shill troll asskisser & a liar.
Definitely.
Wow. I guess 3dmark really laid the smack down to AMD trolls. And with next day service too. LOL Gotta love the graphs showing compute packets in the queues.
Sure, that’s why Time Spy uses 21% hardware compute while Doom is at 40% and AoTS at 90%.
Petty little kid.
If AOTS is really 90% async then there would be negative performance on even AMD cards. AMD’s OpenGL drivers are so horrible; that is why they get a huge bump using Vulkan for Doom. The increase from async is still around 10%.
And still AMD has more of an advantage in Doom than in AoTS; something in your logic is flawed…
Also, what are those percentages about? 90% of what, rendering time? Hard to believe…
Rendering time. Hmm. Doesn’t favor AMD in the slightest with their ungodly number of shader cores compared to an Nvidia card.
What advantage in Doom? Last I checked the top spots were 1080 and 1070. Did that change? Doom is new. Definitely more popular than AOTS (saleable) so worthy of AMD optimizing for it. No surprises here.
Power consumption of AMD video cards increases dramatically under AOTS. Don’t know if it’s a DX12 phenomenon or limited to AOTS. It’s utilizing the card to its fullest, right along with its max TDP. The 390X went up 122 watts under DX12 with async compared to DX11. You can have your 10% “free” async performance for 122 extra watts. Wattage went up 58% from DX11.
http://www.tomshardware.co.uk/ashes-of-the-singularity-beta-async-compute-multi-adapter-power-consumption,review-33476-5.html
They did nothing but blow smoke and make excuses for why they chose not to make a best effort to optimize performance for different vendors’ GPUs. Neutrality in rendering does not work in DX12; cards have different capabilities.
If they had a benchmark that used some sort of effect that got a speed bump from conservative rasterization, it would be perfectly legitimate to add that effect in and let the Nvidia cards’ support for the feature boost performance while the AMD cards had to take slower methods to get the same effect.
Same goes for rendering a scene, if an amd card can handle a more complex method and mix of work concurrently, then the game or benchmark should not hold that back just because nvidia can’t complete the task in the same way. WTF is the point of holding back? We want to see what the cards can do, what EACH can do, not the lowest common denominator.
Scott, you seem to forget that Time Spy is a benchmark designed to test the DX12 graphics capability of video cards, not to measure async compute. The two examples we have are AMD-biased: AOTS, and Vulkan, which does not properly support Pascal and async on it. Basing things on Fury X performance in games that are tailored to its shader-heavy design does not all DX12 games make.
Geometry performance matters as well. Nvidia’s cards have that in loads. What they don’t have is a bunch of inefficient shader cores lying around waiting to be given work.
Then why do we need this test at all, if its main target is async? Which is combining 2 different kinds of work to utilize the GPU to its max.
And considering the results in games, you can’t help but compare them to the results in 3DMark. And something stinks, oh it stinks. The cards showing real performance with async are way behind.
It’s either that AMD still doesn’t have a driver or NV has too much $$$.
It’s interesting to see the WIDE difference between games and architectures/drivers.
Doom is currently 32% faster on an RX 480 than a GTX 1060 at 1440p.
But Battlefield 4 is 30% faster on a GTX 1060 than an RX 480.
The difference is MASSIVE.
And we also see that with many other games. Where in one the RX 480 crushes the 1060 by 20%, and in another the 1060 crushes the RX 480.
But here is something of BIG importance… thermal & power limits.
The reference RX 480 is hitting a wall with its dinky heatsink,
combined with its stock ‘overvolting’ preset.
It’s not an architecture problem.
When you get a faster benchmark result by lowering your clock speed, you know that you are NOT measuring core GPU architecture performance, but comparing card cooling and other factors.
To truly compare async on/off performance, we also need to look at sustained clock speed. Because it’s possible that with async ON, the core clock is dropping…
There is more to this analysis. Look at power/heat/boost clock
“Doom is currently 32% faster on an RX 480 than a GTX 1060 at 1440p. But Battlefield 4 is 30% faster on a GTX 1060 than an RX 480.
The difference is MASSIVE.”
Another thing that would be massive is a person being an idiot and buying a GTX 1060 to play at 1440p to start with. It’s a 1080p card at most.
These cards are nearly GTX 980 performance, a card which ran well at 1440p. The RX 480 maxes out Doom at 1440p (60fps).
I don’t see why people even bother with these benchmarks. They are always going to be skewed. How a game performs is going to be dependent on how much work the game or engine developers put in for each different architecture. Nvidia has a large installed base that will get targeted, but with completely unusable asynchronous compute abilities in previous generation Nvidia devices, I don’t see these getting much developer optimization. The 1070/1080 architectures will see some optimization, but developers may not take the time to optimize it as much as what 3DMark has done for this benchmark unless it is an Nvidia sponsored game. It comes down to look at what games you want to play, and see how they perform. For future games, AMD parts seem to age much better than Nvidia parts, although Nvidia has often had the lead at release time. This may change though, since games may be highly optimized at launch for AMD parts due to work on the console versions once we get engines developed from the ground up for DX12 parts.
With AMD parts, you are almost guaranteed to get very good optimization since developers need to do that optimization for the Xbox One and PS4 versions anyway. I am hoping AMD can get some Zen/Polaris APUs for laptops out in a reasonable amount of time since I need a new machine. These should perform very well for games; they will have almost the same architecture as the consoles. Also, once we get 14 nm AMD graphics against 14 nm Intel graphics, the AMD parts should leave Intel graphics even farther behind.
It looks like more engines are starting to add more advanced support for the feature sets AMD is able to take advantage of.
Nitrous/frostbite 3 (will be interesting to see battlefield 1 dx12)/idtech 6/glacier 2 (hitman)/deus ex mankind divided (dawn engine – modified glacier 2)
Some of the bigger holdouts seem to be the witcher series, hopefully they get a page one rewrite for their engine to be more modern for cyberpunk 2077, UE4 (nvidia focused) Though I find it interesting that many of the cross platform games are not using unreal and releasing their own engines, probably in part because unreal does not seem to give a damn about modifying the engine to make better use of gcn on consoles. This will hurt their adoption, because in the irony of ironies, console focused engines actually help amd going forward.
Let’s not even start on ubisoft with AC unity, they REALLY need to ditch that engine and just use glacier 2, same with any future batman game.
If AMD made cars, then they would also build special roads for those cars and then proclaim: if you drive our cars on our roads, you will be faster than the average car on an average road.
Whereas NVIDIA is like: buy our cars and you will be fast on any road.
O~O
I think the more accurate analogy would be, Nvidia builds special roads for their cars that make Nvidia’s cars go faster, but AMD’s cars are banned from driving on them, and any AMD car that tries gets its tires shredded.
Then someone figures out a way to let an AMD car and an Nvidia car drive together on the same road, so Nvidia remotely shuts off the engine of any Nvidia car driving next to an AMD car.
Then Nvidia takes over 80% of the gas stations in the country and starts offering a special gas that makes Nvidia cars run faster and makes AMD’s cars run like ass – oh, and that’s the only gas you can buy.
Then AMD builds their own roads that make their cars run a whole hell of a lot faster, and Nvidia cars can drive on them too and go faster, only the Nvidia drivers all bitch and moan and cry because AMD cars get a much bigger boost in speed, and they all dismiss the AMD roads offhand, claiming that the old roads are so much better.
Ya know, if accurate analogies are important to you or something.
Oh, also, don’t forget when Nvidia managed to rig some of the roads so that they would really hobble AMD cars, but barely slow down Nvidia cars, if at all. And when AMD figured out how to prevent it, all the Nvidia drivers cried and said they were cheating.
Say something against NVIDIA and nobody bats an eye.
Say something against AMD and everybody loses their minds.
^_^’
Let me put it this way:
Nvidia has always been better at getting the best out of their hardware by strongly optimizing their software for users and developers, and making it as user-friendly as possible from the beginning.
Now they are collecting the fruits of that.
Whereas AMD never reached the same level of software optimization; instead they leave the full work to be done by the game developers.
But I’m quite sure Nvidia will adopt hardware asynchronous computing in the next generation GPU, if they think it’s necessary.
It also would be nice if they supported Freesync, as it’s an open standard. I don’t think their proprietary implementation has a chance to survive in the future.
And of course by then there will be plenty of games with DirectX 12 Support, which will change everything.
At the moment, for most(!) users it’s just a nice-to-have, as they don’t even have any idea what it is.
I’ll bet you are quick to defend M$ and Comcast also! Nvidia is the big GPU interest and people see how Nvidia overcharges for their hardware and gimps the SLI on the GTX 1060! And to get at GP100’s finer instruction granularity in software requires CUDA, but who knows if any of GP100’s improvements are in GP104 or GP106. Nvidia is sure not going to be able to gimp things for any GP100 based HPC/Server SKUs and Nvidia better have some OpenCL compatibility.
No one likes a monopoly interest that further segments its consumer product offerings so Nvidia’s bad karma has come back in the form of Vulkan and DX12 giving the games developers the ability to manage the GPU’s hardware resources in a close to the metal fashion. Nvidia is the control freak of the GPU world, and the great product segmentation specialist and that’s what has caused many to express their dislike of Nvidia in many online forums. It will be very good for the entire GPU market in general, including Nvidia’s users, if AMD takes more market share as that will force Nvidia to invest more in R&D and get off of its ego trip and get that asynchronous compute fully implemented into its GPU’s hardware, and that includes Nvidia’s consumer SKUs.
Truth is the truth no matter what direction you shine the light on it. AMD does seem to get a pass from people over and over, whereas when Nvidia does something wrong people want heads to roll.
Go work public relations for Comcast, Intel, and M$, And the green goblin gimpers at Nvidia! you love your big monopolies! Nvidia is a GPU monopoly that works with its willing game partners to use software/middleware to lock out any fair competition!
No one liked Ma Bell or the Standard Oil Trust back in the day either, and folks see how Nvidia segments its product lines to milk the many fools like yourself for excess profits. Those dual RX 480s and some Vulkan will Doom Nvidia’s attempts at cornering the GPU market!
“Gimp”, the most beloved word by AMD fans…
I don’t think anyone can claim NVIDIA doesn’t spend, or invests little, in R&D, especially looking at their expense reports.
Just a question: who is lagging behind in product performance? NVIDIA? Maybe Intel? Or maybe AMD?
What there needs to be is a benchmark that can measure underutilized execution resources on a GPU, if there is a way to accurately measure/benchmark any idle GPU execution resources while there is still work backed up in the queues.
There is two types of asynchronous compute, the hardware kind and the software kind. Just look at Intel’s version of SMT, HyperThreading(TM), and see the hardware version of asynchronous compute in action on Intel’s CPU cores and see why Intel’s CPU cores get the extra IPC boost and extra per core execution resources utilization that would go to waste had there not been SMT hardware to dispatch/schedule/and preempt two or more hardware processor threads running on/sharing the same CPU core’s execution ports/pipelines.
There is no way in hell that Intel’s, or any other SMT based CPU core before Intel’s adopting of SMT, could manage a CPU’s execution ports/pipelines in software, as software is not fast enough to react to the changing states inside a CPU’s or any other processor’s execution pipelines.
GPU’s are no different from CPUs in this respect for hardware based SMT like/asynchronous compute scheduling of multiple processor threads on a single core unit(Instruction decoding/instruction scheduling/dispatching of hardware processor threads, etc.) So no software scheduling of a processor’s core execution resources among shared hardware processors threads is ever going to be able to keep the processor’s core execution pipelines utilized in a fast enough manner to avoid the need for execution wasting pipeline bubbles(NOPs) to be inserted if one processor’s thread stalls and another needs to be quickly started up to make use of the processor’s execution pipelines. These execution pipelines are operating at a faster state change than even a single op code instruction takes to be fetched/Decoded and scheduled/dispatched on a processor’s execution pipelines(FP, INT, etc.)
Sure, software can manage to a degree the scheduling of deterministic workloads on a processor’s core, but no software solution can take up the slack if a non-deterministic asynchronous event occurs that requires an immediate stopping of work on the current processor thread’s workload and the context switching and scheduling/dispatching of another thread’s execution on the same processor core. A lot of the hardware-based asynchronous compute on a processor core happens below the single-instruction time interval, and that requires specialized in-hardware asynchronous compute units/engines. Any single-core processor without any SMT ability in hardware is not going to be able to make as efficient a utilization of the processor’s execution ports/pipelines, and there will be plenty more NOPs/pipeline bubbles if the single processor thread stalls, or that single processor thread is preempted and needs to be context switched out and another higher-priority thread needs to be context switched in and worked on.
So what is needed is benchmarking software that can count the pipeline bubbles and that is a very tall order to achieve short of having access to the unpublished Instructions on any processor that the processor maker has to test the cores on its products. Processor have been built all along with undocumented instruction to allow for that processor maker to single step through not only the single instruction, but also look at the pipeline stages etc. There are ways to infer some things with specially crafted processor assembly code from a processors compiler optimization manual to measure some efficiencies, but short of having some proprietary information it is very hard to get any deeper inside some of the testing mode instructions that all processor makers build into their processors.
Nvidia is very good at managing in software its GPUs execution, but there is always some more idle execution resources as the cost for managing hardware core execution resources with software, especially when there are lots of non deterministic events happening and no readily available cores to schedule the work to, things get queued up fast. There is also the inherent underutilization of core execution resources because software is never going to be fast enough to manage multiple processor threads on a single core and keep the pipeline bubbles to a minimum. Those processor pipelines need to be managed by hardware based asynchronous compute units.
It’s a bit odd that the article is using Javascript as analogy – I guess Scott is a web developer?
But anyway, the whole “conspiracy” is completely overblown. Futuremark never claimed that Time Spy benchmark is an async compute benchmark. It’s just a dx12 benchmark that uses all dx12 features which happens to include async compute.
If there are few to no compute tasks to parallelize, or if it’s bottlenecked by the graphics workload (which it typically is, on synthetic benchmarks), then obviously the benefit gained from going async is going to be smaller.
Yes, JavaScript/other scripting is about as far away as one can get from the real metal on any processor (CPU, GPU, other).
It’s too bad that Anand Lal Shimpi is no longer writing for AnandTech. That would be a great review for a publication outside of a paywall. Just wait until AMD gets back into the HPC/server accelerator market; then there will be some really good benchmarks to test things really thoroughly.
hey scott, this is pretty weak analysis. can you interpret gpuview and explain it in laymans terms?
Thanks for your time and the article Scott Michaud.
Another important distinction to make is between Asynchronous Compute and Asynchronous Shading. Asynchronous Shading is one way to IMPLEMENT Asynchronous Compute, but it is not the ONLY way. You can quite happily schedule calls at the driver level to pack them efficiently for execution, just as you can implement that scheduling in hardware. Unless you’re hitting a CPU bottleneck (and when has THAT happened recently?) both GPUs will be executing equally well-packed instruction calls. Doing the rescheduling in software has the advantage that you can apply rescheduling to workloads that are not explicitly designated, and that you can change scheduling criteria if required (e.g. some engine does something weird with its dispatch) without a hardware revision. The downside is all the coding effort to get the software scheduling to work, and the potential for CPU overhead.
Tell Intel to schedule its CPU processor threads in driver software, and see how Intel’s version of SMT, HyperThreading(TM), would NOT be able to keep its CPU core’s instruction execution pipelines properly and efficiently utilized. And a shader core with a fully in hardware based version of asynchronous compute on the shader is no different from a CPU in this respect.
There is no way that Intel’s HyperThreading(TM) could be adequately managed by software, when on Intel’s CPU cores the pipeline states change faster than any single software instruction(let alone the many instructions in most driver code to manage the most simple functionality) could manage any quickly changing hardware asynchronous events like a pipeline stall, and the processor thread context switch that has to be managed as quickly as possible or the execution pipelines would sit idle executing NOPs(No Op instructions). Those execution pipeline bubbles(NOPs) would be very numerous if Intel did not manage its CPU core’s asynchronous compute fully in its CPU’s hardware.
“Tell Intel to schedule its CPU processor threads in driver software, and see how Intel’s version of SMT, HyperThreading(TM), would NOT be able to keep its CPU core’s instruction execution pipelines properly and efficiently utilized. ”
Uh, Hyperthreading does indeed rely on software scheduling of jobs. Without it, it does nothing of worth (and if your OS handles things badly, even makes performance worse as seen with the Bulldozer scheduling issue). Instruction Level Parallelism is done in hardware, but it is done in hardware on GPUs of every architecture too.
Hyperthreading exposes a single physical core as two logical cores. But in order to use those two logical cores, you need to feed it two logical threads, and the scheduling for that is done in software, not in hardware. If you feed a hyperthreading core a single thread, it will be underutilised, because it can only do so much at the instruction level to parallelise workloads without ending up with large parts of the core sitting and spinning, waiting for the rest of the thread to catch up.
The entire POINT of Hyperthreading is to allow another SOFTWARE SCHEDULED thread to be pointed at that core to utilise those parts of the core that are UNDERFED with Instruction Level Parallelism alone!
“Without it, it does nothing of worth (and if your OS handles things badly, even makes performance worse as seen with the Bulldozer scheduling issue).”
To clarify, I mean that without software scheduling accounting for it you can have issues with actually achieving theoretical efficiencies. With Bulldozer, the issue was not SMP but the heterogeneous distribution of FPUs, where assigning two threads to two core sharing an FPU was less performant than assigning each of those threads to two cores not sharing an FPU.
For SMP, when NOT in a saturated condition it would be more performant to assign threads to separate physical cores, and only start assigning to logical cores sharing physical cores once this is saturated. Scheduling for power efficiency would be the opposite condition.
Your hair splitting won’t work SMT/HyperThreading(TM) is fully on the hardware level on Intel’s CPUs just look at the diagrams that Intel provides for its version of SMT. The hardware scheduling on Intel’s version of SMT is done with specialized fully in the core’s hardware schedulers/dispatchers and the processor threads are fully managed by the hardware. You are confusing the OS/Software kind of threads with the hardware kind of threads that are not “software” based and operate at the sub instruction level on Intel’s and others’ CPU cores. Software threads can be infinite in numbers to a reasonable degree, but processor threads in the hardware are limited by the hardware on Intel’s consumer SKUs to 2 logical processor threads per physical core.
Intel’s SMT/HyperThreading(TM) operates on the decoded instructions below the native assembly level of the x86 (32/64-bit) ISA as it is implemented on Intel’s microprocessor cores. You cannot wrap your mind around the difference between the software-thread and processor-thread abstractions in computing. You need to be reading some CPU deep-dive primers on the hardware concepts surrounding SMT (Simultaneous MultiThreading) as it is done in hardware.
The OS only tasks the processor with a Software Thread(single stream of native code instructions per logical core) but the OS does not manage on the core that stream of instructions. That management of the software threads after they are tasked by the OS is the job of the hardware scheduler/dispatcher in the processor’s core/s and the OS is not aware of any of that level of work happening on the CPU’s core/s. In fact on Intel’s SMT enabled SKUs the OS does not even Know that the two logical cores are in fact being hardware Simultaneously Multi-Threaded on a single CPU core, the threading being done fully in the CPU’s core is only known to the hardware on that CPU and the OS is none the wiser. You do know that on Intel’s CPU cores that are also SMT enabled that some instructions can be scheduled and executed out of the logical order in which that came into the logical core and dispatch/scheduled by the hardware units to keep the execution pipelines occupied and utilized at as close to 100% utilization as is possible, the OS/software side never sees this hardware side or plays a role in the management of the out of order execution on Intel’s SMT or OOO enabled cores.
The OS only passes the address in memory of the first instruction, or pushes the address onto the top of the stack, etc. for the CPU to get to work and the CPU core takes it from there. The CPU core does all the fetching, decoding and instruction scheduling/dispatching, instruction reordering for out of order execution, and branching/branch predicting, and speculative execution. And that work includes the hardware asynchronous processor thread management to work the two logical core processor threads on a single processor core.
Furthermore the OS itself, and the drivers, and the applications, and APIs are all made up of native code that is running on the CPU/cores, so how would any software made up of single assembly language instructions be able to manage any hardware that operates on instructions at the sub assembly language level on execution units and execution pipelines that change states faster that even the system clock. You do know that on Intel’s/Others CPU cores there are specialized hardware units that operate at a clock multiplied rate to the main core clock in order to manage the execution pipelines, and other scheduler/dispatcher units that need to manage the execution pipelines state changes and stay ahead of the instruction streams to keep the CPU core’s execution pipelines efficiently utilized. The same rules apply for a GPU’s many more cores that need to manage their asynchronous compute fully in the GPU’s hardware on GPUs with the hardware that enables such work(ACE units/Hardware schedulers and hardware asynchronous shaders, etc.)
The OS passes the work to the CPU/Processor but the OS runs on the CPU/Processor so the OS can not run without the help of the processor, the processor runs the OS’s code, and the processor even runs the code before the OS itself is even up running on the CPU’s/Processor’s cores. You need to bootstrap your mind to load up a proper understanding of just what the difference is between processor “threads” and software “threads” are on any type of processor with hardware asynchronous compute abilities.
“You need to be reading some CPU deep dive primers into the hardware concepts surrounding SMT(Simultaneous MultiThreading) as it is done in hardware. ”
Ironic, as you appear to have missed a fundamental difference between SMT and GPU threading: a SMT CPU is processing two threads using a single core. A GPU is feeding many cores from a single thread. In addition, the way jobs are partitioned for CPUs and GPUs is radically different.
As I stated before, both Intel’s CPUs and GPUs, AMD’s CPUs and GPUs, and Nvidia’s GPUs ALL implement Instruction Level Parallelism in hardware (because that is the only way to feasibly do so). It is job-level parallelism that is the issue with Asynchronous Compute, not instruction level.
The linked futuremark article is extremely interesting, a definite recommendation! Thanks Scott for addressing this topic. The discussion is far from finished, but we need solid reporting from guys like you to arrive at the truth.
Looking forward to round 2 when both teams have new drivers out.
PS: Fucking teams everywhere.
– Incomplete coffee addict.
Just call it what it is, an Nvidia-only benchmark.
You know what’s going to be amazing at AC? GTX 2080 and RX590. LOL
Futuremark’s response looks like a typical political spin for damage control.
You read an in-depth technical discussion about the how/what/why of what they do, along with confirmation that all GPU vendors (AMD included) are involved in the process, as 'political spin'?
I have no problem with someone or a company who chooses to do something their own way (in this case, how AC is implemented), but the timing of their statement and the wording of it reeks of them trying not to be seen as playing favorites. That IS political spin for damage control.