It’s Basically a Function Call for GPUs
… and what do they mean for actual performance?
Mantle, Vulkan, and DirectX 12 all claim to reduce overhead and provide a staggering increase in “draw calls”. As mentioned in the previous editorial, the way a graphics card is loaded with tasks changes drastically in these new APIs. With DirectX 10 and earlier, applications would assign attributes to (what they are told is) the global state of the graphics card. After everything is configured and bound, one of a few “draw” functions is called, which queues the task in the graphics driver as a “draw call”.
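To make that concrete, here is a minimal toy sketch of the old pattern. This is not real DirectX code, and every name in it is invented for illustration; the point is simply that there is one global state, and each “draw” snapshots whatever happens to be bound at that moment and hands it to the driver as a queued task.

```cpp
// Toy model of the bind-then-draw pattern (not an actual graphics API).
#include <cstdio>
#include <string>
#include <vector>

struct GlobalState {            // the one state the API exposes
    std::string vertexBuffer;
    std::string texture;
    std::string shader;
};

struct DrawCall { GlobalState snapshot; int vertexCount; };

static GlobalState g_state;                 // single, global state
static std::vector<DrawCall> g_driverQueue; // what the driver accumulates

void BindVertexBuffer(const std::string& vb) { g_state.vertexBuffer = vb; }
void BindTexture(const std::string& tex)     { g_state.texture = tex; }
void BindShader(const std::string& sh)       { g_state.shader = sh; }

// "Draw" captures whatever is currently bound and queues it as one task.
void Draw(int vertexCount) { g_driverQueue.push_back({g_state, vertexCount}); }

int main() {
    BindShader("basic_lit");
    BindVertexBuffer("tree_mesh");
    BindTexture("bark_albedo");
    Draw(3000);                     // one draw call per configured state

    BindVertexBuffer("rock_mesh");  // reconfigure, then draw again
    BindTexture("rock_albedo");
    Draw(1200);

    std::printf("%zu draw calls queued\n", g_driverQueue.size());
}
```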
While this suggests that just a single graphics device is to be defined, which we also mentioned in the previous article, it also implies that one thread needs to be the authority. This limitation was known for a while, and it contributed to the meme that consoles can squeeze out all the performance they have, while PCs are “too high level” for that. Microsoft tried to combat this with “Deferred Contexts” in DirectX 11. The feature allows virtual, shadow states to be built up from secondary threads, which can then be appended to the global state, whole. It was a compromise between letting each thread create its own commands and the legacy decision of a single, global state for the GPU.
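A rough sketch of that compromise, again as a toy model rather than the actual Direct3D 11 interfaces (the function names only loosely mirror the real ones): worker threads record their own command lists, but everything still funnels, whole lists at a time, through the single immediate queue.

```cpp
// Toy model of DX11-style deferred contexts (not the real API).
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

using Command = std::string;             // stand-in for a recorded state+draw
using CommandList = std::vector<Command>;

static std::vector<Command> g_immediateQueue;   // the one "real" queue
static std::mutex g_submitMutex;

// A deferred context only records; nothing reaches the GPU yet.
CommandList RecordScenePart(const std::string& name, int draws) {
    CommandList list;
    for (int i = 0; i < draws; ++i)
        list.push_back(name + "_draw_" + std::to_string(i));
    return list;
}

// Rough stand-in for executing a finished command list: splice the recorded
// commands into the immediate stream as one unit.
void ExecuteCommandList(const CommandList& list) {
    std::lock_guard<std::mutex> lock(g_submitMutex);
    g_immediateQueue.insert(g_immediateQueue.end(), list.begin(), list.end());
}

int main() {
    CommandList partA, partB;
    std::thread t1([&] { partA = RecordScenePart("terrain", 3); });
    std::thread t2([&] { partB = RecordScenePart("characters", 2); });
    t1.join(); t2.join();

    ExecuteCommandList(partA);   // still serialized through one authority
    ExecuteCommandList(partB);
    std::printf("%zu commands in the immediate queue\n", g_immediateQueue.size());
}
```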
Some developers experienced gains, while others lost a bit. It didn't live up to expectations.
The paradigm used to load graphics cards is the problem. It doesn't make sense anymore. A developer might not want to draw a primitive with every poke of the GPU. At times, they might want to shove a workload of simple linear algebra through it, while other requests could simply be pushing memory around to set up a later task (or to read the result of a previous one). More importantly, any thread could want to do this to any graphics device.
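Sketched as another toy model (not actual Vulkan or DirectX 12 code, with invented names throughout), that flexibility looks something like this: any thread records a command buffer holding whatever mix of draws, compute dispatches, and copies it needs, then hands it to whichever queue it chooses.

```cpp
// Toy model of the explicit, queue-based submission the new APIs move toward.
#include <cstdio>
#include <string>
#include <vector>

enum class CmdType { Draw, Dispatch, Copy };
struct Command { CmdType type; std::string detail; };
using CommandBuffer = std::vector<Command>;

struct Queue {                 // e.g. a graphics queue or an async compute queue
    std::string name;
    std::vector<CommandBuffer> submitted;
    void Submit(CommandBuffer cb) { submitted.push_back(std::move(cb)); }
};

int main() {
    Queue graphics{"graphics"};
    Queue compute{"compute"};

    CommandBuffer frame;                        // recorded by any thread
    frame.push_back({CmdType::Copy,     "upload instance transforms"});
    frame.push_back({CmdType::Dispatch, "cull instances on the GPU"});
    frame.push_back({CmdType::Draw,     "draw the surviving instances"});
    graphics.Submit(frame);

    CommandBuffer async;                        // independent work, another queue
    async.push_back({CmdType::Dispatch, "simulate particles"});
    compute.Submit(async);

    std::printf("graphics: %zu submission(s), compute: %zu submission(s)\n",
                graphics.submitted.size(), compute.submitted.size());
}
```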
The new graphics APIs allow developers to submit their tasks more quickly and intelligently, and they allow the drivers to schedule compatible tasks better, even simultaneously. In fact, the driver's job has been massively simplified altogether. When we tested 3DMark back in March, two interesting things were revealed:
- Both AMD and NVIDIA are only a two-digit percentage apart in draw call performance
- Both AMD and NVIDIA saw an order of magnitude increase in draw calls
Read on to see what this means for games and game development.
The number of simple draw calls that a graphics card can process in a second does not have a strong effect on overall performance. If the number of draw calls in the DirectX 12 results is modeled as a latency, which is not the best way to look at it but helps illustrate a point, then a 10% performance difference is about five nanoseconds (per task). That amount of time is probably small compared to how long the actual workload takes to process. In multi-threaded DirectX 11, NVIDIA held a lead over AMD of about 162% more calls. That nearly three-fold advantage in draws, a precious resource in DirectX 11, evaporated in DirectX 12. In fact, it was AMD who held about a 23% lead in that API, although DX12 calls are far more plentiful than they were in DX11. Are draw calls no longer a bottleneck in DirectX 12, though? We'll see.
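As a sanity check on that five-nanosecond figure, here is the back-of-the-envelope math. The 18 million calls per second used below is only a ballpark taken from the API Overhead results discussed here, not a precise measurement.

```cpp
// Treat draw-call throughput as a per-call latency and see what a 10% gap means.
#include <cstdio>

int main() {
    const double callsPerSecond = 18e6;                 // ballpark DX12 result
    const double nsPerCall      = 1e9 / callsPerSecond; // ~55.6 ns per call
    const double tenPercentGap  = nsPerCall * 0.10;     // ~5.6 ns difference

    std::printf("per-call cost: %.1f ns, 10%% gap: %.1f ns\n",
                nsPerCall, tenPercentGap);
}
```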
If they're able to see the whole level, that's ~9000 draw calls.
Many can be instanced together, but that's added effort and increases GPU load.
This brings us to the second point: both vendors saw an order of magnitude increase in draw calls. When this happens, developers can justify solving their problems with smaller, more naive tasks. This could either save real development time that would otherwise be spent on optimization (if DX11 can be ignored), or it may allow a whole new bracket of cosmetic effects for compatible systems. This is up to individual developers, and it depends on how much real-world relief it brings.
A couple of months ago, I talked to a “AAA” game developer about this. He was on the business side, so I focused the conversation on how the new APIs would affect corporate structure.
I asked whether this draw call increase would trickle into the art department and asset creation. Specifically, I inquired whether the reduced overhead would allow games to be made on smaller art budgets, and/or permit larger games on the same budget. Hypothetically, due to the decrease in person-hours required to optimize (or sometimes outright fake) complex scenes, the artists would spend less time on the handful of difficult assets that require, for instance, multiple materials or duplications of skeletal meshes, each of which is often a separate draw call. For instance, rather than spawning a flock of individual birds, an artist could create a complex skeleton animation for the entire flock to get it in one draw call. This takes more time to create, and it consumes extra GPU resources to store and animate that hack, which means you will probably need to spend even more time elsewhere to pay that debt.
A nine-bone skeleton even looks like a terrible way to animate three book-shaped birds.
But… it's one draw call.
This apparently wasn't something that the representative had thought much about but, as he pondered it for a few moments, he said that he could see it leading to more content within the same art budget. The hesitation surprised me a bit, but that could have just been the newness of the question itself. If my hypothesis were true, I would have expected it to have already influenced human resource decisions, which wouldn't require time to reflect upon.
But other studios might be thinking of it.
Ubisoft's CEO mentioned in an investor call that Assassin's Creed: Unity was the product of redoing their entire engine. Graphics vendors state that amazing PC developers should be able to push about 10,000 to 20,000 draw calls per frame with comfortable performance. This Assassin's Creed, on the other hand, was rumored to be pushing upwards of 50,000 at some points, and some blame its performance issues on that. It makes me wonder how much changed, company-wide, for an instantaneous jump to that many draw calls to have happened.
Ubisoft took the plunge.
We might not see the true benefit of these new APIs until they grow in popularity. They have the potential to simplify driver and game development, which the PC genuinely needs. Modern GPUs operate much closer to the paradigm of GPU compute APIs, with some graphics functionality added, than to the 1990s versions of DirectX and OpenGL. Trying to shoe-horn them into the way we used to interface with them limits them, and it limits the way we develop content for them.
This (mostly) isn't free performance, but it frees performance the more it influences development.
In the UE3 days, we had a process where we would need to take all the level meshes (for example, all the trees) and merge them into one in order to save on draw calls. This took a considerable amount of time, so I can imagine the possible gains in terms of having more time to polish the assets rather than spending it on such tedious optimizations.
You can do that in UE4 as well, and it is extremely useful for mobile gaming. What it essentially does is take all the meshes you have selected and all of their textures, and turn them into one set of images and one mesh structure that can be sent to the GPU together. There are issues with fidelity, as the texture can only be so big, but it can be quite useful.
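For anyone curious, the core idea is roughly the following. This is not the actual UE4 tool, just a toy sketch of merging meshes and remapping their UVs into one shared atlas; it also shows why fidelity suffers, since each original texture now only gets a tile of the atlas.

```cpp
// Toy mesh/texture-atlas merge: many meshes become one mesh and one atlas.
#include <cstdio>
#include <vector>

struct Vertex { float x, y, z, u, v; };
using Mesh = std::vector<Vertex>;

// Concatenate the meshes, remapping each one's 0..1 UVs into its own tile of
// a gridSize x gridSize atlas.
Mesh MergeIntoAtlas(const std::vector<Mesh>& meshes, int gridSize) {
    Mesh merged;
    for (size_t i = 0; i < meshes.size(); ++i) {
        const float tileX = static_cast<float>(i % gridSize);
        const float tileY = static_cast<float>(i / gridSize);
        for (Vertex v : meshes[i]) {
            v.u = (v.u + tileX) / gridSize;   // shrink and offset into the tile
            v.v = (v.v + tileY) / gridSize;
            merged.push_back(v);
        }
    }
    return merged;                            // one mesh, one atlas, one draw
}

int main() {
    std::vector<Mesh> trees(4, Mesh{{0, 0, 0, 0.0f, 0.0f}, {1, 0, 0, 1.0f, 0.0f}});
    Mesh combined = MergeIntoAtlas(trees, 2); // a 2x2 atlas holds four meshes
    std::printf("merged %zu vertices into one draw call\n", combined.size());
}
```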
Not to mention that it increases GPU load and memory usage, especially if a fraction of them would have been culled. That means you'll need to optimize elsewhere to pay for the trade-off your first optimization required.
Ryan, it might be time to retest the draw calls to see if they are faster in the final Win10 build. On my GTX 980 I was getting 18 million, while some results I've seen posted had 19 million.
One of the big benefits of these new APIs is less CPU usage for properly made engines. This will mean laptops get better performance from their resources, making the biggest consumer PC product category much better suited to gaming.
I would be very curious to see the benchmark numbers for some of these laptops you have been reviewing with Win10, with DX12 and DX11 comparisons. I think the numbers we see there might actually show a larger difference, which is frankly exciting.
Sadly there are no games that use DX12 yet, just the API overhead tool.
True, but the only numbers we have comparisons for are from the demo tool anyway, so it would be useful for that purpose. Also, we will start seeing games available next month, such as the beta of Fable Online, that support DX12.
UE4 has alpha support for DX12 in its 4.9 release, which will allow games to start shipping with DX12 in weeks. There are already some demo scenes that have been released with the alpha version of UE 4.9.
Also, the code samples were released (IIRC on July 29th). This will allow small, hand-coded apps, too.
Pair Vulkan, and the other newer APIs, with the latest ACEs on AMD's GPUs, and Nvidia's equivalent, and even more can be done on the GPU's cores without any CPU intervention. So expect more draw calls to ultimately be under the control of the ACEs, and things will become drastically more efficient for gaming. If AMD's current patent filings, as well as announcements of future HPC offerings that can/will be incorporated into consumer products, are any indication, expect an FPGA on the HBM stacks, between the logic chip and the memory die stacks, to assist the GPU with gaming workloads: things such as decoding/compression, or even ray-tracing enhancements for natural lighting effects, and other related processing right in the HBM memory stacks, offloading work from the CPU and GPU to the FPGA. That includes adding any new Vulkan, DirectX, or other API enhancements into the FPGA's logic with an update. An AMD APU dedicated to gaming/graphics appears very likely, with the whole system laid out on an interposer, including FPGAs programmed to run some tasks right inside the HBM stacks themselves, for even more realistic gaming at even lower latency. Whole gaming systems on an interposer are not too far off from becoming reality, and AMD's HSA-aware APUs on an interposer are going to offer some serious gaming workload power distributed across all of the processing devices on the interposer. I see the motherboard CPU, if it cannot become more like a system on an interposer, or an APU on an interposer, being cut out of any future computing simply because of the drastic improvements that HBM, and other processing power on the interposer, will bring to the home user.
SIGGRAPH is going to become a very interesting event to cover once all of the HSA software begins to come online, especially for graphics and gaming on the future full interposer based systems.
Your post is difficult to read due to the run-on sentences and such. FPGAs are going to be playing a larger part in computing in general. With smaller processes, we can include more and more functionality on-die. You generally get a free GPU with any CPU because of this. FPGAs can be reprogrammed to run some tasks much more efficiently than general-purpose hardware. They can be programmed to handle new media codecs easily, along with many other specialized compute tasks. It makes a lot of sense to include these in the package with an APU.
It is unclear what AMD will be using an FPGA base die for. It could be programmed to carry out some tasks. Initially, they may just want to use the FPGA's reconfigurable routing to route around defective micro-bumps. It would also allow memory compression techniques to be moved off the GPU and onto the base logic die of the HBM stack. The current base logic die may not have much functionality beyond the memory interface, which may not use up all of the die space available. The die needs to be as big as the memory die. You may be too optimistic about what such a setup will be used for. With silicon interposers, a lot of interesting designs are possible. They all require significant design effort, though.
I never really looked much into AMD's (and Intel's with their purchase of Altera) FPGA work, but I always assumed they would just make an OpenCL library for it and put it in enterprise parts, so those customers could write ASIC-like offloaded functions.
This is what Altera has been doing with their add-in boards.
The consumer market is a bit more difficult, because the multi-hour bake time is bad UX. That said, we are getting used to several-hour download times. As long as it's something that games will not need forked versions for, it could work. Seriously doubt it, though.
If you could bake some newer Vulkan or DX12 functionality into an on-board FPGA, that few-hour bake time would be worth it. It's not like an everyday usage thing, and the new FPGA logic to be baked in for the consumer market would most likely come from the device's OEM, the same as BIOS updates come from the graphics card maker. So gaming APUs would ship with whatever graphics API functionality was available and included in the ASIC hardware at the time, with any newer API functionality update-able onto the FPGA by the device's OEM. Whoever said this was to be an everyday part of the UX? The FPGA programming would be more like a rarely performed update.
I would think in the consumer market that this would allow for new features to be added. But I doubt those would be from games. ASICs are made to accelerate specific workloads, not varied workloads.
I can see one area a consumer might use this, however: video codecs. Imagine installing a VP9/10 or Daala hardware encoder/decoder by downloading a program. This is important for the NETVC spec that is being worked on, where an actual finished spec isn't expected for more than a year; this would allow AMD to accelerate streaming web video on these APUs.
I find this more likely than a feature for DX12.1 or Vulkan 1.1 or whatever. I could be wrong.
Now, a standard OpenCL game-specific library I could see possibly being used, but games are really too different, even between games that use the same engine, to really accelerate much of anything on an ASIC of the size they use.
The FPGAs are not for controlling the HBM, as that is the job of the logic chip at the bottom of the stack! The FPGAs are there for performing tasks via their programmable logic, and one example of possible usage is processing, in programmable hardware, some newer graphics API functionality (programmed into the FPGA logic) that the GPU ASIC logic does not have.
The AMD APU/HBM/FPGA patent application has FPGAs located between the nominal bottom HBM logic die and the memory die stacks above. The FPGA is for distributed computing: in addition to the GPU and the CPU, there are FPGAs that can work from the memory above them on the die stack and perform processing on that data or code. This is the intended usage of the FPGA on the memory die stack, and the FPGA is included along with the other normal HBM components. The FPGAs (one per HBM stack) in this arrangement could be programmed for other tasks like data compression, or whatever logic was needed for the task at hand, in addition to helping the CPU or GPU with specialized processing that neither has been designed for. So it adds to the overall processing capability of the APU for HPC or other workloads.
These FPGAs do not have to wait for a smaller process node to be included on-die with a CPU or GPU in the first place! That space is better used for other things, such as more GPU/CPU cores. The FPGAs in the patent application can be fabbed on whatever process node is most affordable for their inclusion in the APU-on-an-interposer system. The FPGAs are just another chip on the HBM die stack and are interfaced via the TSVs and microbumps, the same as the other chips on the stack. This is definitely an innovative use of the limited amount of interposer space available with current interposer manufacturing techniques. Imagine an HPC APU system on an interposer with FPGAs in addition to the CPU and GPU, with the FPGAs conveniently located on the HBM die stacks, taking up only a little more vertical space on the stack and no extra horizontal interposer space. The perfect place for some extra programmable logic.
Looks to me like MS took AMD's thunder. No hard feelings. Makes you wonder.
I have been wondering if a larger number of smaller draw calls will reduce memory usage. It seems like it would reduce the size of the active working set which would make caching more efficient. Also, will splitting the workload up into smaller chunks help distribute the load across multiple GPUs? It would be great if we could actually have multiple GPUs work together efficiently to produce a single frame. Multi-GPU techniques currently used seem to provide little benefit beyond dual setups due to using things like AFR.
Basically, yes.
If you look at it from a very simple standpoint: draw calls are a resource. They don't map to any real, physical process. It's not something like "number of cores" or "frequency" that is limited by physics. It's an overhead. Reduced overhead means you have more "draw call" resource.
When you are running low on this resource, you will need to compensate elsewhere. This can be an increase in engineer person-hours. This can be changing your algorithm to one that uses more RAM, GPU, longer loading screens, worse results, compatibility, hard drive bandwidth, or something else entirely. Something must give.
If you have more of that resource, then you are able to adjust your algorithms (and data that they run on) to use it instead of RAM, GPU, loading screens, worse results, engineer time, compatibility, and so forth.
What it will save you depends on what engineers needed to do to compensate for it then (which they may not even realize because of how routine it was).
If, for instance, an artist combined all foliage in a scene into a single object, then you will make up for whatever resources are spent loading things you can't see, drawing things you can't see, simulating things you can't see, simulating things that never even interact, etc. All of that can be recovered if you separate the object into its logical subobjec… oh wait you ran out of draw calls.
Also, multi-GPU is a similar concept. If you are able to split your tasks into independent work groups, then having independent workers will help you. You will likely be limited by bandwidth and latency, but taking the time to be clever could be worth it for the extra GPU power. Note the trade-off between GPU and bandwidth, and bandwidth and person-hour. That's important.
This is very true. It is all a tradeoff. And copying a finished piece of the scene to the GPU that actually puts the image on the monitor is costly. VR will benefit, as the cost to send the same data to both GPUs can hopefully be amortized while the performance per GPU remains the same.
The real benefit might be to come up with an algorithm that takes a lot of the scene-independent compute and throws that onto the second GPU: stuff like real-time scene lighting that doesn't really change much from one scene to the next, which is what allows lighting baking in engines. Finding what those are will be the fun thing engine devs will be doing in the next few years, I think.
After all, if you can get one GPU doing most of the texture operations, and your textures take up a majority of your game, keeping them all on one GPU should be a major advantage that Nvidia and AMD couldn't really accomplish without your source code before.
“A nine-bone skeleton even looks like a terrible way to animate three book-shaped birds.” That skeleton is supposed to be attached to a nice approximation of a bird (or birds) in a higher-resolution mesh, so what if there's a bone that will never be rendered? It's just there to warp the mesh and make it appear to be flapping. Just take one bird mesh, parent/attach (rig) it to the skeleton, do a little keyframing, and the animation follows the skeleton's movements; throw in an array modifier and a V-shaped path with a curve modifier, and presto, a whole flock of migrating birds. Nice feathers are not too difficult to replicate with an array modifier and a curve modifier, “hooked” into the wing/bird mesh vertices for a nice, realistic look and feel. Note: getting the mesh to warp properly and the skeleton rigged to the mesh is not as easy as it sounds.
Yeah. Honestly, it was just an example that I whipped up in Blender to look ridiculous. Attaching an octopus-like skeleton to the roots of skeletal objects, making them one, is a straightforward way to reduce draw calls. Better to just spawn the objects directly, though.
Rigging meshes for movement with bones is not the easiest thing to do, but it's nice to be able to replicate a finished object if you need more than one in a scene, then just get rid of the duplicates when you no longer need them; it's clone and de-clone, among other things, in Blender. I wonder what new things the Blender project is going to demonstrate at SIGGRAPH this year. I know that at least Blender now has AMD Cycles rendering support started for GCN-based GPUs, so that will make the GPU side of the costs more affordable. That grouping into a logical object for draw calls sounds like a good idea, as does choosing the best sub-pixel algorithms to suit the geometry.