The Tech Report takes you on a walk through NVIDIA's HPC products to show you just what is interesting about the Tesla P100 HPC accelerator which Jen-Hsun Huang introduced us to. The background gives you an idea of how much has changed from NVIDIA's first forays into HPC to this new 16nm, 610mm² chip with 56 SMs. If you missed the presentation, or want more information about how they pulled off FP16 on natively FP32 hardware or how the chip's cache is set up, then click on over and read it for yourself.
"Nvidia's GP100 "Pascal" GPU launched on the Tesla P100 HPC accelerator a couple weeks ago. Join us as we take an in-depth look at what we know about this next-generation graphics processor so far, and what it might mean for the consumer GeForces of the future."
Here is some more Tech News from around the web:
- Microsoft delivers new previews of Windows Server 2016 and System Centre 2016 @ The Inquirer
- Time for a patch: six vulns fixed in NTP daemon @ The Register
- Searching for USB Power Supplies that Won’t Explode @ Hack a Day
- Hackers so far ahead of defenders it's not even a game @ The Register
- Trouble at t'spinning rust mill: Disk drive production is about to head south @ The Register
- Tech ARP 2016 Power Bank Giveaway
I’m reading the posts over
I'm reading the posts over there and there appear to be a lot of questions about Pascal and async compute. There was an article that ran on The Register or its sister publication The Next Platform stating that the Pascal design has improved on its processor thread scheduling granularity and that scheduling is now handled down to the single instruction/SIMD level. There definitely needs to be a more detailed comparison of Polaris's and Pascal's actual async-compute abilities, and there is a link to an Nvidia Pascal white paper at The Tech Report, so I guess it's time for some reading.
The only Pascal that has been
The only Pascal that has been announced is a Tesla card; async compute means next to nothing for that card due to the different kind of workload it runs. Best not to guess or make up stuff without any info on it. It's a safer bet that Nvidia has worked on that and it will work fine in Pascal, even it being formerly AMD-locked tech.
There is a link in an earlier
There is a link in an earlier article on PCper that points to the article stating Pascal has improved its processor thread scheduling abilities/granularity. So go figure. I have read the article and I'd post the links, but the spam filter on this website is hell sometimes!
So you, as usual, arbiter, do not know what you are talking about; the server Pascal variant has better async-compute features, and Nvidia's consumer variants had better get the same or Nvidia will be behind for VR gaming! Gaming that needs more non-graphics gaming compute done on the GPU to reduce the CPU-to-GPU latency issues for VR. So go read the article on your own, and work on your Google-fu researching skills so you do not need any diaper changing or hand holding! FFS, some of the gits in the gaming world are not qualified to shovel horse manure, let alone understand the finer details of processor technology. And the professional server-market websites have a better handle on GPU processor technology than any enthusiasts' websites, and now that Nvidia and AMD (hopefully) are getting even more of their SKUs onto server/HPC systems, the information about GPUs will be better for any who take the time to read!
edit: will about
to: about
It seems that for current
It seems that for current Nvidia GPUs, you can only preempt in between draw calls. I don't think that can really be fixed via software; the low-level granularity can't be emulated easily. Developers who can get better performance by using asynchronous compute will have to implement a separate code path to support current Nvidia GPUs. That is a pretty large installed base, so they should support it for a while. With Pascal probably improving the scheduling granularity support, current Nvidia GPUs will be left behind relatively quickly, even if it seems like they should have sufficient compute resources. I don't think this is anything new. I have seen many people complain about the performance of slightly out-of-date Nvidia GPUs not being as high as expected. Planned obsolescence sells more GPUs, as long as customers are still willing to buy a new one.
I would not assume that Nvidia's implementation in Pascal will be equivalent to AMD's, though. It will be Nvidia's first implementation of such fine-grained scheduling. AMD has had ACEs since the original GCN 1.0 GPUs (only two ACEs in the initial revision) released in 2012, so they have several years more experience than Nvidia. Also, such fine-grained scheduling is more CPU-like, which AMD obviously has significantly more experience working with, although even coarse-grained threading for a CPU is probably relatively fine-grained compared to most GPU scheduling.
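To make the "separate code path" point concrete, here is a rough sketch of what that branching tends to look like. CUDA streams are used here only as a stand-in for DX12/Vulkan command queues, and the kernel names and the useAsync flag are made up for illustration; a real engine would key this off a device or vendor query.

```cuda
#include <cuda_runtime.h>

__global__ void graphicsLikeWork(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 0.5f + 1.0f;   // stand-in for shading work
}

__global__ void computeLikeWork(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i];        // stand-in for physics/culling work
}

void submitFrame(float *gfxBuf, float *cmpBuf, int n, bool useAsync)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    dim3 block(256), grid((n + 255) / 256);

    if (useAsync) {
        // Path A: two streams, letting the GPU overlap the work if it can.
        graphicsLikeWork<<<grid, block, 0, s0>>>(gfxBuf, n);
        computeLikeWork <<<grid, block, 0, s1>>>(cmpBuf, n);
    } else {
        // Path B: everything back to back on one stream, for hardware that
        // only switches between workloads at coarse boundaries.
        graphicsLikeWork<<<grid, block, 0, s0>>>(gfxBuf, n);
        computeLikeWork <<<grid, block, 0, s0>>>(cmpBuf, n);
    }
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

Whether path A actually helps depends on how finely the hardware can interleave the two queues, which is exactly the granularity question above.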
The link is in a post on this
The link is in a post on this PcPer article!
https://pcper.com/reviews/Graphics-Cards/NVIDIA-Pascal-Architecture-Details-Tesla-P100-GP100-GPU
go read! and read the replies!
>Best not to guess or make up
>Best not to guess or make up stuff without any info…<
>It's a safer bet that Nvidia has worked on that and it will work fine…<
So others should refrain from guessing, yet it is perfectly OK for you to 'bet' (as if there's a difference)?
>…being formerly AMD-locked tech.<
Huh? A conscious decision on the part of one company to exclude a certain feature does not make said tech a locked one. The same can be said of a fanboy's refusal (or inability) to think critically; it does not automatically translate into privilege or luxury on the part of people who do.
Async-compute has never been
Async-compute has never been a locked AMD technology and you, arbiter's sockpuppet, know that very well! It was Nvidia's decision with Maxwell to reduce energy consumption at the cost of compute and async compute, cutting back the hardware functionality and marketing the power savings to the gaming market. That worked out fine for gaming workloads under DX11 and earlier APIs, but not so much for Vulkan and DX12 now that VR gaming will use the GPU's async-compute hardware to run VR games with the least amount of CPU-to-GPU induced latency. That CPU-to-GPU latency is the direct result of having to encode and decode into the PCIe protocol, so getting more non-graphics compute, as well as the graphics compute, done on the GPU keeps the latency-inducing encoding/decoding communication between CPU and GPU to a minimum.
Reducing compute in Nvidia's gaming SKUs allowed AMD to have an advantage for other types of GPU compute, like non-gaming graphics and other computing workloads (Bitcoin) and such. It did have its advantages in laptop/mobile SKUs, but now with more HSA-style GPGPU compute and non-graphics gaming compute becoming necessary for gaming/VR, Nvidia will have to put the hardware in its consumer SKUs and compete with AMD's async compute in DX12 and Vulkan enabled games.
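The PCIe point boils down to keeping intermediate data on the GPU instead of bouncing it back to the CPU between stages. A rough CUDA sketch of that pattern follows; stageOne and stageTwo are made-up kernels standing in for whatever non-graphics work a game or compute app might offload.

```cuda
#include <cuda_runtime.h>

__global__ void stageOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void stageTwo(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void runOnGpuOnly(float *host, int n)
{
    float *dev;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    dim3 block(256), grid((n + 255) / 256);
    stageOne<<<grid, block>>>(dev, n);   // intermediate result stays in VRAM
    stageTwo<<<grid, block>>>(dev, n);   // no host round trip between stages

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // one copy back
    cudaFree(dev);
}
```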
edit: doen
to: done
(in more than one place)
my dyslexia-fu is powerful today
AFAIK Nvidia never had
AFAIK Nvidia never had anything similar to AMD's ACEs in their hardware. The notion that Nvidia reducing compute stuff in Maxwell made Maxwell weak in anything compute-related is misunderstood by most people. Even worse, people think that Nvidia dropping the FP64 stuff from their gaming cards made Maxwell (or even Kepler) weak in certain compute work relative to AMD. If Kepler and Maxwell were so weak at compute compared to AMD, then Nvidia would not dominate the HPC market. And using Bitcoin as an argument that Nvidia is weak in compute is not valid. Bitcoin was faster on Radeon because of its architecture design, not because of its compute performance versus Nvidia's. Nvidia tried to improve their Bitcoin performance with Maxwell, but by that time the mining bubble had already burst. There was even an attempt to compare the hash rate of a single 290X versus four 750 Tis, because power-consumption-wise it might favor the 750 Tis.
Well now Nvidia has to get
Well now Nvidia has to get some "ACE"-type functionality into Pascal's replacement, because AMD will be entering the HPC/workstation GPU accelerator market with their ACE-equipped Vega professional variants in 2017! And those markets, and the VR gaming market, want async-compute ability on the GPU (in the GPU's hardware)! So it's not just Bitcoin; other compute tasks that can be done on the AMD GPU's ACE units will keep the need for CPU-to-GPU communication, and the power used on those transfers, to a minimum.
More compute is being done on the GPU without the need for any CPU help or intervention. You can be damn sure that the future exascale computing system builders will be looking at AMD's HPC APU-on-an-interposer systems to get the power savings and high effective bandwidth inherent in HBM's 1024 traces per HBM die; the same APU-on-an-interposer benefits of power saving and high effective bandwidth will be had for CPU-to-GPU communication via direct CPU-to-GPU connection fabrics etched into the interposer, tens of thousands of traces wide.
So both the VR/gaming market and the HPC/supercomputer (exascale) market will want more non-graphics gaming compute and graphics compute done on the GPU, as well as other HPC compute done on the GPU's ACE/ACE-like units from AMD, or from Nvidia when they get ACE-like functionality into their GPU SKUs. Nvidia can build SoCs on an interposer too, even Power/Power8/Power9 systems on an interposer; the Power/OpenPOWER ISA/IP is up for ARM-style licensing by Nvidia, AMD, or any others from OpenPOWER!
On HPC async compute will not
On HPC, async compute will not be similar to how it is in games. In fact, for HPC Nvidia hardware has had support for async compute ever since Fermi. If those ACEs really were important in HPC, then AMD would already have taken the market by storm since 2012, because the ACEs have existed since the first GCN parts. For HPC applications Nvidia has done a lot in their architecture to increase utilization, especially in the double-precision area; they went as far as releasing GK210 specifically for the HPC market for that purpose. AMD released the S9150 (Hawaii-based) FirePro in 2014, which easily eclipses Nvidia's GK110/210 in terms of theoretical performance, and some predicted that in 2015 the top ten of the Top500 list would be dominated by AMD accelerators. In the end it did not happen, because most HPC clients are waiting for Nvidia's Pascal and Intel's KNL.
Is async compute going to
Is async compute going to be important in most games? Right now, with the Hitman devs mentioning that async compute needs to be tweaked for each card (instead of a general optimization for the architecture), it is a feature many devs won't want to deal with unless they are sponsored to use it. And recently HardOCP looked at Hitman performance in DX12. In one of their tests Fiji was actually slower in DX12, while Hawaii was consistently better in every DX12 test. They speculate that because Fiji is based on GCN 1.2, an improvement over GCN 1.1, async compute has more of a positive effect on GCN 1.1. Hence they speculate that going forward async compute will have much less effect on AMD (and Nvidia as well) architectures because of more efficient designs.
Still on that async compute
Still on that async compute crap, huh? It's all up to the game developers. If they see it will be extra work/cost and publishers want the game out by a certain time, you can be sure they won't even bother with that shit.
And AMD doesn't have the money to throw at different game devs to implement this, maybe just the one over at DICE and the AotS folks, lol.
Waaaay too early as well to keep talking about this with the rumours and speculation going around, and you have absolutely nothing better to do with your time and life. I'd rather spend it on playing actual games (Dark Souls 3 & Paragon) and going on my long summer holiday.
Hopefully there will be something concrete out by the end of the year, if not next year. My 980 Ti is sufficient for old and new upcoming games anyway.
I'm out, peace…
It's more up to the gaming
It's more up to the gaming engine/system software engineers and the makers of the gaming engine SDKs to automate things for those game developers who lack the systems-programmer skill set to optimize things themselves.
There will be plenty of SDK plug-ins provided for the gaming engine market to abstract/automate away the hard parts of both Vulkan and DX12. And just like M$'s Visual Studio has code designers for .NET and Forms/other programming to generate code for the noobs and others, there will be the same for the mostly gaming-engine script kiddies who work on those parts of game development. Code can be generated for the script kiddies and then gone over and hand-optimized by more competent systems programmers to make things work better with the close-to-the-metal graphics APIs!
There will even be OpenGL-to-Vulkan wrapper code and code conversion to SPIR-V IR, so even legacy OpenGL code can be run under Vulkan if there are any efficiencies to be had by doing so, and there probably are. Other high-level languages are getting SPIR-V back ends and other integration features to allow more compute/graphics to be done via the Vulkan API.
So it's more up to the systems programmers/software engineers with the big paychecks doing the hand holding for others, and the game developer's budget/greed in getting the game to market in a properly running state, to take advantage of the new graphics APIs!
Am I completely wrong in
Am I completely wrong in thinking (don't shoot me, this isn't even back-of-the-napkin math) that this won't be radically faster than GM200 (outside of double precision)? Given that the SMs now have half the 32-bit ALUs, 60 (56) Pascal SMs are roughly equal to 30 (28) Maxwell SMs – and GM200 has 24, so that's a 16% increase for 56 or 25% for 60. I was expecting more, given the two-node jump in process and the die size being roughly the same. I get that FP64 hardware takes up a lot of die space – might Nvidia be saving the "pure GPU" version of this for whatever comes after Pascal? I.e. ~600mm² with 1/32 FP64 and another 20% or so increase in throughput, produced on the same node?
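For what it's worth, the raw ALU arithmetic behind that comparison is easy to check; the little program below is just a back-of-the-napkin tally using the published SM and per-SM core counts for GM200 and GP100.

```cuda
#include <cstdio>

int main()
{
    // FP32 ALU ("CUDA core") totals: SM count x cores per SM.
    int gm200      = 24 * 128;  // GM200:         3072
    int gp100_full = 60 * 64;   // full GP100:    3840
    int p100       = 56 * 64;   // shipping P100: 3584

    printf("P100 (56 SM) vs GM200:  %+.1f%%\n", 100.0 * p100 / gm200 - 100.0);       // +16.7%
    printf("GP100 (60 SM) vs GM200: %+.1f%%\n", 100.0 * gp100_full / gm200 - 100.0); // +25.0%
    return 0;
}
```

Clock speeds and memory bandwidth will move the real numbers around, of course, but those are the raw ALU counts the comparison rests on.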
GP100 has a 2:1 SP/DP ratio
GP100 has a 2:1 SP/DP ratio because of the workloads it's designed for.
In reality GM200 was a failed architecture because it was supposed to be 20nm and actually improve on GK110. Since TSMC couldn't make 20nm GPUs cheaply enough, Nvidia and AMD stuck with the 28nm process.
They stripped out the DP cores from GM200.
GP100 is focused on high DP performance and has a 2:1 ratio because Knights Landing does, and that's its primary competition.
The CPU and GPU market caters to hyperscale customers that spend billions so the big GPU from Nvidia has to be good for that. GP100 looks even better than Knights Landing on paper.
However, to directly address your question about the SMs: GP100 has 60 SMs, and they can actually keep more threads in flight because of the way the cores within the SMs were reorganized.
And on paper, the FLOPS may seem like less of an improvement than if Nvidia had gone with a lower DP core count, but Nvidia sells tens of thousands of GPUs to hyperscalers at thousands of dollars apiece. The smaller chip is going to be better for games as usual, in proportion to its die size and TDP.
The other thing to take into consideration when comparing these GPUs is the memory subsystem. GP100 and GP104 will both have about 2x the memory bandwidth of the chips they replace.
You’re not answering my
You're not answering my question. I'm not talking about compute, nor GP100 as such; I'm talking about Pascal-based, graphics-focused GPUs. As I said, OUTSIDE of DP performance it doesn't look like a huge jump. Sure, this is a chip that will never see consumer sales, but it's still the only known representative of a family of chips that will.
And yes, of course I know that they stripped the DP cores out of GM200 to give it the maximum amount of graphics processing power possible in that die size. That is why I’m asking what I’m asking.
The thing is, I really don’t give a damn about DP performance – it just isn’t relevant to me. It’s not what I’m asking about either. What I’m saying is that this chip, from the looks of it, doesn’t improve on non-DP performance in the way I was expecting. Are you saying that GP102/104 will be stripped of DP ALUs like Maxwell? As in actually designing the chip without FP64 ALUs, not just disabling them? If not, what you’re saying isn’t really all that relevant.
If you are, that's pretty huge in my book – the effort required to include/exclude this from GPUs is big enough that no compute-focused Maxwell chip was ever launched, after all.
Okay, so halving the number of ALUs per SM allows them to keep more threads in flight. But was that an issue for graphics processing in the previous generation? Does that improve graphics performance? Or does it only improve GPGPU/compute performance?
And yes, of course improved memory speeds will make for better performance. That's a given. What I'm saying is that – from the looks of GP100 – Nvidia isn't pushing the limits of what they can do in terms of graphics processing power. But of course, this is all just speculation until GP102/104 launches.