When AMD announced their Polaris architecture at CES, the focus was on mid-range applications. Their example was an add-in board that could compete against an NVIDIA GeForce GTX 950 (1080p60, medium settings in Battlefront) but do so at 39% less wattage than this 28nm, Maxwell chip. These Polaris chips are planned for a “mid 2016” launch.
Raja Koduri, Chief Architect for the Radeon Technologies Group, spoke with VentureBeat at the show. In that conversation, he mentioned two GPUs, Polaris 10 and Polaris 11, in the context of a question about their 2016 product generation. In the “high level” space, they are seeing “the most revolutionary jump in performance so far.” This doesn't explicitly state that the high-end Polaris video card will launch in 2016. That said, when combined with the November announcement, which we covered as “AMD Plans Two GPUs in 2016,” it further supports this interpretation.
We still don't know much about what the actual performance of this high-end GPU will be, though. AMD was able to push 8 TeraFLOPs of compute throughput by creating a giant 28nm die and converting the memory subsystem to HBM, which supposedly requires less die complexity than a GDDR5 memory controller (according to a conference call last year that preceded Fury X). The two-generation process jump will give them more transistor budget to work with, but that could be partially offset by a smaller die, given the potential differences in yields (and so forth).
Also, while the performance of the 8 TeraFLOP Fury X was roughly equivalent to NVIDIA's 5.6 TeraFLOP GeForce GTX 980 Ti, we still don't know why. AMD has redesigned a lot of their IP blocks with Polaris; you would expect that, if something unexpected was bottlenecking Fury X, the graphics manufacturer wouldn't overlook it at the next opportunity to tweak it. This could have been graphics processing or something much more mundane. Either way, upcoming benchmarks will be interesting.
And it seems like that may be this year.
“Also, while the performance of the 8 TeraFLOP Fury X was roughly equivalent to NVIDIA’s 5.6 TeraFLOP GeForce GTX 980 Ti, we still don’t know why. AMD has redesigned a lot of their IP blocks with Polaris; you would expect that, if something unexpected was bottlenecking Fury X,”
I think the bottleneck was some what already revealed. Remember back when AMD announced the Fury X they claimed 30% faster then a 980ti IF you used the settings they selected which were settings that took advantage of power of the shaders and most any setting that didn’t was turned off or low. There in lyes the answer. When doing Non-shader array based work like for example AF work amd cards are much slower at doing that work which nvidia cards are better at doing.
Could I be wrong yes but I think that is pretty much hitting the nail on the head.
Kind-of. "Non-shader" hits a variety of fixed-function ASICs. Is it geometry? ROP? etc.
Hey Scott,
The reason you see a 5.6 TFLOPS NVIDIA GPU outperforming an 8 TFLOPS AMD GPU is a trick NVIDIA has played for the last few years: boost clocks. And I'm not talking about the boost clocks you see in GPU-Z; I'm talking about the real GPU clocks during tests or gaming. What we all see (e.g. in GPU-Z) as base and boost clocks has, in reality, nothing to do with the GPU's real clocks; those are just the baseline, and I'm sure all of us know that.
Take a look at the in-test/in-game GPU clocks for the NVIDIA GPUs, for example the GTX 980 Ti:
Gigabyte GTX 980 Ti WaterForce 1510 MHz
ZOTAC GTX 980 Ti AMP Extreme 1505 MHz
MSI GTX 980 Ti Lightning 1518 MHz
Colorful iGame GTX 980 Ti 1432 MHz
Palit GTX 980 Ti JetStream 1515 MHz
ASUS GTX 980 Ti STRIX 1472 MHz
ZOTAC GTX 980 Ti AMP! 1465 MHz
MSI GTX 980 Ti Gaming 1507 MHz
EVGA GTX 980 Ti SC+ 1491 MHz
Gigabyte GTX 980 Ti G1 Gaming 1512 MHz
And the Stock:
GTX 980 Ti 1437 MHz
Over 90% of the 980 Ti cards seen so far run above 1440 MHz in-game, non-overclocked (some even above 1550 MHz, and I think you know that too).
And if we take the real GPU clocks into account, that means:
Let's pick the worst case (the stock sample here):
At 1437 MHz: 2816 SPs * 1437 MHz * 2 = 8,093,184 MFLOPS, which is around 8,093 GFLOPS (roughly 8.1 TFLOPS) on a stock card.
MSI, Gigabyte, ASUS, and EVGA all have their cards running at around 1500 MHz or more in real time.
That means: 2816 SPs * 1500 MHz * 2 = 8,448,000 MFLOPS, or about 8.4 TFLOPS in real time, in-game!
Meanwhile, AMD GPUs already show their maximum real performance at the advertised clocks. NVIDIA has chosen to do it this way for its own reasons.
In some cases, add slightly more tweaked drivers on top of that, and there you go.
*my source for the clock speeds is TPU.
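To make the arithmetic above easy to double-check, here is a minimal sketch of the peak-FP32 estimate (shaders * clock * 2 FLOPs per shader per clock). The 2816-shader count and the in-game clock figures are simply the ones quoted in this comment, and the 1000 MHz reference base clock is NVIDIA's published spec; treat it as illustrative rather than a measurement:

# Peak FP32 estimate: each shader can retire one FMA (2 FLOPs) per clock
def peak_tflops(shaders, clock_mhz):
    return shaders * clock_mhz * 2 / 1e6  # shaders * MHz -> MFLOPS, /1e6 -> TFLOPS

GTX_980_TI_SHADERS = 2816

for label, mhz in [("advertised base clock", 1000),
                   ("observed stock in-game clock", 1437),
                   ("typical AIB in-game clock", 1500)]:
    print(f"GTX 980 Ti @ {mhz} MHz ({label}): {peak_tflops(GTX_980_TI_SHADERS, mhz):.2f} TFLOPS")
# prints roughly 5.63, 8.09, and 8.45 TFLOPS respectively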
I have not seen any sites claiming that the 980 Ti is running overclocked without the user knowing. Where has this been reported? What tools do you use to get the GPU clock?
I guess this is “Adaptable GPU Boost technology 2.0”? Perhaps I just haven't seen any real GPU clocks in the reviews here at PCPer. I believe they just report the GPU clock as 1000 MHz.
Yes, GPU Boost 2.0 results in big differences between the advertised (aka guaranteed) clock speeds and the actual clock that the card will achieve without any user input.
Yeah, they boost quite high.
Another funny fact is that, regardless of your overclocking, the in-game clocks will stay stable at around 1450-1500 MHz on most cards. That is why all these NVIDIA GPUs can easily be overclocked up to that point. And when we try to push the GPU above 1500 MHz (which forces the in-game clocks well over 1500 MHz), we always hit a wall in overclocking unless we increase the vcore.
Even if it were boosting at only 1200 MHz, the 980 Ti would already be doing way more than just 5.6 TFLOPS.
ROPs. It’s been suspected for a long time that Hawaii, and now Fiji, were being strangled by only having 64 ROPs. I don’t know why AMD couldn’t put more ROPs on Fiji, but I have a suspicion that if it had, say 96 or 128, it would’ve been everything AMD claimed it was going to be, even if they’d had to sacrifice some shaders to make space for them.
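If the ROP theory holds, a rough fill-rate comparison illustrates the gap. This is only a back-of-the-envelope sketch assuming the commonly cited ROP counts and reference clocks (64 ROPs at about 1050 MHz for Fiji, 96 ROPs at about 1000 MHz for the 980 Ti's GM200), one pixel per ROP per clock, and no boost:

# Pixel fill rate: ROPs * clock, assuming 1 pixel written per ROP per clock
def fillrate_gpix_per_s(rops, clock_mhz):
    return rops * clock_mhz / 1000.0  # MHz -> Gpixels/s

print("Fury X (Fiji, 64 ROPs @ 1050 MHz): ", fillrate_gpix_per_s(64, 1050), "Gpix/s")   # ~67
print("980 Ti (GM200, 96 ROPs @ 1000 MHz):", fillrate_gpix_per_s(96, 1000), "Gpix/s")   # ~96

So despite Fiji's large shader advantage, it would trail GM200 on raw pixel throughput if it really is ROP-bound.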
“Also, while the performance of the 8 TeraFLOP Fury X was roughly equivalent to NVIDIA’s 5.6 TeraFLOP GeForce GTX 980 Ti”
That's DX11 testing, while DX12 and Vulkan are replacing DX11, and those benchmarks will be different. AMD's GPU SKUs have more compute, but that does not mean that “something unexpected was bottlenecking Fury X”; maybe current DX11 games are simply not using as much compute. DX12/Vulkan, as well as VR games, will make use of that compute in AMD's Fury/Fury X SKUs.
AMD's GCN GPUs will all need to be retested on DX12/Vulkan-based games to see the results of that extra compute for gaming. AMD has redesigned for Polaris, but until the gaming software stack begins using the newest graphics APIs, benchmarking on older graphics APIs may not show as much improvement.
“Upcoming benchmarks will be interesting”: not as interesting on the older graphics APIs; they will need to be done on the newer APIs.
Speaking of benchmarks!
“AMD accuses Intel of VW-like results fudging”
“Revives ancient SYSmark dispute”
http://www.theregister.co.uk/2016/01/19/amd_accuses_intel_of_vwlike_results_fudging/
To shed some light on the claim that an “HBM memory controller needs less complexity than GDDR5,” let me suggest that there are three main issues to be addressed when communicating with memory.
First, you have to speak the protocol–this means keeping track of open pages, sending commands in the right order, etc. For that, I have to think that HBM is a little more complex than GDDR5, but this is pure logic and can be made very dense.
Secondly, you have to sequence the signals going to the chip. This is generally easy, as it's just a few state machines.
Finally, there is getting those logical signals from one chip to the other. When those signals have to go through a chip package, wander around a PCB made by the lowest bidder, and then into another package and back onto a die, things get complex. The driver hardware on both sides has to compensate for all kinds of irregularities and problems in that path. Then look at what HBM gets to do: it's pretty much one big die, and there's a chip on the other end that will control the chips attached to it. That's *vastly* simpler. That's where I'm guessing HBM wins out over GDDR5. Oh, and this last step is also where most of the power is used.
Nice! Thanks for the insights.
The HBM protocol is actually less complex than GDDR5. GDDR5 has to operate at a very, very high clock rate and send two bits on the rising and falling edge of each clock. It does this by using four different voltages to encode the two bits. Think of this as very similar to how an MLC SSD works. HBM has enough pins that each pin can be low speed and only one bit needs to be sent per clock, more like an SLC SSD.
The biggest reason HBM wins is that you don't need to allocate power budget to big memory controllers that operate at 1.75 GHz and send four bits per clock (1.75 x 4 = 7 GHz effective). Instead, HBM can use a bunch of plain old DDR memory controllers at 500 MHz. Since power consumption is not at all linear with clock speed, using a very wide HBM interface is much more efficient than a narrow but fast GDDR5 controller of the same bandwidth.
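As a sanity check on the wide-and-slow versus narrow-and-fast trade-off, here is a small sketch of the bandwidth math (bandwidth = bus width in bytes * effective per-pin data rate), using the commonly cited figures for Fury X's HBM1 and the 980 Ti's GDDR5; the numbers are illustrative:

# Aggregate memory bandwidth: (bus width in bits / 8) bytes per transfer * transfers per second
def bandwidth_gb_per_s(bus_width_bits, data_rate_gbps):
    return bus_width_bits / 8 * data_rate_gbps

# HBM1 on Fury X: 4096-bit bus, 1 Gbps per pin (500 MHz, double data rate)
print("Fury X HBM1:  ", bandwidth_gb_per_s(4096, 1.0), "GB/s")  # 512 GB/s
# GDDR5 on 980 Ti: 384-bit bus, 7 Gbps per pin (1.75 GHz, quad-pumped)
print("980 Ti GDDR5: ", bandwidth_gb_per_s(384, 7.0), "GB/s")   # 336 GB/s

The wide interface gets there at a fraction of the per-pin clock, which is where the power argument in the replies below comes in.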
I’m not sure how seriously to take your reply considering you don’t understand how quadrature clocking works.
For those following along, no, GDDR5 does not use ‘four voltage levels to send two bits’. It sends *one* bit on each rising and falling edge of *two* different clocks that are 90 degrees out of phase:
Clock0: 01100110011001100110011
Clock1: 00110011001100110011001
Bit #:  00000000001111111111222
        01234567890123456789012
(read the bit numbers vertically: bits 0 through 22)
At each interval, one of the clock signals is either rising or falling. This is what drives the logic.
Your second error is misunderstanding the power use. Power usage of a toggling signal increases *linearly* with frequency, but it increases with the square of voltage. And since you need higher voltages to compensate for the noise of running over a PCB, that is how HBM wins over GDDR5. There is also the secondary issue of capacitance differences between the PCB and the interposer, which also works in HBM's favor.
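For anyone following along, the relationship being argued about is the classic dynamic-power estimate P ≈ a * C * V^2 * f (activity factor times capacitance times voltage squared times frequency): linear in frequency, quadratic in voltage, and proportional to the capacitance of the line being driven. Here is a toy sketch with made-up, purely illustrative values (not measured figures) to show why a short, low-voltage, low-clock interposer link can win even though it needs far more wires:

# Dynamic (switching) power for one signal line: P = a * C * V^2 * f
def switching_power_w(activity, cap_farads, volts, freq_hz):
    return activity * cap_farads * volts**2 * freq_hz

# Illustrative guesses only: a long PCB trace at higher voltage and toggle rate
# versus a short interposer trace at lower voltage and toggle rate.
gddr5_line = switching_power_w(0.5, 5e-12, 1.5, 3.5e9)
hbm_line = switching_power_w(0.5, 1e-12, 1.2, 0.5e9)

print(f"per line: GDDR5 ~{gddr5_line * 1e3:.1f} mW, HBM ~{hbm_line * 1e3:.2f} mW")
# Even with roughly 10x as many data lines, the wide, slow HBM bus comes out well ahead.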
I don't think your explanation came through too well; partially a formatting error? The Wikipedia entry on “quad data rate” explains it in layman's terms with a diagram. I am assuming that the wiki explanation is similar, if not the same as, how QDR is implemented in DDR3 and DDR4. This isn't really that relevant to the lower power consumption of HBM. I believe HBM does actually use DDR signaling, though, like DDR2.
With HBM, the signal only has to go through two solder micro-balls and maybe a few millimeters of the silicon interposer. The interposer itself is not powered (AFAIK), but it can contain passive devices like capacitors to help with signal integrity. GDDR5 has to go through 4 solder balls and several centimeters of 3 different PCBs. For the solder balls, you would have the GPU die to GPU package, GPU package to board, board to memory package, and memory package to memory die. For a DDR4 memory module, it is even worse with another PCB, non-soldered connector, and even greater distances. I would assume that the power consumption differences come mostly from the short drive distances, low number of solder connections, and low clock allowed by the wide interfaces. The short drive distance means lower resistance, capacitance, and interference making for a simpler, lower power controller. Much of the power savings may come from the reduced size and complexity of the controller. The interface voltage seems to be 1.2 V; the same as DDR4, so that doesn’t account for the power reduction. If you, or anyone reading this, work in the field, it would be great to have an expert explanation of where the power savings come from.
I did read an article on ExtremeTech a while ago comparing Wide I/O, HBM, and HMC. I have seen articles about the inefficiency of DDR4 and GDDR. It just isn't going to scale much further, since the interface takes too much power and too much die area, both on the controller side and the DRAM side. HMC goes the opposite route for the interface compared to HBM. HMC uses stacked memory dies also, but the bottom logic die converts the interface into a narrow (8 or 16 bit), high-speed, point-to-point, differential serial link. The ExtremeTech article has some TSMC charts showing bandwidth vs. power and bandwidth vs. price. This unfortunately doesn't include GDDR, but it should be similar to DDR scaling.
The current HBM standard gives each HBM stack its own 1024-bit interface (128 bytes of parallel traces/channels per stack), so a total of four stacks means 4096 bits of parallel traces/channels. Who knows what HBM3 will bring, but HBM2's bandwidth is double that of HBM. I wonder if future JEDEC HBM standards could allow for more than 1024 bits per stack, so HBM could be clocked lower and used more power-efficiently.
The JEDEC standard probably only covers what is needed, standards-wise, to connect to one HBM stack anyway, so there is probably the ability to have 6, or even 8 or more, HBM stacks, or just two stacks or a single stack for mobile APUs that use HBM/HBM2. It all depends on the total bandwidth needed for the processor.
The interposer definitely has the ability to have tens of thousands of parallel traces etched into its silicon substrate to HBM and other dies. Where the interposer will really take off is when AMD starts to make high-powered server APUs with its highest-end accelerator GPUs sharing the interposer with Zen cores and 32GB of HBM. AMD will be able to have thousands of wide parallel traces directly between the Zen cores' die and the separate GPU accelerator die, for a total CPU-to-GPU bandwidth on the interposer package that is many times the effective bandwidth of any PCI-based CPU-to-external-GPU connection in existence. AMD will definitely derive a consumer gaming variant from its server/exascale APUs, so future gaming PCs, and consoles, can have much more processing power.
I could even see Oculus, the VR gaming hardware company, introducing a VR gaming console of its own and commissioning AMD for the APU; Oculus's owner has the funds to commission a VR gaming console. Either way, AMD is going to have many different APU-on-an-interposer SKUs in the future.
It isn’t a 1024-bit channel. Each stack has somewhat independent 128-bit channels, 2 per die. This makes 8 channels in a single 4-hi stack and a total of 1024-bits for each stack. HBM2 allows for 8-hi stacks, but I don’t know if it is still 2 channels per die. There has to be area devoted to passing the signals down the stack. If it is still 2 channels per die, then an 8 hi stack would be 16 channels and 2048 bits. I suspect 4 high stacks will be common for high-end consumer parts, and mostly professional parts will use 8-hi stacks.
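Spelling the channel math out may help; this is just a sketch of the numbers in the comment above (128-bit channels, two per die), with the 8-hi case treated as the open question it is here rather than a confirmed spec:

# Interface width presented by one HBM stack, per the figures discussed above
def stack_width_bits(dies, channels_per_die=2, channel_bits=128):
    return dies * channels_per_die * channel_bits

four_hi = stack_width_bits(4)   # 8 channels -> 1024 bits per stack
print("4-hi stack:", four_hi, "bits; four stacks:", 4 * four_hi, "bits")  # 1024 and 4096

# If an 8-hi HBM2 stack really kept two channels per die (unconfirmed here), it would be:
eight_hi = stack_width_bits(8)  # 16 channels -> 2048 bits per stack
print("hypothetical 8-hi stack:", eight_hi, "bits")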
As for the connections between CPUs and GPUs, they don't necessarily need to be that fast. They both just need to be able to access the same memory space at high speed. This could be implemented several different ways. Given the bandwidth requirements, it will be better to have the HBM connected to the GPU and a connection from the GPU die to the CPU die. The connection to the CPU die wouldn't actually need to be that fast compared to other connections in the system, so speaking of thousands of traces to connect them is overkill. The interposer will allow a lot of other interesting things, though. Given that yields on 14 nm may not be good enough to support giant GPU dies like we see at 28 nm (close to 600 square mm for the Fury), it may be an effective solution to use multiple smaller GPU dies.
The product I am most interested in, besides a high-end HBM GPU, is a mobile APU with one or two stacks of HBM. This could deliver exceptional performance with unprecedented power efficiency. It solves the main problem with integrated graphics, which is the lack of memory bandwidth delivered by system memory. Even a single stack of HBM1 can deliver a gigabyte of memory with 128 GB/s of bandwidth. A Nvidia 960 is 112 GB/s and a 970 is (actually) 196 GB/s. With HBM2 they could easily have 2 GB and 256 GB/s in an APU.
I think that AMD is going to be making a high-end, Zen-based laptop APU on an interposer, with the Zen cores wired directly via the interposer to a more powerful, separate mobile Polaris GPU variant, and with both the Zen cores and the GPU sharing the HBM/HBM2. So I would expect an AMD laptop APU on an interposer that has more graphics/CPU processing power than the current-generation consoles.
I'm hoping for a Steam Machine branded laptop with this mobile/laptop APU on an interposer! It should be more powerful than most non-HBM/interposer-based laptop SKUs; even laptops with discrete mobile GPUs still have to communicate over PCIe, while that interposer-based APU will probably have thousands of direct traces between the Zen cores and the Polaris GPU, for far more than PCIe x16 effective bandwidth at a much more power-efficient, lower clock speed.
I'll bet the APU will probably only get two HBM2 4-Hi stacks (8GB of memory total), but that will be enough for the GPU and the CPU to each have enough memory to host the OS, game engine, and game, while still letting the GPU have enough memory for textures and higher-resolution laptop gaming. Samsung plans to produce an 8GB HBM2 8-Hi DRAM package within this year, so two HBM2 stacks could support 16GB of memory.
It will be great if we can get the low-end and mid-range parts at the same time. The mid-range part would probably perform close to 28 nm high-end parts. I am curious as to whether we will get an HBM part, though; both parts might be GDDR5-based. It may only be the high-end part, coming later, which will be HBM or HBM2. It doesn't seem likely that they would design a mid-range part capable of connecting to both, since this would be a waste of die space.
I thought that the primary issue with Fury was that the HBM was awesome, but Fiji itself can’t clock high enough to satisfy it.
https://youtu.be/A1_5plE9JMg
I think the problem with Fury was rather the lack of software support for overclocking. The current version of Sapphire's software suite can also modify the voltage, but it was released several months after the Fury launch.
Pretty much all online reviews compare overclocked GTX 980 Ti cards with barely/non-overclocked Fury (X) cards.
“but do so at 39% less wattage than this 28nm, Maxwell chip” – this won't help PCPer NOT look NVIDIA-biased.
Comparing total system wattages when talking about the GPUs?
86W vs 150W, was it? How much does the system without the GPUs consume? 60W at least?
So AMD made a 26W card that performs just as well as a 90W NVIDIA one. That's roughly 246% more efficient, not “39%”.
Such a super fanboy site, really… that, or just incompetent.
To be fair, you are comparing 28nm planar silicon to 14nm FinFET silicon, so you expect it to be more efficient. However, you are right in stating that he used the wrong words to describe the comparison; he should have stated “39% less total system wattage on average than …”
Reading on through this article Mr. Michaud wrote, it's very evident he is looking at this with high-end desktop GPU gaming performance [not efficient, small silicon] in mind. Never mind the title of the article; you have:
“This doesn’t explicitly state that the high-end Polaris video card will launch in 2016.”
“We still don’t know much about what the actual performance of this high-end GPU will be”
This, despite the information being about small, efficient GPU chips (from either company), should SCREAM speculative/opinion piece straight off the bat, not informative, accurate data for cross-examination and judgement of the given product(s).
As to your last statement, it just makes you look like an AMD fanboy more than anything else. Sure, Michaud comes across in this article as a tad NVIDIA-biased with the TFLOP (compute) comparison for (gaming) performance, but hasn't AMD used this metric in its marketing material for consumer products? Because AMD never uses specific game settings to fully maximise its hardware strengths…
Slandering the whole site and its staff over one author's article, talking about subjects not directly related to the data… yeah, good one, mr/ms/mrs/prof/dr/cmdr Anonymous.
P.S. Show me a site that's 100% neutral and objective… there is a good reason to use multiple sites. You should try sites like LTT, Guru3D or Overclock3D… they totally give a fair shake (with GPUs)! /sarcasm