Process Technology Overview
Evidence points to 20 nm products being undesirable for GPU technology.
We have been very spoiled over the years. We likely did not realize exactly how spoiled we were until it became obvious that the rate of process technology advances had hit a virtual brick wall. Every 18 to 24 months a new, faster, more efficient process node was opened up to fabless semiconductor firms, and with it came a new generation of products that would blow our hair back. Now we are at a virtual standstill when it comes to new process nodes from the pure-play foundries.
Few expected the 28 nm node to live nearly as long as it has. Some of the first cracks in the façade actually came from Intel. Their 22 nm Tri-Gate (FinFET) process took a little longer to get off the ground than expected, and we noticed some interesting electrical characteristics in the products built on it. Intel skewed away from higher clockspeeds and focused on efficiency and architectural improvements rather than staying at generally acceptable TDPs and leapfrogging the competition on clockspeed alone. Overclockers noticed that the newer parts did not reach the same clockspeed heights as previous products such as the 32 nm based Sandy Bridge processors. Whether this was an intentional decision from Intel is debatable, but my gut feeling is that they responded to the technical limitations of their 22 nm process. Yields and bins likely dictated the maximum clockspeeds attained on these new products. So instead of vaulting over AMD’s products, they just slowly started walking away from them.
Samsung is one of the first foundries to offer a working sub-20 nm FinFET process. (Photo courtesy of ExtremeTech)
When 28 nm was released, the plan on the books was to transition to 20 nm products based on planar transistors, thereby bypassing the added expense of developing FinFETs. It was widely expected that FinFETs would not be required to address the needs of the market. Sadly, that did not turn out to be the case. There are many other factors as to why 20 nm planar parts are not common, but the limitations of that particular node have made it a relatively niche process that is appropriate for smaller, low power ASICs (like the latest Apple SOCs). The Apple A8 is rumored to be around 90 mm², which is a far cry from the traditional midrange GPU that runs from 250 mm² to 400+ mm².
The essential difficulty of the 20 nm planar node appears to be a lack of power scaling to match the increased transistor density. TSMC and others have successfully packed more transistors into every square millimeter as compared to 28 nm, but the electrical characteristics did not scale proportionally. Yes, there are improvements per transistor, but when designers pack all of those transistors into a large design, TDP and voltage issues start to arise. More transistors switching means more power to drive the chip, which in turn means more heat. The GPU guys probably looked at this and figured out that while they could achieve a higher transistor density and a wider design, they would have to downclock the entire GPU to hit reasonable TDP levels. Add in yield and binning concerns on a new process, and the advantages of going to 20 nm would be slim to none at the end of the day.
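To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch in Python. The transistor counts, capacitance scaling, and voltages are purely illustrative assumptions of mine (not published foundry figures); the point is only that if density nearly doubles while per-transistor switching power falls by much less, clocks have to come down to stay inside the same TDP.

```python
# Back-of-the-envelope look at why a denser node can force lower clocks.
# All numbers are illustrative assumptions, not measured process data.

def dynamic_power(transistors, cap_per_transistor, voltage, freq_ghz):
    """Approximate dynamic power: P ~ n * C * V^2 * f (arbitrary units)."""
    return transistors * cap_per_transistor * voltage ** 2 * freq_ghz

# Hypothetical 28 nm design: 5 billion transistors at 1.0 GHz.
p_28 = dynamic_power(5.0e9, cap_per_transistor=1.0, voltage=1.0, freq_ghz=1.0)

# Hypothetical 20 nm shrink: ~1.9x the transistors, but only ~25% less
# switching energy per transistor and a small voltage drop.
p_20 = dynamic_power(9.5e9, cap_per_transistor=0.75, voltage=0.95, freq_ghz=1.0)

print(f"20 nm vs 28 nm power at the same clock: {p_20 / p_28:.2f}x")

# To fit back inside the original power budget, the clock has to come down
# (in reality voltage would scale down with it, which helps further).
freq_for_same_tdp = p_28 / p_20
print(f"Clock needed to stay inside the old TDP: ~{freq_for_same_tdp:.2f} GHz")
```

With these made-up numbers, the same design ported to the denser node draws roughly 30% more power at the same clock, which is exactly the kind of gap that forces a downclock or a smaller design.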
Hindsight is of course 20/20, but back in 2012 we started to hear about a push to develop FD-SOI (fully depleted) products for 28 nm and 20 nm. AMD has a history of using PD-SOI (partially depleted), but when it spun off its fabrication arm to GLOBALFOUNDRIES, the group decided to forgo development of any more SOI products and concentrate on bulk silicon (as Intel had done). The idea was that materials such as those used in HKMG production would scale adequately from 28 nm to 20 nm, thereby delaying the R&D costs of developing FinFET technology for another couple of years. Why spend the money now if there is no pressing need for it? If bulk silicon and current materials could power the industry for the next few years, why go off on a side branch of SOI technology that might never pay for itself?
ST-Micro developed a 28 nm FD-SOI process, but unfortunately it was done at a fab that could not provide nearly enough wafers per month to satisfy any kind of demand. If I remember correctly, it was limited to several hundred wafers a month. That would be enough to handle some RF designs, but it would be entirely inappropriate for any kind of large scale production of a part destined for a GPU product line or a low power, mass produced handset. The process itself was a great success in terms of power consumption and transistor switching performance: ST-Micro showed off ARM Cortex-A9 designs that hit 3 GHz while having better overall power characteristics at idle and full load than 28 nm HKMG products.
We started hearing about the potential of this technology, and that a theoretical 20 nm FD-SOI planar product would have slightly better electrical characteristics than Intel’s first generation 22 nm Tri-Gate. A gate-last implementation could have been class leading in terms of feature size and power/speed characteristics. Unfortunately, there was a lot of risk involved in developing a 20 nm FD-SOI product line. Equipment built to handle bulk silicon would have to be modified or replaced entirely to handle FD-SOI. It is an expensive endeavor, and while FD-SOI can support FinFET technology (FinFETs are in fact based on fully depleted deposited layers), most of the current research from multiple competitors has been on bulk silicon. We can address “what ifs” all day, but in hindsight, whoever had managed to develop planar FD-SOI would have been paid handsomely, considering how long 28 nm HKMG has been extended as a leading edge process technology.
Apple's A8 SOC is one of the first large, mass produced chips based on 20 nm planar technology. (Photo courtesy of Chipworks)
Looking over the foundry landscape, we now understand why the 28 nm HKMG process has lasted as long as it has. It is no longer cutting edge, but it is well understood and quite mature. AMD and NVIDIA have had to do a lot more in terms of design to overcome the limitations of the 28 nm HKMG process. Some years ago I theorized that we would see process technology come to a standstill for a longer than expected time, and that is when design and engineering would have to come to the fore to drive chip level improvements.
28 nm for GPUs Through 2015
This is where some speculation begins. So far we have only seen 28 nm products from NVIDIA as they have refreshed their lineup with Maxwell based parts. The GM200 is a massive chip at around 600 mm², which is near the practical reticle limit for 28 nm. Yes, companies like IBM have built larger chips, but those are not mass produced parts that have to carry reasonable margins while addressing the consumer market. The GM200 looks to be the final puzzle piece for NVIDIA throughout 2015, with Pascal based parts being introduced in 2016.
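As a rough illustration of why a ~600 mm² die is such a risky proposition on anything but a mature process, here is a simple dies-per-wafer estimate paired with a basic Poisson yield model. The defect densities and the yield model itself are illustrative assumptions, not actual foundry data.

```python
import math

# Rough dies-per-wafer and yield comparison for large GPU dies.
# Die sizes and defect densities are illustrative assumptions, not foundry data.

WAFER_DIAMETER_MM = 300.0

def gross_dies_per_wafer(die_area_mm2):
    """Common approximation: wafer area / die area minus an edge-loss term."""
    d = WAFER_DIAMETER_MM
    return int(math.pi * (d / 2) ** 2 / die_area_mm2
               - math.pi * d / math.sqrt(2 * die_area_mm2))

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Simple Poisson yield model: Y = exp(-D0 * A)."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)

for area_mm2 in (400.0, 600.0):
    for d0 in (0.1, 0.3):  # defects per cm^2: a mature node vs. a young one
        gross = gross_dies_per_wafer(area_mm2)
        good = gross * poisson_yield(area_mm2, d0)
        print(f"{area_mm2:.0f} mm^2 die, D0={d0}: {gross} gross, ~{good:.0f} good dies per wafer")
```

Under these assumptions, a 600 mm² die on a young, defect-prone process yields only a fraction of the good dies a ~400 mm² part would, before binning for clocks even starts.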
Fantastic article Josh.
A great read. Thanks
Finally, my patience hath been rewarded. A joshwalrath article! On April Fools no doubt.
It's not really here. It is just gibberish disguised as an article. So yes, April Fool's on you!
Awesome article.
Very interesting!
AMD 390X Fiji will be 8GB HBM made with Samsung 14nm FinFET process and 165W TDP. Expect this to be clocked at 1300 to 1400MHz and it’ll blow away the existing GM200 by 40% in a wide variety of workloads.
I predict AMD to take 70% discrete graphics card market share this year.
Why hold back? It is April 1st!
Because the best April Fools jokes are the ones that are not that obvious; you get suckered in and believe it for a while before you realize what has happened, lol. Make it too obvious and then the joke is sort of a wasted effort 🙂
and sell for 300$
“The Apple A8 is rumored to be around 90 nm squared”
That’s a small SOC? and the 2 billion transistors on that 90nm must be subatomic, as 90nm squared only holds about 900 x 900 810,000 hydrogen atoms, and silicon atoms are a little larger(.111nm).
Whoops! Should be mm squared. Where are my editors when I need them?
Edit: about 900 x 900 810,000
To: about 900 x 900, or 810,000
Author, please note that “xx units squared” and “xx square units” are not equivalent statements. When you say “600 mm squared”, you’re saying the die is 23.6 inches on a side and has an area of 360000 mm^2.
Thanks, apparently my fingers really enjoyed adding that extra "d"… does that sound bad?
You think about D too much
and by that I mean drinking.
Nah, I only really think about drinking on Wednesday and Friday nights.
Wednesday, what a convenient day of the week to have an April fools day on.
Oh hey, nice article Josh.
Maybe we will be on time this evening? Ha!
Wednesday was April Fools, so you were really thinking about the other D? o_O
haha
It might suck if AMD did try to go 20 nm and it isn’t working out. People don’t realize that these decisions are often made years in advance of when we start hearing about the actual final product. We can sit here and say (in hindsight) that they should have continued to use 28 nm. I suspect Intel’s 14 nm process isn’t yet suitable for large dies either. They have only been releasing very small die chips from 14 nm production. Currently released Broadwell chips are only about 82 square mm, which is tiny.
Good observation. At least we know that Intel can do large 22 nm chips (Haswell E, Xeon, etc.). I think a lot of work still needs to be done on 14 nm, but at least they are shipping product based on it.
Yeah, I took note of that right around launch. I have not been surprised by the lack of larger chips.
It is definitely good to know that there are still improvements to be had at 28 nm, so when the time comes that it’s too costly to go any smaller with circuit densities in the X and Y, improvements can continue to be made on any process node. Maybe we will even see stacked CPU/GPU transistor designs like memory is starting to be made with. Even if at first just the on-die memory areas (cache and the like) of a CPU could be stacked, more CPU die area could be made available for extra processing logic. Even before the planar process reaches its limits, the more established process nodes with the better yields could begin to use stacking and save the high costs of shrinking the planar dimensions any further. FinFET is a start, as is memory stacking, so CPU/GPU logic will also be stacked eventually, although probably not to the extent that memory will be.
AMD’s use of its GPU design libraries for its x86 Carrizo APUs may be the way to go for mobile APU parts, and with DX12 and Vulkan able to take better advantage of multicore CPUs, individual CPU single threaded performance is not as important as before. I just wonder what kind of density AMD would get if it decided to make the mobile versions of its custom K12 ARM cores using GPU design libraries and build some very powerful 8 core phone/tablet SOCs. With the room saved by memory stacking, and with high density design libraries making the CPU part of the APU take up less space, there could be more room dedicated to graphics resources. Certainly any AMD custom ARM ISA based APU that utilized the high density design libraries could afford to have larger reorder buffers, more execution pipelines, and SMT units in its CPU cores than any competing Apple or other maker’s CPU cores built on non high density libraries, for SOCs made on an equivalent fabrication process node.
High density design libraries and reworked CPU/GPU layouts could get more life out of a process node, and save billions, before going fully 3D with everything becomes the only way to get more circuit density. Going smaller than 14 nm or 10 nm may not be worth the added expense.
Keep the AMD news flowing! Getting tired of the Intel-Nvidia news all the time.
Awesome read Josh!!!
Great article Josh (as usual).
Just one thing I am not sure about –
“Few did not expect the 28 nm node to live nearly as long as it has”
You are basically saying that “many expected the 28 nm node to live as long as it has”
I think the above should have read:
“Few expected the 28 nm node to live anywhere near as long as it has.”
Thanks, fixed! This is where I complain about editors again!
JoshTekk FTW.
Articles like these are the reason why I check this site every day. Bravo!
“Compare that to the older Tahiti (which powers the current R9 280 series) that has 4.3 million transistors”
-> 4.3 *b*illion
proofreading? lol
Voodoo 2 transistor counts!
But can it, you know, run Crysis? #JoshTekk
Great article!
I am not entirely sure any modern product can run Crysis at high levels…
Funny how NOTHING… even if you have quad Titans, can’t ever achieve more than 80fps in Crysis 1.
Sad… means you can’t ever play that game with LightBoost
I need to re-install that on my test bench and see how it runs/looks with modern hardware.
Great article Josh, thanks
Great article as usual. I think Intel foresaw the problems with planar 20 nm/22 nm and decided to go FinFET directly instead of doing the clusterf**k that TSMC and Samsung ended up doing. TSMC’s 16 nm FF is essentially 20 nm with FinFETs. They even have it on their official website that 16FF provides virtually no area benefit over 20 nm planar. TSMC’s 16 nm FF+ and Samsung’s 14 nm FinFET seem to be smaller than 20 nm planar but larger than Intel’s 14 nm process.
If only they had followed in Intel’s footsteps and avoided 20 nm planar.
I don’t think the GPU industry will face too many problems regarding the differing quality of these processes. GPUs are designed to be extremely wide machines anyway; going even wider is no problem for them. The CPU side is certainly worrisome. I think going heterogeneous is the only solution for large increases in performance. IMHO it’s pointless to put more than 6-8 cores in consumer parts. If somebody wants to design consumer software that uses more than 8 cores, they might as well take the parallelizable parts and run them on GPUs a la HMP.
Anyone care to enlighten me as to why GPUs are still (for the most part) focused on a large single die if they are having issues with die size limits? To me it seems like if large dies are difficult at the smaller processes, make the changes needed to use multiple dies. Overall power consumption would rise, and the design likely wouldn’t carry over to low power mobile use, but leveraging multiple chips (with the appropriate architecture changes) should be able to compensate for the delays and limitations you get with the interconnections between dies. It is a similar concept to the multi GPU cards we see already, but you get smaller dies by specializing, removing redundancy, and cutting the areas that are used to make each die a standalone GPU.
Well, basically the reason is that multiple chip solutions do not scale very well. Look at the issues we have had for years with SLI and CrossFire. Even though those standards have improved over time, they are still problematic. It is the same reason the R9 295X2 is sometimes slower than a single R9 290X. It is harder to split the workloads effectively across two different chips. There are just too many parts that need to be able to share memory addresses in a single pool to be the most efficient at processing graphics.
Well then. Why do you propose a single GTX 980 is faster than 2x GTX 960s? Both have roughly the same number of parts.
If you match clockspeeds of the parts, then I would imagine that in an SLI optimized game they will be about even. In a game that does not scale well (or at all) with multi-GPU solutions, the GTX 980 will thoroughly trounce the SLI'd 960s.
There are a lot of technical issues with making a multi-chip solution. Splitting up the workload can cause problems, but I suspect a lot of these are going to be reduced as developers and engines evolve to make better use of multiple chips. We have some games that scale very well currently. I am wondering if DX12 and Vulkan will significantly increase the efficiency of multi-GPU solutions. They handle multiple CPUs much better, and removing the draw call bottleneck may make splitting the workload between multiple GPUs much easier. We will not be able to test this properly until we get DX12 optimized engines.
Also, we are still left with each GPU requiring separate and duplicated memory, which is wasteful. That means the expense of extra chips and a lot more power consumption for multiple sets of memory and multiple high speed memory controllers. Two 960s usually only give you 2 GB (since the memory is duplicated), while a 980 has 4 GB, so there are a lot of things which will not run on two 960s with only 2 GB of local memory. Truly sharing memory between GPUs is probably not easily workable. We do have high speed links available, but the latency is going to be higher and the bandwidth will not get close to local memory. Nvidia’s NVLink (first generation) is only going to be 80 GB/s or so. Local memory will be hundreds of GB/s. They could devote more pins to GPU-to-GPU communication, especially with a design like AMD’s with HBM. The memory is on package, so the only things routed outside of the interposer package are the CPU link and power/ground. They could potentially dedicate a lot of external pins (equal to a 256-bit memory interface) to GPU-to-GPU interconnect. This runs into the issue that it would be a large, expensive device, and the interconnect pins would not be useful for a single GPU device. That may make it not economically worthwhile to design and build such a device.
There are a lot of interesting designs they could go for, but most will not be worthwhile unless the issue is forced. If you can’t get good yields for large dies, then some of these aspects may shift in the direction of multiple die solutions. It would be interesting to essentially design a giant single GPU split across multiple chips and place them on an interposer that allows wide links. Some of the memory sharing issues may be alleviated by compression and caching techniques, but this lends itself to fast local memory with a single slower pool of memory connected further out. This is kind of how the Xbox One is designed, with 32 MB of fast local memory and a much larger pool of slower memory.
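To put rough numbers on the bandwidth gap described in the comment above, here is a quick sketch. The interface widths and per-pin data rates are ballpark assumptions of mine (roughly GTX 980 class GDDR5 and a wide first generation HBM setup); the 80 GB/s figure is the first generation NVLink estimate mentioned above.

```python
# Rough bandwidth comparison: local memory bus vs. a chip-to-chip link.
# Interface widths and per-pin data rates are ballpark assumptions for illustration.

def bus_bandwidth_gbs(width_bits, data_rate_gbps):
    """Aggregate bandwidth in GB/s for a parallel memory interface."""
    return width_bits * data_rate_gbps / 8.0

# 256-bit GDDR5 at ~7 Gbps per pin (roughly GTX 980 class).
local_gddr5 = bus_bandwidth_gbs(256, 7.0)

# A wide on-package HBM configuration: 4096 bits at ~1 Gbps per pin.
local_hbm = bus_bandwidth_gbs(4096, 1.0)

# A first-generation NVLink-style chip-to-chip link, cited above as ~80 GB/s.
chip_to_chip = 80.0

print(f"Local 256-bit GDDR5: ~{local_gddr5:.0f} GB/s")
print(f"Local 4096-bit HBM:  ~{local_hbm:.0f} GB/s")
print(f"Chip-to-chip link:   ~{chip_to_chip:.0f} GB/s "
      f"({chip_to_chip / local_hbm:.0%} of the HBM figure)")
```

Even with generous assumptions for the external link, local memory keeps several times the bandwidth, which is a big part of why duplicating memory per GPU has remained the practical choice.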
There’s a lot of overhead built into making two cards that each have everything they need to run independently (with that being their primary design) and then shoehorning in the ability to make them work in conjunction. There’s a big difference in making SLI/Crossfire work, because it isn’t handled seamlessly from the perspective of the APIs used to make the games. From the perspective of the program, as long as it talks to the same instruction/controller chip, how that chip gets the work done is mostly irrelevant to anything that isn’t custom designed for a chipset.
In my (admittedly limited) understanding, going with multiple dedicated purpose built chips with a proper instruction chip to coordinate them all would be much different. You’d have better binning on the core chips which are allowed to be smaller and take advantage of the cutting edge processes that haven’t matured enough to allow for larger single chip dies. This also impacts yield, which is very important to cost. A design like this also lends itself toward the massive pixel challenge of the 4k/8k monitors by way of simple scaling and obfuscation in the black box of firmware.
There may even be some losses due to the fact that you’re effectively emulating a single chip design with a multi-chip setup, especially when supporting older low level code that was designed for older generation chips. To me it is still a viable path as long as you come out ahead in speed, even if you have some challenges interconnecting and coordinating the chips. TDP and power requirements aren’t as big of a deal in my eyes for desktop (entirely different situation for mobile/laptop), and multiple chips alleviate some of the issues large single chips have with the transfer and dissipation of heat. Memory issues are there, but a lot of that is coordination and the inherent non-local memory latency. Depending on what is included on the smaller dies, they may not need as many pins, and there is inherently an increase in latency when moving memory further away, but how much of that would actually slow the entire process down when you are pushing massive numbers of pixels?
Whether or not any of that makes a multi-chip effort viable comes down to how big of a jump you get out of moving to the next fab node. If it’s a big enough jump and die size is what’s holding progress back, it seems to me there are ways to move forward instead of waiting if they wanted to. Then again, if both companies are certain the other one will wait, then there’s good business logic to waiting as well.
The answer is very simple. The larger the distance that data has to travel, the slower the rate at which it travels (due to many reasons that I don’t know). If you have two chips, they are going to be separated by at least a millimeter or two. Inside a die, the distance between transistors is on the order of nanometers. That is almost 6 orders of magnitude more distance the data has to travel. As a consequence, data transmission speed is really slow (in comparison).
That’s how I see it, anyway.
There are a lot of limitations. For running through the PCB, high speed requires a lot of area on both dies due to the complexity of running a multi-layer packetized interface. It is essentially like going through a network card vs. just a connecting wire on die. It also requires significantly more power to drive signals that far. You also run into pin count limitations. GPUs already have thousands of solder balls on the bottom of them to drive wide memory interfaces (256-bit or more). The connection between CPU and GPU is technically only 16 bits wide (PCI-E x16), but it is very fast. On-die, you might have 1024-bit connections and such, since it is just a wire and it operates at core clock.