A Look Back and Forward
What makes NVIDIA’s new RTX GPUs tick?
Although NVIDIA's new GPU architecture, revealed previously as Turing, has been speculated about for what seems like an eternity at this point, we finally have our first look at exactly what NVIDIA is positioning as the future of gaming.
Unfortunately, we can't talk about this card just yet, but we can talk about what powers it.
First though, let's take a look at the journey to get here over the past 30 months or so.
Unveiled in early 2016 and marked by the launch of the GTX 1070 and 1080, Pascal was NVIDIA's long-awaited 16nm successor to Maxwell. Constrained by the oft-delayed 16nm process node, Pascal refined the shader unit design originally found in Maxwell, while lowering power consumption and increasing performance.
Next, in May 2017 came Volta, the next (and last) GPU architecture outlined in NVIDIA's public roadmaps since 2013. However, instead of the traditional launch with a new GeForce gaming card, Volta saw a different approach.
Volta launched with the Tesla V100 and later expanded to the TITAN and Quadro lines, and it eventually became clear that at least this initial iteration of Volta was targeted at high-performance computing. A record-breaking CUDA core count of 5120, HBM2 memory, and deep learning acceleration in the form of fixed-function hardware called Tensor cores resulted in a large silicon die, unoptimized for gaming.
This left gamers waiting for the next generation of GeForce products confused about whether we would ever see Volta GPUs for consumers.
In reality, Volta seemingly was never intended in any form for gamers and instead marks the departure of NVIDIA's high-end compute-focused GPUs from its gaming offerings.
Instead, NVIDIA now has the ability to run two different GPU microarchitectures in parallel, targeted at two vastly different markets: Volta for high-end compute and deep learning applications, and Turing for their bread-and-butter industry, gaming.
This means instead of tailoring a single architecture for the best compromise between these different workloads, NVIDIA will be able to adapt their GPU designs to best suit each application.
A distinct lack of performance or technical details at the RTX 2080 and 2080 Ti announcement has led to rampant internet speculation that Turing is merely Pascal 2.0, with some extra dedicated hardware for ray tracing and deep learning, running on 12nm. However, this couldn't be further from the truth.
At the heart of Turing is the all-new Turing Streaming Multiprocessor (SM). Split into four distinct processing blocks, the Turing SM departs from the SM design seen previously in Maxwell and Pascal.
In each of these blocks, you'll find 16 FP32 cores, 16 INT32 cores, two Tensor Cores, one warp scheduler, and one dispatch unit. Additionally, there is one RT core in every SM for ray tracing acceleration. Notable here is the addition of dedicated INT32 cores, which allow INT32 and FP32 instructions to execute simultaneously, as seen in Volta.
Simultaneous execution of integer and floating-point work makes more efficient use of the SM, requiring fewer clock cycles to complete the same amount of work. This ability is enabled by the redesigned memory interface of the Turing SM.
Previously, as seen in Pascal, access to the memory subsystem and access to the L1 cache were split. In Turing, NVIDIA has moved to a new unified memory architecture, which allows for a larger L1 cache that is configurable on the fly between a 64KB (+32KB Shared Memory) and a 32KB (+64KB Shared Memory) split. Additionally, the L2 cache doubles from 3MB to 6MB.
The increase (up to 2.7x) in L1 size and addressability results in what NVIDIA is claiming is a 2x increase in L1 hit bandwidth, along with lower L1 hit latencies.
Overall, NVIDIA credits the simultaneous execution of INT32 and FP32 instructions, together with the changes to the memory architecture, with a 50% performance improvement per SM when compared to Pascal.
In addition to the layout changes of the memory subsystem, Turing also sees a different memory interface, in GDDR6.
Operating at 14 Gbps, GDDR6 provides an almost 30% memory speedup when compared to the GTX 1080 Ti, which runs its GDDR5X memory at 11 Gbps, while remaining 20% more power efficient than Pascal GPUs.
Through careful board layout considerations, NVIDIA claims a 40% reduction in signal crosstalk compared to GDDR5X implementations, which is in part how they can achieve such a high transfer rate.
Building upon Pascal’s memory compression techniques, NVIDIA also claims a substantial improvement to memory compression technology with Turing, which when combined with the faster GDDR6 memory results in an overall 50% higher effective memory bandwidth for Turing.
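As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The per-pin data rates (11 and 14 Gbps) and the 50% effective-bandwidth claim come from the text above; the 352-bit bus width is an assumption taken from the published specifications of the GTX 1080 Ti and RTX 2080 Ti, and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin rate times bus width, bits to bytes."""
    return data_rate_gbps * bus_width_bits / 8


BUS_WIDTH_BITS = 352  # assumption: bus width is not stated in the article

gddr5x = peak_bandwidth_gb_s(11, BUS_WIDTH_BITS)  # GTX 1080 Ti
gddr6 = peak_bandwidth_gb_s(14, BUS_WIDTH_BITS)   # RTX 2080 Ti

# With the same bus width, the raw speedup is simply 14 / 11.
raw_speedup = gddr6 / gddr5x

# NVIDIA's claim of 50% higher effective bandwidth, taken at face value.
claimed_effective_speedup = 1.5

# Whatever the faster memory does not explain must come from compression.
implied_compression_gain = claimed_effective_speedup / raw_speedup

print(f"GDDR5X: {gddr5x:.0f} GB/s, GDDR6: {gddr6:.0f} GB/s")
print(f"Raw speedup: {raw_speedup:.2f}x")
print(f"Implied gain from compression: {implied_compression_gain:.2f}x")
```

Under these assumptions, the faster memory accounts for roughly 1.27x of the claimed 1.5x, which would leave about 1.18x to the improved compression.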
Having seen their introduction in NVIDIA’s Volta architecture, Tensor Cores also see some significant changes in Turing.
In addition to supporting FP16, Turing’s Tensor cores add the INT4 and INT8 precision modes. While at the moment there aren’t any real-world uses for the lower precision INT4 and INT8 modes, NVIDIA is hoping these smaller data types will find applications in gaming, where less accuracy may be acceptable as compared to scientific research.
Since the throughput of these different Tensor core modes scales linearly, 110 TFLOPS of FP16 performance translates into 220 TOPS of INT8 and 440 TOPS of INT4, so there are some opportunities for massive speedups in workloads where the precision afforded by FP16 is unneeded.
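The scaling described here amounts to throughput doubling each time the operand width is halved. A minimal sketch of that relationship, using only the 110 TFLOPS FP16 figure from the text (the function and constant names are illustrative, not part of any NVIDIA API):

```python
FP16_RATE = 110  # tera-operations per second at FP16, as quoted in the text
BASE_BITS = 16   # operand width of the FP16 baseline


def tensor_throughput(bits: int, base_rate: float = FP16_RATE) -> float:
    """Peak rate, assuming throughput is inversely proportional to operand width."""
    return base_rate * BASE_BITS / bits


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {tensor_throughput(bits):.0f} tera-ops/s")
```

These are peak figures only; real workloads would also depend on how well the data can be quantized to the narrower types.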
The one all-new type of hardware present in Turing is the RT core. Meant to accelerate ray tracing operations, the RT cores are the key to the RTX 2080 and 2080 Ti’s namesake, the NVIDIA RTX real-time ray tracing API.
From a data structures perspective, one of the most common ways to accomplish ray tracing at the moment is through the use of something called a Bounding Volume Hierarchy (BVH).
While the more brute force attempts at ray tracing would involve calculating whether every individually cast ray intersects with every triangle in the scene, BVHs are a more efficient way to solve this problem.
At a high level, a BVH is a data structure made up of groups of triangles in a given object/scene. Triangles are grouped in a hierarchical structure so that fewer operations are needed to test which triangles are intersected by any given ray.
While BVHs speed up ray tracing on any given hardware compared to more classical brute force methods, NVIDIA has built hardware to specifically accelerate BVH traversal in Turing, in the form of what they are calling RT cores.
Able to run in parallel with other operations of the GPU, the RT cores can perform this BVH traversal while the shaders are rendering the rest of the scene, providing a massive speedup compared to traditional ray tracing methods.
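To make the BVH idea concrete, here is a minimal CPU-side sketch in Python. It is not NVIDIA's implementation: the median-split build, the leaf size, and every name in it are assumptions chosen for brevity, and it only answers "does this ray hit anything?" rather than finding the closest hit. It counts how many ray/triangle tests the BVH needs compared with testing every triangle.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def ray_hits_triangle(orig, d, tri, eps=1e-9) -> bool:
    """Moller-Trumbore ray/triangle intersection test."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:            # ray is parallel to the triangle plane
        return False
    inv = 1.0 / det
    t_vec = sub(orig, v0)
    u = dot(t_vec, p) * inv
    if u < 0 or u > 1:
        return False
    q = cross(t_vec, e1)
    v = dot(d, q) * inv
    if v < 0 or u + v > 1:
        return False
    return dot(e2, q) * inv > eps  # hit must lie in front of the ray origin


@dataclass
class Node:
    lo: tuple                      # min corner of the bounding box
    hi: tuple                      # max corner of the bounding box
    tris: Optional[List] = None    # set on leaves only
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(tris, leaf_size=4) -> Node:
    """Build a BVH by splitting at the median along the longest box axis."""
    pts = [p for t in tris for p in t]
    lo = tuple(min(p[i] for p in pts) for i in range(3))
    hi = tuple(max(p[i] for p in pts) for i in range(3))
    if len(tris) <= leaf_size:
        return Node(lo, hi, tris=tris)
    axis = max(range(3), key=lambda i: hi[i] - lo[i])
    tris = sorted(tris, key=lambda t: sum(p[axis] for p in t))
    mid = len(tris) // 2
    return Node(lo, hi, left=build(tris[:mid], leaf_size), right=build(tris[mid:], leaf_size))


def ray_hits_box(orig, d, lo, hi) -> bool:
    """Slab test: intersect the ray with the box one axis at a time."""
    t_near, t_far = 0.0, float("inf")
    for i in range(3):
        if abs(d[i]) < 1e-12:      # ray parallel to this pair of planes
            if orig[i] < lo[i] or orig[i] > hi[i]:
                return False
            continue
        t0, t1 = (lo[i] - orig[i]) / d[i], (hi[i] - orig[i]) / d[i]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def traverse(node, orig, d, stats) -> bool:
    """Any-hit traversal: skip whole subtrees whose box the ray misses."""
    if not ray_hits_box(orig, d, node.lo, node.hi):
        return False
    if node.tris is not None:
        hit = False
        for tri in node.tris:
            stats["tri_tests"] += 1
            hit = ray_hits_triangle(orig, d, tri) or hit
        return hit
    hit_left = traverse(node.left, orig, d, stats)
    hit_right = traverse(node.right, orig, d, stats)
    return hit_left or hit_right


# Demo scene: 2000 small random triangles scattered through a 100-unit cube.
random.seed(0)
def small_triangle():
    c = [random.uniform(0, 100) for _ in range(3)]
    return tuple(tuple(c[i] + random.uniform(-1, 1) for i in range(3)) for _ in range(3))

scene = [small_triangle() for _ in range(2000)]
root = build(scene)
origin, direction = (50.0, 50.0, -10.0), (0.0, 0.0, 1.0)

stats = {"tri_tests": 0}
bvh_hit = traverse(root, origin, direction, stats)
brute_hit = any(ray_hits_triangle(origin, direction, t) for t in scene)
print(f"brute force: {len(scene)} triangle tests, BVH: {stats['tri_tests']} triangle tests")
```

The BVH gives the same answer as the brute force loop while touching only the triangles in boxes the ray actually passes through. The box tests and tree walk in `traverse` are the kind of work the article describes the RT cores taking over from the shaders.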
The metric that NVIDIA has come up with to quantify ray tracing performance is the idea of a “Giga Ray.” Through the use of RT cores, NVIDIA is claiming a 10X speedup from the GTX 1080 Ti to the RTX 2080 Ti, from just over 1 Giga Rays to 10 Giga Rays.
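To put the Giga Ray figure in perspective, it can be converted into a per-pixel ray budget. The 10 Giga Rays per second comes from NVIDIA's claim above; the resolutions and the 60 FPS target are assumptions picked for illustration, and the result treats the quoted figure as a sustained rate, which real scenes are unlikely to reach.

```python
def rays_per_pixel_per_frame(giga_rays_per_s: float, width: int, height: int, fps: int) -> float:
    """Average number of rays available for each pixel of each frame."""
    return giga_rays_per_s * 1e9 / (width * height * fps)


# Assumed targets: 4K and 1080p at 60 frames per second.
budget_4k = rays_per_pixel_per_frame(10, 3840, 2160, 60)
budget_1080p = rays_per_pixel_per_frame(10, 1920, 1080, 60)

print(f"4K @ 60 FPS:    {budget_4k:.1f} rays per pixel per frame")
print(f"1080p @ 60 FPS: {budget_1080p:.1f} rays per pixel per frame")
```

Because the budget is divided across pixels, quadrupling the pixel count from 1080p to 4K cuts the rays available per pixel to a quarter.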
The GPU instructions are abstracted through the RTX API, which is compatible with both Microsoft’s DirectX Raytracing (DXR) API, and Vulkan’s Ray Tracing API (soon to come). As long as the developer can generate an appropriate BVH for the given scene, the Turing-based GPU will handle the calculation of ray intersections, which the game engine can then act upon in rendering the scene.
However, while NVIDIA has shown several impressive demos of RTX Ray tracing technology implemented in games such as Shadow of the Tomb Raider and Battlefield V, whether or not developers will implement DXR/RTX into their games remains a big question.
Please focus your 2080 review
Please focus your 2080 review on the comparison between it and the 1080 Ti, not the 1080, as the 2070 is more than likely the true replacement for the 1080. Additionally, please include 4K non-HDR results in your comparison. I have my 1080 Ti running at 2,139/12,000 under water and am very curious to see if the 2080 can match my card in conventional games at 4K non-HDR. Looking forward to your results.
There’s only one metric by
There’s only one metric by which the 20-series should be compared to the 10-series: PRICE.
Also, this deep-learning AA stuff should be DISABLED for raw benchmarking. If NVIDIA wants to sell us on RTX, they can’t pretend that we only play 7 games they have released profiles for.
Cherry-picked BS won’t be tolerated. Don’t pull a Tom’s Hardware “just buy it” move.
Look a dumb@$$ AMD fanboy
Look a dumb@$$ AMD fanboy trying to dictate the terms of a review.
If DLSS is AA and not upscaling, it should be turned on and compared to AMD running comparable quality AA. Sorry if the numbers will look bad for Vega.
so, where exactly did the
so, where exactly did the other guy mention or even imply AMD fanboyism?
Calling out a company for BS tactics is exactly just that: calling out a company for BS tactics.
Considering I only have
Considering I only have Intel/NVIDIA setups in my house, I’m hardly an AMD fanboy. BS is BS, no matter who’s selling it. Raw performance for the dollar is the only real factor that matters. NVIDIA is pushing ray tracing (very little support) and upscaling tricks (also minimal support) in order to distract from the fact that they are offering – at best – 30% more performance for 80% more money. Like I said, BS.
Agree. The RTX 2080 is ~20 %
Agree. The RTX 2080 is ~20 % more expensive while providing only a ~5 % performance bump compared to the GTX 1080 Ti.
I suspect nVidia’s goal is to make the GTX 1080 Ti look cheap while it’s not.
The World Health Organization is right, video gaming is an addiction like tobacco, casinos, etc.
If only price should be
If only price should be compared, go get a free GPU from a trash recycler. Because it’s free, the price is zero, and the price/perf is for all intents and purposes, infinite. Price, performance, and price/performance all matter.
Test all of the new stuff but it’s probably not a good idea to buy based on future technologies. AMD users who counted on DX12 giving them a big jump later a la FineWine learned this the hard way, and now Nvidia users should learn from that and not make the same mistake.
Conversely it’s idiotic to dismiss DLSS/Ray Tracing when the reviews aren’t even out yet.
I would hope that any
I would hope that any competent reviewer would run benchmarks for both supported and unsupported games. And then show the benchmarks for supported games with it on and off.
Nice overview of the
Nice overview of the internals, but how about some actual simple benchmark comparisons with previous gen cards?
Don’t you think Ken would
Don’t you think Ken would have included a performance evaluation if he could? They’re under embargo and were only allowed to talk about the architecture. If you check out the other major websites it’s the same situation everywhere.
What everyone is waiting for,
What everyone is waiting for, to see where they end up for example here:
Are the AIB vendors going to
Are the AIB vendors going to lock down the Nvidia OC scanner feature to only their own cards? At launch Precision X’s scanner worked with my 1080FE only to have later versions say that my card was not supported as it was not “EVGA”.
Did the MSI tool have this restriction?
The previous Precision X
The previous Precision X Scanner was a feature that EVGA implemented and only worked with EVGA cards as you noticed.
NVIDIA Scanner will work with all GPUs, no matter the vendor and can be implemented into any of the NVAPI applications like Afterburner, and software from ASUS, Gigabyte, etc.
“The NVLink interface now
“The NVLink interface now handles Multi-GPU (SLI)”
I would not call NVLink “SLI” any more than I would call AMD’s Infinity Fabric/xGMI “CF”, as there is more to NVLink than just some SLI-type, driver-only managed Multi-GPU. NVLink has more hardware-based cache coherency protocol communication capabilities for Nvidia’s GPUs (Power9s to Nvidia GPUs also), and that’s also true for AMD’s Infinity Fabric/xGMI interface, which is supported on both Zen and Vega. Both NVLink and Infinity Fabric offer more direct processor (CPU to GPU and GPU to GPU) cache-to-cache coherency signaling capabilities than any SLI/CF driver-managed multi-GPU could ever hope to achieve.
I think that both NVLink and Infinity Fabric will allow multiple physical GPUs to appear more like a single larger logical GPU to drivers and software. And this IP has the potential for future modular die-based offerings that both AMD and Nvidia are researching for multi-die module-based GPUs on future products.
Also, in the DX12/Vulkan driver model the GPU drivers are simplified and closer to the metal, with any Multi-GPU load balancing given over to the game/gaming engine developers, and SLI/CF are deprecated IP that are not going to be used for DX12/Vulkan gaming. Both DX12 and Vulkan have Explicit Multi-GPU Adapter IP in their respective APIs that’s managed via these graphics APIs and the game/gaming engine software that makes use of DX12/Vulkan.
You are also missing the Integer performance (INT32) on that chart where you only list the FP32 performance. It looks like Nvidia may have begun to release more whitepapers, and there are probably some patent filings to go over to get some idea of just what Nvidia has implemented in its RT core hardware. That Tensor Core AI-based Denoising needs a deep dive also, and I think that Nvidia does the AI algorithm training on its massive Volta clusters and then loads that trained AI onto the AI cores on Turing, so that process needs a deep dive as well.
I’d expect a continuous refinement of the Denoising AI over time in addition to the DLSS AI-based Anti-Aliasing/other algorithm training. Tensor Core AI-based sound processing and even compression, physics/other AIs are also possible once there are hardware-based tensor cores to help speed up the process over any more software-based AI solutions.
Also for reference is Jeffrey A. Mahovsky’s Thesis Paper(1) on Reduced Precision Bounding Volume Hierarchy (BVH). He is often quoted in many others’ newer papers on the subject.
“THE UNIVERSITY OF CALGARY
Ray Tracing with Reduced-Precision Bounding Volume Hierarchies
Jeffrey A. Mahovsky”
How about a direct link to
How about a direct link to the Nvidia whitepaper if possible as there is a lot of material to be covered.
Techreport’s article has explained BVH in an ELI5 manner in their writeup, but they do have a copy of the Nvidia whitepaper, so if Nvidia has published it on their website a link to the whitepaper would be helpful.
and here it is(1):
“NVIDIA TURING GPU
Graphics Reinvented”<--* *-->[see that phrase that’s marketing’s dirty hands right there, but still the whitepaper is very informative as Nvidia’s whitepapers usually are]
Anyone has any idea (or
Anyone has any idea (or educated guess) if it will be possible to use RayTracing and DLSS at the same time? I mean, they both use tensor cores so would they be competing for resources? Will there be enough tensor cores to do both?
Raytracing uses RT cores.
Raytracing uses RT cores. It only uses the Tensor cores for Denoising, which happens at a different “stage”. Nvidia put out a chart showing when each core is active. But the answer is yes.
There is a link to the Nvidia
There is a link to the Nvidia Turing whitepaper right above and why do you not go and read that and then ask questions. That’s where all the Online Tech “Journalists” got their material for their articles on Turing.
If you want some better explanations go over to TechPowerUp’s and the TechReport’s articles on Turing as they are doing a better ELI5 treatments. Don’t bother with Anandtech’s article as you will spend more time swatting at the annoying auto-play ads than you will spend trying to read!
And teach your children to go over to the local College Library and read the Proper Academic and Professional Trade Journals that are usually paywalled online. Most colleges with computer science departments have the proper subscriptions paid to the online Academic and Professional Trade journals that can be accessed via the college library’s web address if you use the library’s available PCs/terminals or maybe even wifi. LexisNexis is a million times better than Google.
P.S. some College Libraries are Student Only, but if the College Library is an official Government Document Depository Library/Federal Depository Library Program (FDLP) member then that library has to be open to the public by law. The local State University/Junior College libraries are the most open, but in large Metropolitan Areas (the Northeast mostly) the homeless have ruined the public access to many private Universities’ Libraries!
But Ray Tracing is done on Turing’s RT cores and the AI is done via the Trained AI running on the Turing Tensor Cores, so that implies some possible concurrent Ray Tracing (on the RT cores) and Denoising on the Tensor cores, as they are different sets of functional blocks on the GPU. The Tensor cores can be used for all sorts of AI-based processing including that DLAA, and even audio processing can be done on Tensor Cores. Tensor Cores are just hardware-based matrix math units anyways!
One of the best College electives that I have ever had was a 1-credit hour per semester Library Science Class. That curriculum was offered over a few semesters with different Library Science research and categorization methods learned with each different 1-credit Hour/Semester LS101/LS102 curriculum, and that made my research process so much more productive.
Right. No college for my
Right. No college for my kids, don’t live in US or any other country that has college for that matter.
About Ray Tracing – I was thinking that since it uses tensor cores for AI acceleration and DLSS is using it to decode and apply hints for upscaling content (it was stated that the DLSS is running purely on tensor cores and that cuda cores send completed picture to tensor cores to be “upgraded” and then back for further processing down the pipeline). We are all reading about how compute demanding RT is and tensor cores are used for (I think, correct me if I’m wrong here) ray collision detection (or is that one done in RT cores?) and again for AI accelerated denoising – I supposed that tensor cores would be Very Busy with work and may not like the contention for resources. Unless DLSS is actually very light on workload or somehow they are working at different times? (wonder how would that work with consecutive frames going through pipeline).
Several questions for your
Several questions for your review:
RTX – is raytracing performance independent of resolution? I can’t understand why the number of rays traced would change due to resolution.. seems like it should be based on number of lights and number of objects.
Is RTX just on/off or will we have 3 or 4 major features you can toggle? Will there be low, medium, high settings?
DLSS – many people are claiming the speedup comes from rendering at a lower resolution and upscaling. If this is the case PLEASE make sure to compare TRUE 1080p to TRUE 1080p and TRUE 4k to TRUE 4k. Don’t fall for marketing BS. HOWEVER, IF this really is just a superior form of AA with no overhead cost please make sure to say that LOUDLY to shut up all the haters.
Raytracing is done for every
Raytracing is done for every pixel on the screen so it is very dependent on resolution. From every light source a ray is traced to every pixel unless it is directional light then only the subset of pixels affected by the light source is directly traced for.
Then there are all reflections from the pixels that are lighted up by the light source – the complexity rises dramatically.
RTX will have options, quality of shadows, light, reflections and so on based on how many rays are traced and then extrapolated, but there is a lowest limit to how many rays are actually traced because if not enough are traced, the AI that extrapolates them will fail.
DLSS – as far as Nvidia explained it, it is rendered at target resolution and then AI is used to apply the precalculated cues for textures and objects of how they should look in the “perfect world”. So 4k with TAA should be compared with 4k with DLSS – just as Nvidia is doing on their slides.
People clearly don’t
People clearly don’t understand that there is still an NDA in place and all the questions you are asking cannot be answered without severe ramifications until the day everyone and their mother all drops full benchmark articles/videos on the same day like every other release. I’m sure a good portion of those are already made or written and just waiting to go on whatever day the NDA lifts.
This is true and some other
This is true and some other websites have stated that fact, and really the FTC should require that all NDA deadlines be published in advance by any device/processor/whatever makers. This is so consumers can know when the data will be available for the consumer to make an educated purchasing decision!
That Tom’s Hardware USA “Just Buy It” nonsense should have earned Tom’s Hardware USA a big fat fine! But that is the nature of the unregulated online “Press” that’s not held to the same standards as the print and over-the-air TV industries are! Online can be found the same hucksters, grifters, and snake oil salesmen that have been properly regulated out of the Print/TV media by the FCC/FTC for decades.
Most of the “questions” I am
Most of the “questions” I am seeing asked, seem to be more for what they want in the review of the card. Not immediate responses now.