Application Profiling Tells the Story
We might have found another, and more relevant, data point to answer the 1080p gaming concerns for Ryzen.
It should come as no surprise to anyone who has been paying attention over the last two months that the latest AMD Ryzen processors and architecture are getting a lot of attention. Ryzen 7 launched with a $499 part that bested Intel's $1000 CPU in heavily threaded applications, and Ryzen 5 launched with great value as well, positioning a 6-core/12-thread CPU against quad-core parts from the competition. But part of the story that permeated both the Ryzen 7 and the Ryzen 5 launches was the situation surrounding gaming performance, in particular 1080p gaming, and the surprising delta that we see in some games.
Our team has done quite a bit of research and testing on this topic. This included a detailed look at the first asserted reason for the performance gap, the Windows 10 scheduler. Our summary there was that the scheduler was working as expected and that minimal difference was seen when moving between different power modes. We also talked directly with AMD to get its then-current stance on the results; the company backed up our claims on the scheduler and presented a better outlook for gaming going forward. When AMD wanted to test a new custom Windows 10 power profile to help improve performance in some cases, we took part in that too. In late March we saw the first gaming performance update, courtesy of Ashes of the Singularity: Escalation, where an engine update to utilize more threads resulted in as much as a 31% increase in average frame rate.
As a part of that dissection of the Windows 10 scheduler story, we also discovered interesting data about the CCX construction and how the two modules on the 1800X communicate. The result was significantly longer thread-to-thread latencies than we had seen on any platform before, and it was because of the fabric implementation that AMD integrated with the Zen architecture.
This led me down another rabbit hole recently, wondering if we could further compartmentalize the gaming performance of the Ryzen processors using memory latency. As I showed in my Ryzen 5 review, memory frequency and throughput correlate directly with gaming performance improvements, on the order of 14% in some cases. But what about memory latency alone?
At the outset of the Ryzen product rollout, AMD cautioned media that some traditional synthetic benchmark applications might need to be updated to properly show the performance of a brand-new, ground-up architecture like Zen. Included in that are applications like SiSoft Sandra and AIDA64 that look at memory bandwidth, latency, cache speeds, etc. But in truth, based on my conversations with several benchmark developers, memory latency testing is one of the more straightforward tests. (Even SiSoftware felt confident enough in its testing to write an editorial evaluating Ryzen performance in April.)
There are three tests in the Sandra suite for memory latency: full random, in-page and sequential. Sequential testing shows the lowest latencies because it measures access times to numerically sequential memory locations; those accesses are easily prefetched by a modern CPU and thus served from cache. Full random testing is exactly what it says as well – a fully random memory walk that defeats the TLB and prefetch systems, resulting in the worst-case scenario for memory latency.
In-page testing is more complex in that it attempts to strike a balance between full random and sequential. Ryzen and Kaby Lake can map about 6MB of memory (1,536 4K pages) in the TLB, but as soon as the application references more than that, each access will miss the TLB and force a page walk, adding more memory accesses and TLB miss cost to the latency. With the in-page test, Sandra attempts to minimize page walks by randomly accessing data within a smaller-than-TLB window, then moving on to another full window. This ensures the latency test performs a page walk only once per window rather than once per access.
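To make the distinction between these access patterns concrete, below is a minimal sketch of the pointer-chasing approach such latency tests use; it is not Sandra's actual implementation, and the working set size and step count are arbitrary choices for illustration. Each load depends on the result of the previous one, so the CPU cannot overlap the misses: linking the chain sequentially gives the prefetch-friendly case, while shuffling it into a random cycle approximates the full random walk.

```c
/* Minimal pointer-chasing latency sketch (POSIX; build with e.g. gcc -O2 chase.c).
 * Not Sandra's actual code; sizes and step counts are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)(64 * 1024 * 1024) / sizeof(size_t))  /* ~64 MB working set, well past the caches */

static double chase(const size_t *next, size_t steps)
{
    struct timespec t0, t1;
    size_t i = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = next[i];                         /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (i == (size_t)-1)                     /* keep 'i' live so the loop isn't optimized away */
        puts("impossible");
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;               /* average latency per access */
}

int main(void)
{
    size_t *next  = malloc(N * sizeof(size_t));
    size_t *order = malloc(N * sizeof(size_t));
    if (!next || !order) return 1;

    /* Sequential chain: element i points to i+1 (prefetcher- and TLB-friendly). */
    for (size_t i = 0; i < N; i++)
        next[i] = (i + 1) % N;
    printf("sequential: %6.1f ns/access\n", chase(next, (size_t)1 << 25));

    /* Random chain: shuffle the visit order, then link it into one big cycle.
       rand() is a crude randomness source, but fine for an illustration. */
    for (size_t i = 0; i < N; i++)
        order[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < N; i++)
        next[order[i]] = order[(i + 1) % N];
    printf("random:     %6.1f ns/access\n", chase(next, (size_t)1 << 25));

    free(order);
    free(next);
    return 0;
}
```

On a typical desktop part the sequential chain reports a handful of nanoseconds per access while the shuffled chain lands in DRAM-latency territory, which is the gap the Sandra results below are measuring.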
This graph shows the results from SiSoft Sandra's memory latency test as well as the Intel Memory Latency Checker, which I tossed in for good measure. Clearly, the actual memory latency of an AMD Ryzen processor is higher than that of Intel Kaby Lake.
The Ryzen 7 1800X is slower in all three methods from Sandra, but is proportionally slowest on the in-page result, coming in 3.6x slower than the Core i7-7700K. By comparison, under the full random scenario, the Ryzen 7 1800X is 56% slower; even on the sequential test, the Ryzen part is 45% slower. The Intel Memory Latency Checker puts the comparison somewhere between SiSoft Sandra's full random and in-page results, with the Ryzen 7 1800X reporting roughly twice the latency (92% higher) of the 7700K.
These numbers are an improvement over the launch results that many media reviews were seeing with Ryzen 7. AMD worked closely with the motherboard vendors to find ways to optimize the BIOS and default settings to improve memory efficiency and roundtrip latency. This was something that was ongoing from the launch day of Ryzen 7, through the Ryzen 5 release, and it honestly continues today. AMD still wants to bump up default supported memory speeds (which will, by definition, improve memory latency) and help spread the knowledge that buying faster than DDR4-2400 memory is the best course of action for AMD Ryzen buyers.
It is also worth noting that I ran these tests with the Ryzen 7 platform at slightly tighter / faster timings, though both running at 2400 MHz. The same Corsair memory was running at a 1T command rate on the AMD system while we had it set to 2T on Intel. (This was simply a result of out-of-box settings with this memory on each platform and mirrors the settings we used in our initial Ryzen 7 and Ryzen 5 processor reviews in March.)
Using Intel vTune to Measure Application Sensitivity
With that data in hand, I wanted to profile different applications and games to determine how much of an impact we would expect memory bandwidth or latency to have on them. Intel's vTune application is built for exactly this – it runs counters in the background and measures the impact of each instruction and memory request on the system. Intel vTune is used by software developers to see how their applications perform and to optimize them. Getting the best information out of this kind of tool requires very specialized, architecture-specific hooks into the hardware performance counters, and because of that, vTune does not work on Ryzen processors. And since AMD still doesn't have a publicly available toolset to optimize for the Zen architecture or to provide the kind of data that vTune provides, we have to limit part of our exploration to Intel platforms.
In this example result page you can see essentially any metric you would like to gauge, and if you dig down deeper, you can analyze any application on a per-function, per-instruction basis. While I don't have the time (or the background) to detail everything, there are interesting results to see. Anything in red font is assumed to have a negative effect on application performance, though to what degree depends on any number of factors.

Take the CPI rate (cycles per instruction) as an example, an average result across the 120 seconds of captured system profiling. The 1.323 result shown here, from one of our tested games, is considered poor; Handbrake, for example, comes in at 0.755, executing more than one instruction per clock on average (IPC ≈ 1/0.755 ≈ 1.32). Looking under the processor back end at the memory bound rate, a result of 45% indicates that in 45% of processor clock cycles the memory system was the limiting factor for performance. Narrowing it down further, memory latency shows a 29% sensitivity, meaning 29% of the active clock cycles were spent waiting on a memory request to return. While we do not expect this to ever be zero, we found that game workloads tend to show a much higher dependency on memory latency than most other benchmarked application workloads.
The CPU usage histogram is also interesting to look at in benchmarks and games. In this title, the average CPU utilization is just under 4 threads, indicated as “poor” with the default Intel vTune profile. You would like to see this weighted more towards the right of the graph, indicating that all 8 threads of the Core i7-7700K are being utilized for that particular workload.
While slightly out of order, this seems like a good point to mention an important characteristic of memory latency. Depending on simultaneous multi-threading utilization and coding practices, you can "hide" memory latency by distributing work in such a way that the system is rarely waiting on an outstanding memory request at any given point. Lower core/thread utilization is not always indicative of higher sensitivity to memory latency, but a program that is optimized to use threads efficiently on a given architecture is more likely to be insulated from any memory latency deficiencies of a given processor design. For an application to avoid dependency on memory latency, it would have to focus on data layout optimizations, access patterns and even software prefetching (a rough sketch of the prefetching idea follows below). These are non-trivial design goals and would likely require a lot of development effort given the complex data structures found in games.
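To illustrate that last point, here is a hedged sketch of software prefetching; process_node() and the index-driven traversal are hypothetical stand-ins rather than code from any real engine. The idea is simply to request data several iterations ahead of where it is needed, so the memory access overlaps with useful work instead of stalling the core on every miss.

```c
/* Illustrative sketch only: hiding memory latency with software prefetch.
 * process_node() and the index-driven traversal are hypothetical stand-ins. */
#include <stddef.h>

void process_node(float *v);                 /* hypothetical per-element work */

void walk(float *data, const size_t *idx, size_t n)
{
    const size_t ahead = 8;                  /* prefetch distance; tune per architecture */
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&data[idx[i + ahead]], 0, 1);  /* GCC/Clang builtin: read, low temporal locality */
        process_node(&data[idx[i]]);         /* the prefetch issued 'ahead' iterations ago has (hopefully) pulled this line in */
    }
}
```

Picking the prefetch distance is itself a tuning problem: too short and the data still isn't there in time, too long and it may be evicted before use, which is part of why this kind of optimization is non-trivial for game data structures.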
What does a range of applications and games, typical of those used in reviews, show when looking specifically at the memory latency sensitivity of the workload through Intel vTune?
Games, with the lone exception of the Civilization 6 graphics test, show a fairly high memory latency dependency, ranging from 21.7% to 29.3%. In comparison, general applications, with the exception of WinRAR, show very low latency dependency, in the low- to mid-teens. WinRAR is an interesting example, as this specific workload uses a dictionary file that is quite large, repeatedly exceeding the TLB coverage of the Kaby Lake processors.
Interesting..
maybe the architecture’s ability to scale so well was seen as being more advantageous than latency
not sure it is “hiding” as opposed to taking advantage of its strengths, which are clearly about the future of software, or so it seems to me
goes back to what we already knew: ryzen gaming performance gets better with thread counts, just like sheets
it is all about trade-offs, and making the right ones looking ahead
what i get from this piece is a very subtle, or not so subtle, attempt to disparage yet another new amd part, but that could be from my strong desire to see amd flourish for the benefit of all concerned
in any case, no matter how you want to frame it, ryzen is an amazing achievement and a great cpu and will only get better
yeah if I were to buy a CPU today it would be the 1600X. It’s really quite good.
"Hiding latency" is not meant as a criticism to AMD. It is a standard industry term to reflect the idea of using computing and threading to minimize the negative effects of memory interfaces.
Most things that make a CPU complicated are involved in trying to hide latency. The whole point of a cache is to hide latency; it's basically the only reason it's there.
Interesting article. I am still fairly new to this site, but you guys have put out some solid ryzen coverage.
One thing I would like to request…when vega releases and you do a review of it, could you also test AMD+intel CPUs when comparing vega to whatever cards you choose from nvidia?
The reason why I would like to see this is because I’ve seen a few cases where Nvidia cards just don’t work well on ryzen. Check this out for example, and look at the difference between the 1060 and the 480
http://www.anandtech.com/show/11244/the-amd-ryzen-5-1600x-vs-core-i5-review-twelve-threads-vs-four/14
I have also seen this behavior reported in rise of the tomb raider. I would really like to see an article which investigates this to see if it’s an actual problem or just a few edge cases.
Yeah, it's something we are considering. Especially in light of the Ryzen launch.
Good read! I would’ve appreciated even some speculation on where the difference in memory latency/performance between platforms comes from. Is it easily remedied by, say, increasing the frequency of the dedicated silicon in the next generation or is it more complicated than that?
Good point.
I would expect some improvements in the fabric to come in the second generation. If possible, something as simple as increasing the clock rate at which it runs (currently half of memory speed) would help.
Latency is the hard problem to solve. It's much, much easier to get good bandwidth.
Your line graph could use a little explanation or a caption. I get it now after staring at it for 2 minutes, and it is an excellent piece of data, but unless you already knew that the processor is a Zeppelin die it wouldn't make any sense.
It’s starting to become painfully obvious how much outdated game engines are holding back progress in the CPU/GPU industry. It’s almost pathetic that a tiny private company (Stardock) can create a core-neutral engine that takes great advantage of explicit APIs and includes multi-GPU support while big-name AAA publishers are still screwing around with overhauled DX9 engines that have been dragged into the DX11 era.
Could one test have been done with more data points at 2400?
Like seeing the impact of CAS 10 to 18 in one game sensitive to latency (like Hitman).
Also, AMD said they reduced latency by 6ns in the last microcode update… did that have ANY impact on performance?
Side note: I have yet to see other sites do in-depth analysis like this. vTune can indeed tell you a lot.
I wonder if some of the game developers even know this tool exists…
@Ryan
last page
Ashes of the Singularity is going to be the poster child for this going forward but I hope that Intel is working with the other major vendors (UE, Unity) to implement similar changes.
Intel>>>AMD?
Whoops, yes!
@ Ryan, was there any correlation between average fps and latency dependency between the various games?
To me, latency issues are very dependent on frame rates. Lower frame rates allow for much more latency hiding. In other words, when the benchmarks run at very high fps, then performance should be more sensitive to latency.
This also brings me to the question of reasonable fps. Even if a system can run 200fps in a benchmark, is it still a reasonable benchmark? Most users will not run a game at 200fps, but instead will increase the quality of the game (twitch fps shooters excluded). And thus, a wider architecture with more latency may actually be a better choice, given it can hide its latency in a 60-100fps scenario.
In short, I’m not sure, but I’m not confident with conclusions based on very high fps benchmarks.
Very nice write-up on this. Thank you.
I have played a lot with my own system settings and memory timings to try to eke out every bit of FPS I can from my system.
I am running an i7 2700K OC’d to its max and my memory is topped out at 2133MHz, the max the Sandy supports. By using the FSB to advance the memory a bit I got it to DDR3-2200. I played with timings for a long time until I was 100% stable with the max bandwidth I could get from this system. AIDA64 memory tests show my latency around 42.6ns now and my bandwidth about 32-35GB/s across the read, write and copy tests.
With my system settings, CPU above 5GHz and the memory settings I described, I have found that games run very smooth and I do not notice stutters except in poorly coded games. So you are 100% right, games do like latency as low as possible on the memory subsystem.
The gains in min & Avg is well worth the extra effort of tuning the system. Now if AMD can tune their Ryzen’s CPU’s with firmware bios updates I see then getting some FPS gains as well but I would not expect them gen of Ryzen to match Intel’s kady-lakes in the FPS department @ 1080p Ryzen also has a large Clock rate challenge as well when pitted up against the 7700K CPU.
Very curious about memory sensitivity on CPU bound games.
“As a part of that dissection of the Windows 10 scheduler story, we also discovered interesting data about the CCX construction and how the two modules on the 1800X communicated. The result was significantly longer thread to thread latencies than we had seen in any platform before and it was because of the fabric implementation that AMD integrated with the Zen architecture.”
Any platform? What about Core2Quad?