NUMA and UMA – the memory locality concern
The structure and design of Threadripper create an interesting situation for AMD. While having two Zen dies on a single CPU works, it means that memory controllers and cores are distributed across the package, and communication between them incurs higher latency in some instances. Users who are familiar with the intricacies of NUMA on multi-socket systems are already aware of what this means.
To quote from Wikipedia: “Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.”
Essentially, because the 8 cores on die 1 (as an example) can access the memory attached to the controller on the same die more quickly than they can access the memory on the controller on die 2, it makes sense in some cases for the operating system and applications to be aware of that and to adjust the workload accordingly. But not always.
If you have ever dealt with multi-socket systems (which Threadripper closely emulates) and have had to work around non-NUMA-aware applications, you will know the potential headache it can cause.
Luckily for us, AMD thought ahead on this one and has enabled two different memory access modes to help address any issues of performance or compatibility that might arise from the memory design of Threadripper. Set in either the BIOS or in the Ryzen Master software, AMD offers users a Distributed (UMA) or a Local (NUMA) memory mode.
- Distributed Mode
- Distributed Mode places the system into a Uniform Memory Access (UMA) configuration, which prioritizes even distribution of memory transactions across all available memory channels. Distributing memory transactions in this fashion improves overall memory bandwidth and performance for creative applications that typically place a premium on raw bandwidth. This is the default configuration for the AMD Ryzen™ Threadripper™ CPU, reflecting its primary purpose as a creator’s powerhouse.
- Local Mode
- Local Mode places the system into a Non-Uniform Memory Access (NUMA) configuration, which allows each die to prioritize transactions within the DIMMs that are physically nearest to the core(s) processing the associated workload. Localizing memory contents to the nearest core(s) improves overall latency for gaming applications that tend to place a premium on fast memory access.
In NUMA/Local mode, the system will report to Windows as having two distinct NUMA nodes (0/1) with 8 cores each. The operating system then attempts to keep workloads that share memory on the same node, hoping to reduce the impact of higher latency memory accesses. However, spillover can occur in both memory capacity and thread count. When you exceed the amount of memory attached to the controller on a single NUMA node (say you have 32GB total, 16GB on each die, but your workload uses 20GB), then some memory on the other die will need to be used, at the expense of higher latency. If your application can use more than 16 threads (from the 8 cores on a single Zen die), then it will also spill over onto the other die. This situation is actually worse than the memory spillover, as it means half of the threads will be accessing memory on the OTHER die the entire time (assuming the workload uses less than 16GB in the above example).
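If you want to verify what Windows is actually seeing in each mode, the reported topology is exposed through the Win32 NUMA APIs. Below is a minimal sketch (our own illustration, not an AMD utility) that prints each node's logical processor count and free memory; in Local mode a 1950X should report two nodes of 16 logical processors each, while Distributed mode collapses everything into a single node.

```cpp
// numa_probe.cpp -- print the NUMA topology Windows reports.
// Build (Visual Studio developer prompt): cl /EHsc numa_probe.cpp
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        std::printf("GetNumaHighestNodeNumber failed (%lu)\n", GetLastError());
        return 1;
    }
    std::printf("Highest NUMA node number: %lu\n", highestNode);

    for (ULONG node = 0; node <= highestNode; ++node) {
        ULONGLONG freeBytes = 0;
        GROUP_AFFINITY mask = {};
        GetNumaAvailableMemoryNodeEx((USHORT)node, &freeBytes); // free memory on this node
        GetNumaNodeProcessorMaskEx((USHORT)node, &mask);        // logical processors on this node

        int cpus = 0;
        for (KAFFINITY bits = mask.Mask; bits; bits >>= 1)      // count set bits in the mask
            cpus += (int)(bits & 1);

        std::printf("Node %lu: %d logical processors (group %u), %.1f GB free\n",
                    node, cpus, (unsigned)mask.Group,
                    freeBytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

NUMA-aware applications use this same information (along with calls like VirtualAllocExNuma) to keep allocations on the node where their threads run, which is exactly the behavior Local mode is designed to reward.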
In general, keeping the system in UMA/Distributed mode will result in the best experience for the consumer, especially one who works with highly threaded applications that can utilize the power of the CPU. In this mode, memory is evenly distributed across the memory controllers on both dies, meaning that some threads will still access memory across the die (at a higher latency), but on average the latency will be lower for highly threaded applications.
The primary pain point that AMD hopes to address with the NUMA mode is gaming, where they have identified (as has the community) instances in which games can suffer from the longer latencies associated with threads that happen to be placed across the dies by Windows or the game itself. AMD says that over a testing regimen of more than 75 games, putting a Threadripper system into the NUMA mode nets an average of +5% in average frame rate, with occasional peaks of 10%. Our testing mirrors that finding, though we didn’t have time to go through 75 games.
There is a third mode for users to be aware of as well, though it is not directly related to memory access modes. In Legacy Compatibility Mode, the number of available cores is cut in half, with each die having access to 4 cores on the 1950X (and 3 cores per die on the 1920X). AMD says this will give the Threadripper processors performance equivalent to the Ryzen 7 1800X or Ryzen 5 1600X, though you do so at the expense of half the cores you paid for. (At least until you change the setting and reboot.) If you think you will find yourself in this mode for the majority of the time, you’d be better off saving some cash and just buying that Ryzen 7 1800X processor.
AMD found a few games, notably Dirt Rally and Far Cry Primal, that have bugs preventing the application from loading correctly when more than 20 logical cores are detected. You can either enable this legacy mode to play them or simply disable SMT.
Complications around high core count processors will not be unique to AMD; Intel will deal with the same types of issues when its own 12+ core CPUs hit the market later this year. Intel will not have to deal with significant memory access concerns, though, thanks to its single, monolithic die design. I am interested to see what advantages this may offer Skylake-X.
Testing Core to Core Latency on Threadripper
During the release window of the Ryzen 7 processor, we at PC Perspective used some custom applications to test the real-world latency of various architectures. We found that because of the design of the Zen architecture, with its CCX and Infinity Fabric integration, core-to-core latency was noticeably longer than in previous multi-core designs from both AMD and Intel. Because the two CCXes (core complexes) on each Ryzen die communicate through that fabric, the latency between them is higher than between cores within an individual CCX. The latency between the four cores on each CCX was around 40ns, while the latency between any two cores on opposing CCXes was near 140ns. It was because of this latency that 1080p gaming and other similar, latency-dependent workloads took a hit on Ryzen that they did not exhibit on other CPUs.
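The exact application we used isn't reproduced here, but the underlying technique is simple enough to sketch: pin two threads to chosen logical processors and bounce a flag back and forth through a shared cache line, averaging the round-trip time. The example below is a rough illustration under those assumptions (Windows affinity APIs, and placeholder CPU numbers whose mapping to cores, CCXes, and dies must be verified on the system under test); it is not our actual tool.

```cpp
// c2c_pingpong.cpp -- rough core-to-core latency sketch: two pinned threads
// ping-pong a flag through a shared cache line and we average the hop time.
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

static std::atomic<int> flag{0};  // the shared cache line being bounced

static void pin_current_thread(DWORD logicalCpu) {
    // Assumes fewer than 64 logical processors in group 0 (true for a 1950X).
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << logicalCpu);
}

int main(int argc, char** argv) {
    // CPU numbers are placeholders -- which pair shares a core, a CCX, or a die
    // depends on how Windows enumerates the logical processors on your system.
    const DWORD cpuA = (argc > 1) ? (DWORD)std::atoi(argv[1]) : 0;
    const DWORD cpuB = (argc > 2) ? (DWORD)std::atoi(argv[2]) : 16;
    const int iterations = 1000000;

    std::thread responder([&] {
        pin_current_thread(cpuB);
        for (int i = 0; i < iterations; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) { /* spin */ }
            flag.store(0, std::memory_order_release);
        }
    });

    pin_current_thread(cpuA);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        flag.store(1, std::memory_order_release);
        while (flag.load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    const auto stop = std::chrono::steady_clock::now();
    responder.join();

    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    // Each iteration is a full round trip (two hops), so halve it for one-way latency.
    std::printf("CPU %lu <-> CPU %lu: ~%.1f ns per hop\n",
                cpuA, cpuB, ns / iterations / 2.0);
    return 0;
}
```

Running that sort of loop for every pair of logical processors is what produces the latency charts discussed below.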
With Threadripper (and EPYC, actually), AMD has another potential hop of memory latency between threads running on different physical dies. Let’s see what that looks like.
Okay, there’s a lot going on here, and it is reasonable to assert that it’s nearly impossible to follow every line or every data point the graph showcases. What is most important to understand is that there are four distinct levels of latency on the Threadripper CPU: same-core, same-CCX, cross-CCX (same die), and cross-die. When running at DDR4 2400 MHz memory speeds (which directly relates to the speed of the Infinity Fabric), the latency for threads sharing the same core is ~21ns and for threads on the same CCX about ~48ns. When we cross from one CCX to another on the same physical die, latency jumps to ~143ns, identical to what we measured on the Ryzen 7/5/3 family of CPUs. However, once memory accesses need to cross from one die to the next, latency jumps to over 250ns.
Increasing the memory speed to 3200 MHz shows considerable decreases in latency. Our four latencies drop to 20ns for same-core and 45ns for same-CCX; these gains are smaller, as those paths aren’t impacted as much by the Infinity Fabric implementation. Crossing from CCX to CCX, though, latency drops to 125ns (14% faster), and going from die to die shows a latency of 203ns (23% faster). These are significant performance gains for Threadripper and indicate that we will see advantages from higher clocked memory in multi-threaded workloads that have high memory latency dependencies.
For comparison, here is the same tool run on a dual-socket Xeon E5-2680 v2 platform we happen to have in the office. Based on Ivy Bridge-EP, this 4-year-old machine has surprisingly similar metrics to the Threadripper processor when it comes to memory latency. Notice that there are only three distinct levels of performance (though there is plenty of variance at the top), showing us on-core, on-die, and cross-die results. The QPI interface used to connect the two Intel Xeon processors averages somewhere around 240ns of latency to cross between the two physical sockets.
Finally, here is a look at latency from thread 0 across to thread 31 (limited to keep the graph readable; the Xeon results remain the same after that). The architecture, die layout, and Infinity Fabric design clearly present a unique arrangement of memory for software and OS developers to work with. AMD will continue to fight the issues around memory latency on its platforms, and the move to a multi-die configuration has added one more, still significant, step to that latency.
I’m very curious how the two dies and memory modes will affect virtualization. I’ve only experimented with VMs in the past, but is it possible to run two hexa-core Windows VMs, with an individual memory node assigned to each VM?
Are you setting the Blender tile sizes to 256 or 16/32?
Just wondering since an overclocked 5960x gets 1 minute 30 seconds on the BMW at 16×16 tile size. Significant difference that shouldn’t just be a result of the OC.
For reference: 256 or 512 are for GPU and 16 or 32 are for CPU – at least for getting the best and generally more comparable results to what we get over at BlenderArtists.
When reading is not enough, the mistakes are OVER 9000!
“If you content creation is your livelihood or your passion, ”
” as consumers in this space are often will to pay more”
” Anyone itching to speed some coin”
” flagship status will be impressed by what the purchase.”
” but allows for the same connectivity support that the higher priced CPUs.”
“”””Editor””””
Now just point me to the pages… 😉
Nice to see a review with more than a bunch of games tested. Keep up the good work!
Shouldn’t a test like 7-zip use 32 threads as the max, since that is what is presented to the OS? Right now it only uses 50% of the threads on TR but 80% on the i9-7900X.
Silly performance, looking forward to the 1900X and maybe 1900.
I sometimes wonder why nobody ever points out that within a CCX (4 cores, which can allow a lot of games to run comfortably) Zen has latencies of half those of Intel CPUs. Binding a game to those 4 cores (8 threads, like any i7) has a significant impact on performance. It does not change memory latencies, of course, but core-to-core is much better.
I’m glad someone else noticed this besides myself. I noted this during the Ryzen launch & quickly found that by using CPU affinity along w CPU priority to force my games to run exclusively within 1 CCX & get high CPU processing time on those same cores, I could take advantage of this up to a point.
What all this shows to me is that the OS & game developers’ software need to be revised to better handle this architecture at the core logic level, instead of users/AMD having to provide/use methods to try to do this that cannot be used in a more dynamic fashion. I’ve run some testing on Win 10’s Game Mode & discovered that MS is actually trying to use CPU affinity to dynamically set running game threads to run on the fastest/lowest latency CPU cores to “optimize” game output thru the CPU, but it still tends to cross the CPU CCXs at times if left on its own.
What I’ve found is that by doing this my games run much smoother w a lot less variance, which gives the “feel” of games running faster (actual FPS is the same) due to lower input lag & much better GPU frametime variance graph lines w very few spikes… essentially a fairly flat GPU frametime variance line, which is what you want to achieve performance-wise.
Just to note… my box is running an AMD Ryzen 7 1800X CPU & a Sapphire R9 Fury X graphics card w no OCs applied to either the CPU or GPU.
It’s a step in the right direction but needs more refinement at the OS level……
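For anyone who wants to try the same thing without hunting through Task Manager every launch, here’s a minimal sketch (my own illustration, nothing official) that pins an already-running process to the first eight logical processors and raises its priority; whether mask 0xFF really corresponds to one CCX needs to be verified on your own system.

```cpp
// pin_game.cpp -- pin a running process (by PID) to logical processors 0-7
// and raise its priority, roughly what Task Manager's "Set affinity" does.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc < 2) { std::printf("usage: pin_game <pid>\n"); return 1; }
    DWORD pid = (DWORD)std::atoi(argv[1]);

    HANDLE proc = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!proc) { std::printf("OpenProcess failed (%lu)\n", GetLastError()); return 1; }

    // 0xFF = logical processors 0-7 (assumed here to be one CCX's 4 cores + SMT siblings).
    if (!SetProcessAffinityMask(proc, 0xFF))
        std::printf("SetProcessAffinityMask failed (%lu)\n", GetLastError());

    if (!SetPriorityClass(proc, HIGH_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed (%lu)\n", GetLastError());

    CloseHandle(proc);
    return 0;
}
```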
As expected, performance per dollar is crap in single threaded tasks, which most workloads are. Games don’t even use more than 1 or 2 cores.
Yea games only use 2 cores lol
http://i.imgur.com/Hg3Ev5p.png
And “as expected”, we have yet another Intel shill complaining about gaming performance on a production CPU, which isn’t made for gaming (although it’s not bad in the least, and it has a longer future as devs code for more than the tiny core counts Intel offers under $1,000).
-“performance per dollar is crap in single threaded workloads”…
Well, since these aren’t sold as a single or dual core CPU, performance per dollar as a unit is beyond everything on Intel’s menu.
– “Games don’t even use more than 1 or 2 cores”
Well, I’ve been using an FX-8350 for 2 years now, and I always see all 8 cores loaded up on every single game I play (and I have many). Windows 10 makes use of these cores even when it’s not coded in programs. It would work even better if devs started coding for at least 8 cores, and I believe they will start doing this in earnest now that 8 cores is considered an average core count (unless you’re with Intel).
You would have been better off stating that core vs. core is in Intel’s favor on the 4-core chips and some others, but ironically the “performance per dollar” you mention is superior with AMD… in every way.
What memory are you using, and could you name a 64GB kit that works in XMP? And why 3200MHz over 3600?
Intel is still superior both in raw performance and in perf/$. If you were being objective you wouldn’t have slapped an editor’s choice on this inferior product.
In Handbrake the 1800x is 40% slower than the 1950x and in reverse the 1950x is 67% faster than 1800x.
Open cinebench with a TR or even an 1800x. Show me any Intel chip that can come within 20% of the 1950x. The entire Ryzen 7 lineup is king of the “perf/$” category. 1800x = $365 on eBay right now. Look how close it matches with Intel products that are double the price or worse.
If you want to compare single core perf vs Intel, you can win an argument… at the cost of very high power draw and even worse cash draw. Perf/$ is a dead argument for any Intel fanboy. Find something else. BTW, are you also commenting under “Thatman007” or something? Sounds like the same Intel mouthpiece.
Sorry for necroposting, but it really belongs here:
The recent Meltdown vulnerability and its performance implications on Intel CPUs pretty much leveled the playing field now. After reading the article and all the comments above I opted for a very good B350 motherboard and a Ryzen 1800X to replace my Core i7 5930K (Haswell). The reason is that my CPU will likely be hit very badly performance-wise by the upcoming Windows 10 security update. Intel should pay back 30% to all affected CPU owners, actually…
Another reason is that I likely would not gain anything from NUMA, except for the additional complications. So I opted for the easier to manage (lower) power consumption and less noise from cooling as a result.
Thank you for collecting all the great info.