NUMA and UMA – the memory locality concern
The structure and design of Threadripper create an interesting situation for AMD. While having two Zen dies on a single CPU works, it means that the memory controllers and cores are distributed across those dies, and communication between them has higher latency in some instances. Users who are familiar with the intricacies of NUMA on multi-socket systems are already aware of what this means.
To quote from Wikipedia: Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.
Essentially, because the 8 cores on die 1 (as an example) can access the memory attached to the controller on the same die more quickly than they can access the memory on the controller on die 2, it makes sense in some cases for the operating system and applications to be aware of that and to adjust the workload accordingly. But not always.
If you have ever dealt with multi-socket systems (which Threadripper closely emulates) and have had to work around non-NUMA aware applications, you will know of the potential headache it can cause.
Luckily for us, AMD thought ahead on this one and has enabled two different memory access modes to help address any performance or compatibility issues that might arise from the memory design of Threadripper. AMD offers users a Distributed (UMA) or a Local (NUMA) memory mode, which can be set in either the BIOS or the Ryzen Master software.
- Distributed Mode
- Distributed Mode places the system into a Uniform Memory Access (UMA) configuration, which prioritizes even distribution of memory transactions across all available memory channels. Distributing memory transactions in this fashion improves overall memory bandwidth and performance for creative applications that typically place a premium on raw bandwidth. This is the default configuration for the AMD Ryzen™ Threadripper™ CPU reflecting its primary purpose as a creator’s powerhouse.
- Local Mode
- Local Mode places the system into a Non-Uniform Memory Access (NUMA) configuration, which allows each die to prioritize transactions within the DIMMs that are physically nearest to the core(s) processing the associated workload. Localizing memory contents to the nearest core(s) improves overall latency for gaming applications that tend to place a premium on fast memory access.
In NUMA/Local mode, the system will report to Windows as having two distinct NUMA nodes (0/1) with 8 cores each. The operating system then attempts to keep workloads that share memory on the same node, hoping to reduce the impact of higher latency memory accesses. However, spillover can occur, in both memory capacity and thread capacity. When you exceed the amount of memory attached to the memory controller on a single NUMA node (say you have 32GB total, 16GB on each die, but your workload uses 20GB), some memory on the other die will need to be used, at the expense of higher latency. If your application can use more than 16 threads (the limit of the 8 cores on a single Zen die), then it will also spill over onto the other die. This situation is actually worse than the memory spillover, as it means half of the threads will be accessing memory on the OTHER die the entire time (assuming the workload uses less than 16GB in the above example).
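The memory spillover example above can be worked through as a quick sketch. It assumes a simplified model in which the operating system fills the local node first and the workload touches its whole working set evenly; the function name `remote_memory_share` is hypothetical and used only for illustration.

```python
def remote_memory_share(workload_gb, local_capacity_gb):
    """Split a workload's memory into local and remote parts.

    Assumes (simplification) that the local NUMA node is filled first
    and only the overflow lands on the other die.
    """
    local_gb = min(workload_gb, local_capacity_gb)
    remote_gb = workload_gb - local_gb
    return local_gb, remote_gb, remote_gb / workload_gb


# Figures from the example: 32GB total, 16GB per die, 20GB workload.
local_gb, remote_gb, fraction = remote_memory_share(20, 16)
print(f"{local_gb} GB local, {remote_gb} GB remote "
      f"({fraction:.0%} of the working set pays the cross-die latency)")
```

Under those assumptions, 4GB of the 20GB working set ends up on the far die. The thread spillover case is harsher because every access from the spilled threads is remote, not just a fraction of them.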
In general, keeping the system in UMA/Distributed mode will result in the best experience for the consumer, especially one who works with highly threaded applications that can utilize the power of the CPU. In this mode, memory is evenly distributed across the memory controllers on both dies, meaning that some threads will still access memory across dies (at a higher latency), but on average latency will be lower for highly threaded applications.
The primary pain point that AMD hopes to address with the NUMA mode is gaming, where they have identified (as has the community) instances where games can suffer from the longer latencies associated with threads that happen to be placed across dies by Windows or the game itself. AMD says that over a testing regimen of more than 75 games, putting a Threadripper system into the NUMA mode nets an average of +5% in average frame rate, with occasional peaks of 10%. Our testing is in line with that claim, though we didn’t have time to go through 75 games.
There is a third mode for users to be aware of as well, though it is not directly related to memory access modes. In Legacy Compatibility Mode, the number of available cores is cut in half, with each die having access to 4 cores on the 1950X (it’s 3 cores per die on the 1920X). AMD says this will give the Threadripper processors performance equivalent to the Ryzen 7 1800X or Ryzen 5 1600X, though you do so at the expense of half the cores you paid for. (At least until you change the setting and reboot.) If you think you will find yourself in this mode for the majority of the time, you’d be better off saving some cash and just buying that Ryzen 7 1800X processor.
AMD found a few games, notably Dirt Rally and Far Cry Primal, which have bugs preventing the application from loading correctly when more than 20 logical cores are detected. You can either enable this legacy mode to play them or disable SMT instead.
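To see why either workaround gets under the 20 logical core limit, here is a small sketch of the arithmetic. It assumes two threads per core with SMT enabled and that Legacy Compatibility Mode leaves SMT on while halving the core count; the helper `logical_cores` is a hypothetical name for illustration.

```python
BUG_THRESHOLD = 20  # affected games fail to load above this many logical cores


def logical_cores(physical_cores, smt=True, legacy_mode=False):
    """Logical core count the OS would see (assumes 2 threads per core with SMT)."""
    cores = physical_cores // 2 if legacy_mode else physical_cores
    return cores * 2 if smt else cores


configs = (
    ("default", {}),
    ("legacy mode", {"legacy_mode": True}),
    ("SMT off", {"smt": False}),
)

for name, physical in (("1950X", 16), ("1920X", 12)):
    for label, kwargs in configs:
        count = logical_cores(physical, **kwargs)
        verdict = "loads" if count <= BUG_THRESHOLD else "hits the bug"
        print(f"{name} {label}: {count} logical cores, {verdict}")
```

Both parts exceed the limit in their default configuration, and both workarounds bring them to 16 or 12 logical cores.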
Complications around high core count processors will not be unique to AMD, and Intel will deal with the same types of issues when its own 12+ core CPUs hit the market later this year. Intel will not have to deal with significant memory access concerns, though, thanks to its single, monolithic die design. I am interested to see what advantages this may offer Intel Skylake-X.
Testing Core to Core Latency on Threadripper
During the release window of the Ryzen 7 processor, we at PC Perspective used some custom applications to test the real-world latency of various architectures. We found that because of the design of the Zen architecture, with its CCX and Infinity Fabric integration, core to core latency was noticeably longer than in previous multi-core designs from both AMD and Intel. Because the two CCXs (core complexes) on each Ryzen die communicate through a unique fabric, the latency between them was higher than the latency between cores on the same CCX. The memory latency between the four cores on each CCX was around 40ns, while the latency between any two cores on opposing CCXs was near 140ns. It was because of this latency that 1080p gaming performance and some other similar, latency dependent workloads took a hit on Ryzen that they did not exhibit on other CPUs.
With Threadripper (and EPYC actually), AMD has another potential hop of memory latency between threads running on different physical dies. Let’s see what that looks like.
Okay, there’s a lot going on here and it is reasonable to assert that it’s near impossible to follow every line or every data point it showcases. What is most important to understand is that there are four distinct levels of latency on the Threadripper CPU: per-core, per-CCX, per-die, and cross-die. When running at DDR4 2400 MHz memory speeds (which directly relates to the speed of the Infinity Fabric), the memory latency for threads sharing the same core is ~21ns and for threads on the same CCX ~48ns. When we cross from a CCX to another CCX on the same physical die, latency jumps to ~143ns, in line with what we measured on the Ryzen 7/5/3 family of CPUs. However, once memory accesses need to cross from one die to the next, latency jumps to over 250ns.
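The four tiers can be expressed as a simple lookup. The sketch below is illustrative only: it assumes a hypothetical thread numbering in which consecutive logical threads share a core, cores are grouped four to a CCX, and there are two CCXs per die. Real operating systems may enumerate threads differently. The latency values are the approximate DDR4 2400 MHz figures quoted above, with the cross-die value treated as a lower bound.

```python
THREADS_PER_CORE = 2
CORES_PER_CCX = 4
CCX_PER_DIE = 2

# Approximate latencies at DDR4 2400 MHz from the text (cross-die is "over 250ns").
LATENCY_NS = {"same core": 21, "same CCX": 48, "same die": 143, "cross die": 250}


def locate(thread_id):
    """Map a logical thread id to (core, ccx, die) under the assumed numbering."""
    core = thread_id // THREADS_PER_CORE
    ccx = core // CORES_PER_CCX
    die = ccx // CCX_PER_DIE
    return core, ccx, die


def latency_tier(thread_a, thread_b):
    """Return which of the four latency tiers a pair of threads falls into."""
    core_a, ccx_a, die_a = locate(thread_a)
    core_b, ccx_b, die_b = locate(thread_b)
    if core_a == core_b:
        return "same core"
    if ccx_a == ccx_b:
        return "same CCX"
    if die_a == die_b:
        return "same die"
    return "cross die"


for other in (1, 2, 8, 16):
    tier = latency_tier(0, other)
    print(f"thread 0 <-> thread {other}: {tier}, roughly {LATENCY_NS[tier]}ns")
```

With this numbering, thread 0 reaches thread 1 on its own core, thread 2 on its CCX, thread 8 on the neighboring CCX, and thread 16 on the other die.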
Increasing the memory speed to 3200 MHz shows considerable decreases in memory latency. The first two latencies drop only slightly, to 20ns for on-core and 45ns for on-CCX; these gains are smaller as they aren’t impacted as much by the Infinity Fabric implementation. Crossing from CCX to CCX, though, we see latency drop to 125ns (14% faster), and going from die to die shows latency of 203ns (23% faster). These are significant performance gains for Threadripper and indicate that we will see performance advantages from higher clocked memory on multi-threaded workloads that have high memory latency dependencies.
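For readers checking the percentages, the "faster" figures correspond to the ratio of old latency to new latency, which is a different number from the percentage reduction in latency itself. A short sketch using the measurements quoted above (taking the cross-die 2400 MHz result as 250ns, though the text only says "over 250ns"):

```python
def percent_faster(old_ns, new_ns):
    """Speedup expressed as how much faster the new latency is (old/new - 1)."""
    return (old_ns / new_ns - 1) * 100


def percent_lower(old_ns, new_ns):
    """Reduction in latency as a share of the old value."""
    return (old_ns - new_ns) / old_ns * 100


for label, old_ns, new_ns in (("CCX to CCX", 143, 125), ("die to die", 250, 203)):
    print(f"{label}: {percent_faster(old_ns, new_ns):.1f}% faster, "
          f"{percent_lower(old_ns, new_ns):.1f}% lower latency")
```

Either way of stating it, the cross-die path benefits the most from faster memory.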
For comparison, here is the same tool run on a dual-socket Xeon E5-2680 v2 platform we happen to have in the office. Based on Ivy Bridge-EP, this four-year-old machine has surprisingly similar metrics to the Threadripper processor when it comes to memory latency. Notice that there are only three distinct levels of performance (though there is plenty of variance at the top), showing us on-core, on-die, and cross-socket results. The QPI interface used to connect the two Intel Xeon processors averages somewhere around 240ns of latency to cross between the two physical sockets.
Finally, here is a look at latency from thread zero across to thread 31 (limited to keep the graph readable; the Xeon remains the same after that). The architecture, die layout, and Infinity Fabric design clearly present a unique arrangement of memory for software and OS developers to work with. AMD will continue to fight the issues around memory latency for its platforms, and the move to a multi-die configuration has added one more significant step to that latency.