AMD Ryzen and the Windows 10 Scheduler – No Silver Bullet
As it turns out, Windows 10 is scheduling just fine on Ryzen.
** UPDATE 3/13 5 PM **
AMD has posted a follow-up statement that officially clears up much of the conjecture this article was attempting to clarify. Relevant points from their post that relate to this article as well as many of the requests for additional testing we have seen since its posting (emphasis mine):
-
"We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."
-
"Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows. Any differences in performance can be more likely attributed to software architecture differences between these OSes."
So there you have it, straight from the horse's mouth. AMD does not believe the problem lies within the Windows thread scheduler. SMT performance in gaming workloads was also addressed:
-
"Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.
For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."
We are still digging into the observed differences of toggling SMT compared with disabling the second CCX, but it is good to see AMD issue a clarifying statement here for all of those out there observing and reporting on SMT-related performance deltas.
** END UPDATE **
Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout
Initial reviews of AMD’s Ryzen CPU revealed a few inefficiencies in some situations, particularly in gaming workloads running at the more common resolutions like 1080p, where the CPU becomes more of a bottleneck when coupled with modern GPUs. Lots of folks have theorized about what could possibly be causing these issues, and most recent attention appears to have been directed at the Windows 10 scheduler and its supposed inability to properly place threads on the Ryzen cores for the most efficient processing.
I typically have Task Manager open while running storage tests (they are boring to watch otherwise), and I naturally had it open during Ryzen platform storage testing. I’m accustomed to how the IO workers are distributed across reported threads, and in the case of SMT capable CPUs, distributed across cores. There is a clear difference when viewing our custom storage workloads with SMT on vs. off, and it was dead obvious to me that core loading was working as expected while I was testing Ryzen. I went back and pulled the actual thread/core loading data from my testing results to confirm:
The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core. As you can see above, the single IO handler thread was spread across the first four cores during its run, but the Windows scheduler was always hitting just one of the two available SMT threads on any single core at one time.
My testing for Ryan’s Ryzen review consisted of only single threaded workloads, but we can make things a bit clearer by loading down half of the CPU while toggling SMT off. We do this by increasing the worker count to 4, which is half of the 8 threads available on the Ryzen processor with SMT disabled in the motherboard BIOS.
SMT OFF, 8 cores, 4 workers
With SMT off, the scheduler is clearly not giving priority to any particular core and the work is spread throughout the physical cores in a fairly even fashion.
Now let’s try with SMT turned back on and doubling the number of IO workers to 8 to keep the CPU half loaded:
SMT ON, 16 (logical) cores, 8 workers
With SMT on, we see a very different result. The scheduler is clearly loading only one thread per core. This could only be possible if Windows was aware of the 2-way SMT (two threads per core) configuration of the Ryzen processor. Do note that sometimes the workload will toggle around every few seconds, but the total loading on each physical core will still remain at ~50%. I chose a workload that saturated its thread just enough for Windows to not shift it around as it ran, making the above result even clearer.
Synthetic Testing Procedure
While the storage testing methods above provide a real-world example of the Windows 10 scheduler working as expected, we do have another workload that can help demonstrate core balancing with Intel Core and AMD Ryzen processors. A quick and simple custom-built C++ application can be used to generate generic worker threads and monitor for core collisions and resolutions.
This test app has a very straightforward workflow. Every few seconds it generates a new thread, capping at N/2 threads total, where N is equal to the reported number of logical cores. If the OS scheduler is working as expected, it should load 8 threads across 8 physical cores, though which specific logical core is used within each physical core will depend on very minute parameters and conditions going on in the OS background.
By monitoring the APIC_ID through the CPUID instruction, the first application thread monitors all threads and detects and reports on collisions – when a thread from our app is running on the same core as another thread from our app. That thread also reports when those collisions have been cleared. In an ideal and expected environment where Windows 10 knows the boundaries of physical and logical cores, you should never see more than one thread of a core loaded at the same time.
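For readers who want to experiment with the concept, here is a minimal sketch of how such a collision monitor could be built. This is not the actual tool used for our testing; it assumes the MSVC __cpuid intrinsic and that, on a 2-way SMT part, the low bit of the initial APIC ID selects the SMT sibling, so shifting the ID right by one yields a physical core number.

```cpp
// Minimal sketch of a core-collision monitor (not the actual PC Perspective tool).
// Assumption: on a 2-way SMT CPU the low bit of the initial APIC ID picks the SMT
// sibling, so (apic_id >> 1) identifies the physical core.
#include <intrin.h>     // __cpuid (MSVC)
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

static int physical_core_id() {
    int regs[4];
    __cpuid(regs, 1);                 // CPUID leaf 1: EBX[31:24] = initial APIC ID
    int apic_id = (regs[1] >> 24) & 0xFF;
    return apic_id >> 1;              // assumed mapping: drop the SMT bit
}

int main() {
    const unsigned workers = std::thread::hardware_concurrency() / 2;
    std::vector<std::atomic<int>> core(workers);
    for (auto& c : core) c = -1;

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&core, i] {
            for (;;) {
                core[i] = physical_core_id();          // report where we are running
                volatile double x = 1.0;               // keep the thread busy
                for (int k = 0; k < 1000000; ++k) x *= 1.000001;
            }
        });
        std::this_thread::sleep_for(std::chrono::seconds(2));  // stagger thread creation
    }

    for (;;) {                                         // monitor loop (main thread)
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        for (unsigned a = 0; a < workers; ++a)
            for (unsigned b = a + 1; b < workers; ++b)
                if (core[a] >= 0 && core[a] == core[b])
                    std::printf("collision: workers %u and %u on physical core %d\n",
                                a, b, core[a].load());
    }
}
```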
This screenshot shows our app working on the left and the Windows Task Manager on the right with logical cores labeled. While it may look like all logical cores are being utilized at the same time, in fact they are not. At any given point, only LCore 0 or LCore 1 are actively processing a thread. Need proof? Check out the modified view of the task manager where I copy the graph of LCore 1/5/9/13 over the graph of LCore 0/4/8/12 with inverted colors to aid viewability.
If you look closely, by overlapping the graphs in this way, you can see that the threads migrate from LCore 0 to LCore 1, LCore 4 to LCore 5, and so on. The graphs intersect and fill in to consume ~100% of the physical core. This pattern is repeated for the other 8 logical cores on the right two columns as well.
Running the same application on a Core i7-5960X Haswell-E 8-core processor shows a very similar behavior.
Each pair of logical cores shares a single thread and when thread transitions occur away from LCore N, they migrate perfectly to LCore N+1. It does appear that in this scenario the Intel system is showing a more stable thread distribution than the Ryzen system. While that may in fact confer some performance advantage on the 5960X configuration, the penalty for intra-core thread migration is expected to be very minute.
The fact that Windows 10 is balancing the 8 thread load specifically between matching logical core pairs indicates that the operating system is perfectly aware of the processor topology and is selecting distinct cores first to complete the work.
Information from this custom application, along with the storage performance tool example above, clearly show that Windows 10 is attempting to balance work on Ryzen between cores in the same manner that we have experienced with Intel and its HyperThreaded processors for many years.
Pinging Cores
One potential pitfall of this testing process might have been seen if Windows was not enumerating the processor logical cores correctly. What if, in our Task Manager graphs above, Windows 10 was accidentally mapping logical cores from different physical cores together? If that were the case, Windows would be detrimentally affecting performance thinking it was moving threads between logical cores on the same physical core when it was actually moving them between physical cores.
To answer that question we went with another custom written C++ application with a very simple premise: ping threads between cores. If we pass a message directly between each logical core and measure the time it takes for it to get there, we can confirm Windows' core enumeration. Passing data between two threads on the same physical core should result in the fastest result as they share local cache. Threads running on the same package (as all threads on the processors technically are) should be slightly slower as they need to communicate between global shared caches. Finally, if we had multi-socket configurations that would be even slower as they have to communicate through memory or fabric.
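To make the method concrete, here is a minimal sketch of such a ping test; it is not the exact tool that produced the charts below. Two threads are pinned to chosen logical processors with SetThreadAffinityMask and bounce a value through a shared atomic, and it assumes a single processor group (64 or fewer logical CPUs).

```cpp
// Minimal sketch of a core-to-core ping test (not the exact PC Perspective tool).
// Two threads pinned to chosen logical CPUs bounce a value through a shared atomic;
// one round trip is two cache-line handoffs, so one-way latency ~= round trip / 2.
#include <windows.h>    // SetThreadAffinityMask
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static double one_way_ns(unsigned cpu_a, unsigned cpu_b, int round_trips = 1000000) {
    std::atomic<int> turn{0};

    std::thread responder([&] {
        SetThreadAffinityMask(GetCurrentThread(), 1ULL << cpu_b);
        for (int i = 0; i < round_trips; ++i) {
            while (turn.load(std::memory_order_acquire) != 2 * i + 1) {}  // wait for ping
            turn.store(2 * i + 2, std::memory_order_release);             // send pong
        }
    });

    SetThreadAffinityMask(GetCurrentThread(), 1ULL << cpu_a);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < round_trips; ++i) {
        turn.store(2 * i + 1, std::memory_order_release);                 // send ping
        while (turn.load(std::memory_order_acquire) != 2 * i + 2) {}      // wait for pong
    }
    auto t1 = std::chrono::steady_clock::now();
    responder.join();

    double rt_ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / round_trips;
    return rt_ns / 2.0;   // approximate one-way latency
}

int main() {
    // Example: logical CPU 0 vs. its SMT sibling (1) and vs. a different core (2).
    std::printf("CPU 0 -> 1: %.1f ns\n", one_way_ns(0, 1));
    std::printf("CPU 0 -> 2: %.1f ns\n", one_way_ns(0, 2));
}
```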
Let's look at a complicated chart:
What we are looking at above is how long it takes a one-way ping to travel from one logical core to the next. The line riding around 76 ns indicates how long these pings take when they have to travel to another physical core. Pings that stay within the same physical core take a much shorter 14 ns to complete. The above example was a 5960X and confirms that threads 0 and 1 are on the same physical core, threads 2 and 3 are on the same physical core, etc.
Now let's take a look at Ryzen on the same scale:
There's another layer of latency there, but let us focus on the bottom of the chart first and note that the relative locations of the colored plot lines are arranged identically to that of the Intel CPU. This tells us that logical cores within physical cores are being enumerated correctly ({0,1}, {2,3}, etc.). That's the bit of information we were after and it validates that Windows 10 is correctly enumerating the core structure of Ryzen and thus the scheduling comparisons we made above are 100% accurate. Windows 10 does not have a scheduling conflict on Ryzen processors.
But there are some other important differences standing out here. Pings within the same physical core come out to 26 ns, and pings to adjacent physical cores are in the 42 ns range (lower than Intel, which is good), but that is not the whole story. Ryzen subdivides by what is called a "Core Complex", or CCX for short. Each CCX contains four physical Zen cores and they communicate through what AMD calls Infinity Fabric. That piece of information should click with the above chart, as it appears hopping across CCXs costs another ~100 ns of latency, bringing the total to 142 ns for those cases.
While it was not our reason for performing this test, the results may provide a possible explanation for the relatively poor performance seen in some gaming workloads. Multithreaded media encoding and tests like Cinebench segment chunks of the workload across multiple threads. There is little inter-thread communication necessary as each chunk is sent back to a coordination thread upon completion. Games (and some other workloads we assume) are a different story as their threads are sharing a lot of actively changing data, and a game that does this heavily might incur some penalty if a lot of those communications ended up crossing between CCX modules. We do not yet know the exact impact this could have on any specific game, but we do know that communicating across Ryzen cores on different CCX modules takes twice as long as Intel's inter-core communication as seen in the examples above, and 2x the latency of anything is bound to have an impact.
Some of you may believe that there could be some optimization to the Windows scheduler to fix this issue. Perhaps keep processes on one CCX if at all possible. Well, in the testing we did, that was already happening. Here is the SMT ON result for a lighter (13%) workload using two threads:
See what's going on there? The Windows scheduler was already keeping those threads within the same CCX. This was repeatable (some runs were on the other CCX) and did not appear to be coincidental. Further, the example shown in the first (bar) chart demonstrated a workload spread across the four cores in CCX 0.
Closing Thoughts
What began as a simple internal discussion about the validity of claims that Windows 10 scheduling might be to blame for some of Ryzen's performance oddities, and that an update from Microsoft and AMD might magically save us all, has turned into a full day with many people chipping in to help put together a great story. The team at PC Perspective believes strongly that the Windows 10 scheduler is not improperly assigning workloads to Ryzen processors because of a lack of architecture knowledge on the structure of the CPU.
In fact, though we are waiting for official comments we can attribute from AMD on the matter, I have been told by highly knowledgeable individuals inside the company that even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating on gaming performance.
In the process, we did find a new source of information in our latency testing tool that clearly shows differentiation between Intel's architecture and AMD's Zen architecture for core to core communications. In this way at least, the CCX design of 8-core Ryzen CPUs appears to more closely emulate a 2-socket system. With that, it is possible for Windows to logically split the CCX modules via Non-Uniform Memory Access (NUMA), but that would force everything not specifically coded to span NUMA nodes (all games, some media encoders, etc.) to use only half of Ryzen. How does this new information affect our expectation of something like Naples, which will depend on Infinity Fabric even more directly for AMD's enterprise play?
There is still much to learn and more to investigate as we find the secrets that this new AMD architecture has in store for us. We welcome your discussion, comments, and questions below!
A likely candidate for the problem could be the NVIDIA GTX 1080.
Asynchronous Compute and Asynchronous Shader Pipelines just do not work with NVIDIA hardware. And the driver emulation provided by NVIDIA is also a poor solution.
Since nobody is talking about this, and given the controversy regarding the 3DMark Time Spy benchmarks, a conclusion could be drawn that the inability of the GTX 1080 to support Asynchronous Compute may be the culprit.
So what happens when you disable 4 cores and thus effectively eliminate the Infinity Fabric traffic?
Theoretically it should perform much closer to, say, a 7700K, then. In fact, perhaps not a bad idea to see what happens when you run both at 4.0 GHz too; compare everything from 8C/16T, 8C/8T and 4C/8T.
“With that, it is possible for Windows to logically split the CCX modules via the Non-Uniform Memory Access (NUMA), but that would force everything not specifically coded to span NUMA nodes (all games, some media encoders, etc) to use only half of Ryzen.”
Allyn Malventano, I think you should change that “all games” to something else, because there are games, quite many in fact, which are coded to work with NUMA systems.
Yes, this is possible, but then every application not specifically coded to be NUMA aware will then be restricted to half of the available cores.
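For context, "NUMA aware" here means the application itself asks the OS for the node layout and places its threads and memory accordingly. Here is a minimal sketch of that topology query on Windows; it is illustrative only, since today's Ryzen does not actually expose multiple NUMA nodes and would report a single node.

```cpp
// Minimal sketch: how a NUMA-aware Windows application discovers node topology
// so it can spread threads and allocations across every node instead of one.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest_node = 0;
    if (!GetNumaHighestNodeNumber(&highest_node)) return 1;

    for (USHORT node = 0; node <= highest_node; ++node) {
        GROUP_AFFINITY affinity = {};
        if (GetNumaNodeProcessorMaskEx(node, &affinity)) {
            std::printf("NUMA node %u: group %u, processor mask 0x%llx\n",
                        node, affinity.Group,
                        static_cast<unsigned long long>(affinity.Mask));
            // A NUMA-aware app would pin worker threads to this mask
            // (SetThreadGroupAffinity) and allocate their memory locally.
        }
    }
    return 0;
}
```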
Before you all mob-gather to burn the witches:
Have you geniuses thought to look at DX11, or how game engines, or for that matter 'to the metal' coding, could be affecting all of this? After all, NUMA-aware programs work fine.
I don’t understand the Intel latency. Did you measure communication core by core, or did you ping all cores at the same time?
Maybe somebody can explain to me how the Intel ring bus works.
For example, core 1 and core 4 must communicate and core 2 and core 5 must communicate at the same time. How does the ring bus behave? Must cores 2 and 5 wait until 1 and 4 are finished, or can they all communicate at the same time?
If Intel 8-core CPUs are both NUMA and UMA aware, how do they achieve competitive 2-core FPS scores vs. the 7700? Curious.
Great stuff. I think these types of simple benchmarks, which directly measure the performance of CPU subsystems, should be in all CPU reviews. It could be a page,
A big thanks to PCPer for this in-depth look.
I’d like to point out that Intel has CPUs with multiple ring buses, the MCC and HCC variants of Haswell/Broadwell-EP.
It would be interesting to see what the latency is between the two ring buses.
Can you possibly release the source code or repeat the test on a corresponding Xeon?
Although those CPUs should not appear much in consumer clients, they might in workstations. It would be interesting to know if the Windows scheduler is aware of that and tries to prevent bouncing between the different ring buses.
Great write-up, Allyn. Thank you.
I would like to see testing done on Windows 7, because I am testing myself here with a couple of Ryzen builds and Win 7 seems to be handling it much better. I am getting fantastic minimums, better on 7 than on 10.
When I test with Intel's MLC software, the local socket L2->L2 HIT latency is 38 ns and the L2->L2 HITM latency is 42 ns. This is for a non-overclockable Xeon E5-2640 v4 with 25 MB cache (2.7 GHz L3 cache). Does MLC work with AMD CPUs? I would like to know the numbers for the Ryzen CPU. Perhaps the HIT and HITM numbers are very different? I suspect the Infinity Fabric is used for moving data if cache lines are unmodified? Otherwise data must be written to DRAM first?
Also, please provide the Ryzen numbers for loaded latency.
>custom written C++ application with a very simple premise: ping threads between cores
Would you please release your source code [no need to tidy it up or add comments]? Several of us would like to experiment by executing your exact code after compiling it with different compilers, and with different background processes and services controlled.
Custom tweaked from cut and paste, or NDA-related sorts of things! Go to the Anandtech "Ryzen: Strictly technical" forum thread; Malventano is also taking part in that discussion. There are whole testing software/SDK packages available to help in that test code creation process.
And wait until the server market pros get their hands on the RTM Zen/Naples SKUs! There will be plenty of things sussed out there to assure top performance for any workloads that may tax the Zen/Naples and Zen/Ryzen (by extension) CCX units/Infinity Fabric/cache subsystems in a negative way.
Windows 7 Non-Uniform Memory Access Architectures http://news.softpedia.com/news/Windows-7-Non-Uniform-Memory-Access-Architectures-100885.shtml
Is there any correlation between the Windows OS versions after Win7 and the DirectX versions after Win7?
The problem is the penalty when the scheduler moves a thread across the CCXs, which are like NUMA nodes (or rather, should simply be considered two separate CPUs altogether, like a dual CPU system).
The penalty is huge for that. Xeons and other systems are properly handled by the Windows 10 scheduler. Ryzen is not. It presents itself as a single CPU, so Windows doesn't care about scheduling across CCXs. And you take about a 10x penalty for that (22 GB/s vs. 200+ GB/s, roughly) for moving across the IF (Intel calls theirs QPI, I think).
This article is really ignorant. You need to change the title, and particularly that bolded editor's note at the top.
It's up to AMD to identify the need for NUMA segmentation in their CPUID. That's not the scheduler's fault.
Ryzen is not truly NUMA, as that would mean each CCX would have its own RAM tied to it.
And it is the scheduler's task to take cache into consideration when deciding where to run a thread.
It seems that cache was wrongly detected for a time, or detection was/is unreliable.
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/#post-38770528
A scheduler update or correct cache detection may not be a silver bullet, but it would help.
Ryzen chips aren’t actually NUMA as each core can access all of main memory with the same latency. That’s what NUMA schedulers are meant to address. In true NUMA systems, the goal of a NUMA aware scheduler is to keep a process on a core as close to its memory as it can.
OS support for NUMA goes beyond just the scheduler, it is in the memory allocator as well–you only want to allocate memory to a process that is local to the core it’s bound to.
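As a concrete example of the allocator side, Windows exposes node-preferred allocation through VirtualAllocExNuma. A minimal sketch follows; the node number and buffer size are just placeholders.

```cpp
// Minimal sketch: ask Windows for memory that prefers a specific NUMA node, so a
// thread pinned to that node gets local pages. Node 0 and 64 MiB are placeholders.
#include <windows.h>
#include <cstdio>

int main() {
    const SIZE_T size = 64ull * 1024 * 1024;     // 64 MiB
    const DWORD preferred_node = 0;              // node the worker thread is pinned to

    void* buf = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                   preferred_node);
    if (!buf) {
        std::printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("allocated %llu bytes preferring NUMA node %lu\n",
                static_cast<unsigned long long>(size), preferred_node);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```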
The situation with Ryzen is more like the first ‘dual core’ Intel chips which had two cores (each with local L2) which shared access to main memory via the FSB. The difference is that the two CCX in Ryzen share access to main memory internally instead of through a shared FSB and they have local L2 and L3.
It’s not difficult to believe that Windows 10 would need a little tuning to support that configuration. It’s also not difficult to believe that Windows 7 retains support for it. One could imagine the conversation: “Shall we keep support for Smithfield?” “If people run Win10 on Smithfield, they shouldn’t expect the best performance, so no, let’s drop support for that.”
So, no, AMD was not wrong to not set the CPUID info to declare the two CCX as NUMA (because they’re not). I don’t think an OS could get enough information from the CPUID info to properly support this–unless you include the CPU VendorID/Family/Model info in that. Then you’d need a table of ‘quirks’ that say, oh, yeah, treat the first four physical cores as one processor and the second four as another and schedule appropriately.
And, it may go beyond that. Win10 may have dropped support for that Smithfield type of scheduling because of the reasonable expectation that Win 10 wasn’t going to run on a dusty old Pentium D.
Hey Allyn, this was a great read that finally seems to diagnose what is really going on here with the R7 series. My question is: are the 6 and 4 core variants the same architecture, in the sense that they will also be 2 modules of 3 cores for the six-core and 2 modules of 2 cores for the quad-core Ryzens? Or do we not know this information yet? If not, it would fix this gaming weakness, correct? Since it would eliminate the cross-module communication.
The 6 core will be 3 per CCX, but I really hope the quads use only one CCX, because plenty of things will want to be able to hit 4 logical cores without having to spill over onto another CCX.
I think you really overestimate the amount of latency-sensitive IPC of common workloads and underestimate the value of larger cache pools.
Most people will not be trying to do FEA on their 6c R5-1600X or whatever.
So then in theory the 6 core variant would further magnify the problem of the 8 core processors, since even lighter loads would have to cross CCX complexes as compared to the R7s? I would assume the 4 core variant would have no need for separate modules and would simply be one half of the R7 series. Correct me if I'm wrong in thinking this. Thanks.
AMD made a lot of compromises to save money, so the 4 core CCX design is all they have as far as I know. So the 4 core is definitely 1 CCX; the 6 core might be 4+2 or 3+3, but 4+2 makes more sense when considering the current issue with the Infinity Fabric.
And again, the Infinity Fabric's problem is not speed, it's the overall limit. Tweaking the targeted balance according to size and priority, to get rid of unnecessary back and forth between CCXs, will go a long way toward helping latency compared to all the current random swapping by Windows.
So the bottom line: Ryzen can swap between CCXs without losing performance; you just need to keep the fabric's load within reason.
Allyn Malventano – “but I really hope the quads use only one CCX”
Why would AMD not have a 4 core die when that is the processor that will account for the bulk of their sales? If they have a 32 core Naples on the way they surely did a clean 4 core design for Ryzen 3 and 5?
Reportedly, Ryzen 5 will be the same 2 CCX as Ryzen 7 but with one core disabled on each. (See here: https://www.pcper.com/news/Processors/AMD-Launching-Ryzen-5-Six-Core-Processors-Soon-Q2-2017)
What we are wondering now is whether AMD's scalable 2-CCX design will continue that trend and scale down to Ryzen 3 by disabling two cores on each CCX for a quad core (4c/8t) part, or if AMD will use a single CCX for Ryzen 3, either by turning off one CCX from the normally 2-CCX monolithic die or by using a Raven Ridge die sans GPU, which would be a single CPU CCX (speculating here, that is; I don't know with 100% certainty that RR dies are set up that way). In our work chat I was debating this and the binning benefits/strategies/product segmentations possible with both options. My guess, knowing AMD is going to be AMD, is that it will probably be a mixture of both, where Ryzen 3 chips can be either 2 CCXs with two cores each, or one CCX enabled and one disabled for 4 cores on a single CCX, and customers will just have to play the silicon lottery to get chips with the single-CCX type of binning. This would give them the most salvageable dies from binning for the product that will be the highest volume/most mainstream part, but of course has the wrinkle of some chips not performing the same in some workloads due to this inter-CCX latency. Shrug… it could go either way, heh. I hope that it is a single CCX for simplicity and performance's sake though!
Edit: as for the 32 core Naples, it is actually four 8c/16t 2-CCX dies on package rather than a monstrous 32 core monolithic die. https://www.pcper.com/news/Processors/AMD-Prepares-Zen-Based-Naples-Server-SoC-Q2-Launch
How to Set Processor Affinity to an Application in Windows 7 https://www.sevenforums.com/tutorials/83632-processor-affinity-set-applications.html
Information
Processor affinity, or CPU pinning, enables the binding and un-binding of a process or thread to a physical CPU or a range of CPUs, so that the process or thread in question will run only on the CPU or range of CPUs in question, rather than being able to run on any CPU.
By default, Vista and Windows 7 run an application on all available cores of the processor. If you have a multi-core processor, then this tutorial will show you how to set the processor affinity for an application to control which core(s) of the processor the application will run on.
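Tying that back to the CCX discussion: the same thing can be done programmatically. Here is a minimal sketch that confines the current process to the first eight logical processors, on the assumption (consistent with the enumeration testing in the article) that logical CPUs 0-7 belong to the first CCX on an 8-core Ryzen.

```cpp
// Minimal sketch: pin the current process to logical processors 0-7, which on an
// 8-core Ryzen should correspond to CCX 0 (assuming the enumeration shown above).
#include <windows.h>
#include <cstdio>

int main() {
    const DWORD_PTR ccx0_mask = 0xFF;            // bits 0-7 = logical CPUs 0-7
    if (!SetProcessAffinityMask(GetCurrentProcess(), ccx0_mask)) {
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("Process confined to logical processors 0-7 (CCX 0).\n");
    // ...launch or run the latency-sensitive workload from here...
    return 0;
}
```

The same mask can also be applied to a running game from Task Manager's affinity dialog, as the tutorial above describes.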
If you look at these tests:
http://www.zolkorn.com/en/amd-ryzen-7-1800x-vs-intel-core-i7-7700k-mhz-by-mhz-core-by-core-en/view-all/
NUMA effectively shutting off the last 8 logical cores of a Ryzen processor wouldn’t necessarily be a bad thing, since a Ryzen with 4 cores and half of its L3 cache disabled competes very well with an i7-7700K.
It could be a very simple solution for AMD and the Motherboard makers until NUMA support can be added to games.
And as an added suggestion, you really should take a look into the compiler bias “conspiracy theories”; since Intel settled that case out of court for more than 1 billion dollars, there could be some truth to it.
This article has good initial data collection but questionable analysis and a premature conclusion, since SMT core allocation and inter-core/CCX latencies are only part of the picture.
Something that could murder Zen performance but will not affect smaller Haswells/Broadwells much is process migration across arbitrary cores, that is, back and forth across CCXs. Smaller Xeons (and their -E workstation counterparts) have every core sitting on a single ring bus with cache lines hashed or otherwise interleaved by low-order address bits across the set. At worst, an i7-5960X needs to rewarm its L2 cache, which is 256 kiB and has a 64B pipe to L3.
On the other hand, Zen has twice as large L2s at 512 kiB, only 32B lanes to L3, and higher latencies to the remote CCX, so full cache rewarming will take at least 4x more time even in the case that all the needed lines are still in the remote L3 cluster.
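In round numbers, under those assumptions (and ignoring the latency differences):

$$\frac{512\ \mathrm{KiB} \div 32\ \mathrm{B/cycle}}{256\ \mathrm{KiB} \div 64\ \mathrm{B/cycle}} = \frac{16384\ \mathrm{cycles}}{4096\ \mathrm{cycles}} = 4\times$$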
Before claiming that scheduling has nothing to do with the issue, it needs to be measured and provable that the scheduler does not lightly bounce given threads across CCXs on a moderately to heavy system load.
As a final note, it seems disappointing that the scheduler put 4 moderately busy threads on the same CCX, since most workloads would benefit more from bigger shares of local L3 than from lower inter-core latency.
They said that SMT scheduling is not the issue, not that there are no scheduling issues.
Another conspiracy theory for you: AMD deliberately didn’t try to fix the Windows 10 NUMA problem so they would have an excuse for letting the Motherboard companies release Windows 7 drivers. Why? To work around their internal agreements to support only Windows 10 and still satisfy the 49% of gamers who still use Windows 7?
I’m with you Spen… Thought the same thing.
What a great move by AMD.. lol
No, not really! If anything may hold true for Ryzen getting Windows 7 support, it will be any Zen/Naples business clients that may be using Windows 7 and have a lot of money tied up in mission critical software that is only certified to work under Windows 7. The biggest IT expense in most businesses is their custom/mission critical software, which costs far more to develop than a single OS or OS license. That’s why XP was supported for some time after it went EOL, and 7 is the new XP!
If any server client wants Zen/Naples and there are millions to be made, both AMD and M$ will make some exceptions to the rules that no one will hear about because of NDAs in legal contracts!
Yeah, SMT itself works just fine. It’s when threads get migrated across CCXs (or have heavy dependencies across the CCX boundary) that it causes problems.
Nothing to do with SMT really. Just affinity issues with how threads are placed/moved. Win 10 has no problem placing a thread that needs to share cache data from one CCX onto the other one. Makes no difference if it’s a SMT thread or not.
Very good analysis PCPer – thanks for doing this.
Do we really think that there is a software ‘fix’ for the gaming performance issue? The Ryzen is a great MT CPU but is clearly down on ST (and therefore “lightly threaded”) performance vs. Intel.
Is the gap in gaming performance much larger than expected vs. the single/four threaded performance gap between Ryzen 7 and 7700K (or 6900K in 4-thread mode)?