AMD Ryzen and the Windows 10 Scheduler – No Silver Bullet
As it turns out, Windows 10 is scheduling just fine on Ryzen.
** UPDATE 3/13 5 PM **
AMD has posted a follow-up statement that officially clears up much of the conjecture this article was attempting to clarify. Relevant points from their post that relate to this article as well as many of the requests for additional testing we have seen since its posting (emphasis mine):
"We have investigated reports alleging incorrect thread scheduling on the AMD Ryzen™ processor. Based on our findings, AMD believes that the Windows® 10 thread scheduler is operating properly for “Zen,” and we do not presently believe there is an issue with the scheduler adversely utilizing the logical and physical configurations of the architecture."
"Finally, we have reviewed the limited available evidence concerning performance deltas between Windows® 7 and Windows® 10 on the AMD Ryzen™ CPU. We do not believe there is an issue with scheduling differences between the two versions of Windows. Any differences in performance can be more likely attributed to software architecture differences between these OSes."
So there you have it, straight from the horse's mouth. AMD does not believe the problem lies within the Windows thread scheduler. SMT performance in gaming workloads was also addressed:
"Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games. Based on our characterization of game workloads, it is our expectation that gaming applications should generally see a neutral/positive benefit from SMT. We see this neutral/positive behavior in a wide range of titles, including: Arma® 3, Battlefield™ 1, Mafia™ III, Watch Dogs™ 2, Sid Meier’s Civilization® VI, For Honor™, Hitman™, Mirror’s Edge™ Catalyst and The Division™. Independent 3rd-party analyses have corroborated these findings.
For the remaining outliers, AMD again sees multiple opportunities within the codebases of specific applications to improve how this software addresses the “Zen” architecture. We have already identified some simple changes that can improve a game’s understanding of the "Zen" core/cache topology, and we intend to provide a status update to the community when they are ready."
We are still digging into the observed differences of toggling SMT compared with disabling the second CCX, but it is good to see AMD issue a clarifying statement here for all of those out there observing and reporting on SMT-related performance deltas.
** END UPDATE **
Editor's Note: The testing you see here was a response to many days of comments and questions to our team on how and why AMD Ryzen processors are seeing performance gaps in 1080p gaming (and other scenarios) in comparison to Intel Core processors. Several outlets have posted that the culprit is the Windows 10 scheduler and its inability to properly allocate work across the logical vs. physical cores of the Zen architecture. As it turns out, we can prove that isn't the case at all. -Ryan Shrout
Initial reviews of AMD’s Ryzen CPU revealed a few inefficiencies in some situations, particularly in gaming workloads running at the more common resolutions like 1080p, where the CPU becomes more of a bottleneck when coupled with modern GPUs. Lots of folks have theorized about what could possibly be causing these issues, and the most recent attention appears to have been directed at the Windows 10 scheduler and its supposed inability to properly place threads on the Ryzen cores for the most efficient processing.
I typically have Task Manager open while running storage tests (they are boring to watch otherwise), and I naturally had it open during Ryzen platform storage testing. I’m accustomed to how the IO workers are distributed across reported threads, and in the case of SMT capable CPUs, distributed across cores. There is a clear difference when viewing our custom storage workloads with SMT on vs. off, and it was dead obvious to me that core loading was working as expected while I was testing Ryzen. I went back and pulled the actual thread/core loading data from my testing results to confirm:
The Windows scheduler has a habit of bouncing processes across available processor threads. This naturally happens as other processes share time with a particular core, with the heavier process not necessarily switching back to the same core. As you can see above, the single IO handler thread was spread across the first four cores during its run, but the Windows scheduler was always hitting just one of the two available SMT threads on any single core at one time.
My testing for Ryan’s Ryzen review consisted of only single-threaded workloads, but we can make things a bit clearer by loading down half of the CPU while toggling SMT off. We do this by setting the worker count to four, half of the eight threads available on the Ryzen processor with SMT disabled in the motherboard BIOS.
SMT OFF, 8 cores, 4 workers
With SMT off, the scheduler is clearly not giving priority to any particular core and the work is spread throughout the physical cores in a fairly even fashion.
Now let’s try with SMT turned back on and doubling the number of IO workers to 8 to keep the CPU half loaded:
SMT ON, 16 (logical) cores, 8 workers
With SMT on, we see a very different result. The scheduler is clearly loading only one thread per core. This could only be possible if Windows was aware of the 2-way SMT (two threads per core) configuration of the Ryzen processor. Do note that sometimes the workload will toggle around every few seconds, but the total loading on each physical core will still remain at ~50%. I chose a workload that saturated its thread just enough for Windows to not shift it around as it ran, making the above result even clearer.
Synthetic Testing Procedure
While the storage testing methods above provide a real-world example of the Windows 10 scheduler working as expected, we do have another workload that can help demonstrate core balancing with Intel Core and AMD Ryzen processors. A quick and simple custom-built C++ application can be used to generate generic worker threads and monitor for core collisions and resolutions.
This test app has a very straightforward workflow. Every few seconds it generates a new thread, capping at N/2 threads total, where N is the reported number of logical cores. If the OS scheduler is working as expected, it should load 8 threads across 8 physical cores, though the assignment of specific logical cores within each physical core will depend on very minute parameters and conditions in the OS background.
By monitoring the APIC_ID through the CPUID instruction, the first application thread monitors all threads and detects and reports on collisions – when a thread from our app is running on the same core as another thread from our app. That thread also reports when those collisions have been cleared. In an ideal and expected environment where Windows 10 knows the boundaries of physical and logical cores, you should never see more than one thread of a core loaded at the same time.
Click to Enlarge
This screenshot shows our app working on the left and the Windows Task Manager on the right with logical cores labeled. While it may look like all logical cores are being utilized at the same time, they in fact are not. At any given point, only LCore 0 or LCore 1 is actively processing a thread. Need proof? Check out the modified view of the Task Manager, where I copied the graph of LCores 1/5/9/13 over the graph of LCores 0/4/8/12 with inverted colors to aid visibility.
If you look closely, by overlapping the graphs in this way, you can see that the threads migrate from LCore 0 to LCore 1, LCore 4 to LCore 5, and so on. The graphs intersect and fill in to consume ~100% of the physical core. This pattern is repeated for the other 8 logical cores on the right two columns as well.
Running the same application on a Core i7-5960X Haswell-E 8-core processor shows a very similar behavior.
Click to Enlarge
Each pair of logical cores shares a single thread, and when thread transitions occur away from LCore N, they migrate perfectly to LCore N+1. It does appear that in this scenario the Intel system is showing a more stable thread distribution than the Ryzen system. While that may in fact confer some performance advantage on the 5960X configuration, the penalty for intra-core thread migration is expected to be very minute.
The fact that Windows 10 is balancing the 8 thread load specifically between matching logical core pairs indicates that the operating system is perfectly aware of the processor topology and is selecting distinct cores first to complete the work.
Information from this custom application, along with the storage performance tool example above, clearly shows that Windows 10 is attempting to balance work on Ryzen between cores in the same manner that we have experienced with Intel and its HyperThreaded processors for many years.
Pinging Cores
One potential pitfall in this testing process would arise if Windows were not enumerating the logical cores correctly. What if, in our Task Manager graphs above, Windows 10 was accidentally grouping logical cores from different physical cores together? If that were the case, Windows would be detrimentally affecting performance, thinking it was moving threads between logical cores on the same physical core when it was actually moving them between physical cores.
To answer that question we went with another custom written C++ application with a very simple premise: ping threads between cores. If we pass a message directly between each logical core and measure the time it takes for it to get there, we can confirm Windows' core enumeration. Passing data between two threads on the same physical core should result in the fastest result as they share local cache. Threads running on the same package (as all threads on the processors technically are) should be slightly slower as they need to communicate between global shared caches. Finally, if we had multi-socket configurations that would be even slower as they have to communicate through memory or fabric.
Let's look at a complicated chart:
What we are looking at above is how long it takes a one-way ping to travel from one logical core to the next. The line riding around 76 ns indicates how long these pings take when they have to travel to another physical core. Pings that stay within the same physical core take a much shorter 14 ns to complete. The above example was a 5960X and confirms that threads 0 and 1 are on the same physical core, threads 2 and 3 are on the same physical core, etc.
Now let's take a look at Ryzen on the same scale:
There's another layer of latency there, but let us focus on the bottom of the chart first and note that the relative locations of the colored plot lines are arranged identically to that of the Intel CPU. This tells us that logical cores within physical cores are being enumerated correctly ({0,1}, {2,3}, etc.). That's the bit of information we were after and it validates that Windows 10 is correctly enumerating the core structure of Ryzen and thus the scheduling comparisons we made above are 100% accurate. Windows 10 does not have a scheduling conflict on Ryzen processors.
But there are some other important differences standing out here. Pings within the same physical core come out to 26 ns, and pings to adjacent physical cores are in the 42 ns range (lower than Intel, which is good), but that is not the whole story. Ryzen is subdivided by what is called a "Core Complex", or CCX for short. Each CCX contains four physical Zen cores, and the CCXs communicate through what AMD calls Infinity Fabric. That piece of information should click with the above chart, as it appears hopping across CCXs costs another 100 ns of latency, bringing the total to 142 ns for those cases.
While it was not our reason for performing this test, the results may provide a possible explanation for the relatively poor performance seen in some gaming workloads. Multithreaded media encoding and tests like Cinebench segment chunks of the workload across multiple threads. There is little inter-thread communication necessary as each chunk is sent back to a coordination thread upon completion. Games (and some other workloads we assume) are a different story as their threads are sharing a lot of actively changing data, and a game that does this heavily might incur some penalty if a lot of those communications ended up crossing between CCX modules. We do not yet know the exact impact this could have on any specific game, but we do know that communicating across Ryzen cores on different CCX modules takes twice as long as Intel's inter-core communication as seen in the examples above, and 2x the latency of anything is bound to have an impact.
Some of you may believe that there could be some optimization to the Windows scheduler to fix this issue. Perhaps keep a process's threads on one CCX if at all possible. Well, in the testing we did, that was already happening. Here is the SMT ON result for a lighter (13%) workload using two threads:
See what's going on there? The Windows scheduler was already keeping those threads within the same CCX. This was repeatable (some runs were on the other CCX) and did not appear to be coincidental. Further, the example shown in the first (bar) chart demonstrated a workload spread across the four cores in CCX 0.
Closing Thoughts
What began as a simple internal discussion about the validity of claims that Windows 10 scheduling might be to blame for some of Ryzen's performance oddities, and that an update from Microsoft and AMD might magically save us all, has turned into a full day with many people chipping in to help put together a great story. The team at PC Perspective believes strongly that the Windows 10 scheduler is not improperly assigning workloads to Ryzen processors because of a lack of architecture knowledge on the structure of the CPU.
In fact, though we are waiting for official comments we can attribute from AMD on the matter, I have been told by knowledgeable individuals inside the company that even AMD does not believe the Windows 10 scheduler has anything at all to do with the problems they are investigating on gaming performance.
In the process, we did find a new source of information in our latency testing tool that clearly shows differentiation between Intel's architecture and AMD's Zen architecture for core-to-core communications. In this way at least, the CCX design of 8-core Ryzen CPUs appears to more closely emulate a 2-socket system. With that, it is possible for Windows to logically split the CCX modules via Non-Uniform Memory Access (NUMA), but that would force everything not specifically coded to span NUMA nodes (all games, some media encoders, etc.) to use only half of Ryzen. How does this new information affect our expectation of something like Naples, which will depend on Infinity Fabric even more directly for AMD's enterprise play?
There is still much to learn and more to investigate as we find the secrets that this new AMD architecture has in store for us. We welcome your discussion, comments, and questions below!
PCper makes claims that are trivially easy to prove false.
Win 10 clearly has a negative effect on Ryzen performance compared to Win 7, so PCper's claim that Win 10 performs adequately with Ryzen is blunt, patent falsehood. For all I know, all those tests they claim to have conducted are pure fiction as well. PCper stoops to a new low with this kind of deliberate consumer misinformation and deceit.
Is there a link you can share that shows the performance difference between the two? I’ve seen the wccftech articles conjecturing this, but no hard data?
I didn't find any articles with a solid comparison between 7 and 10 yet; all I found is a video and some forum posts.
i would love to see any article if you have a link.
https://www.reddit.com/r/Amd/comments/5xkhun/total_war_warhammer_windows_7_is_faster_than/
Finally found this link
The system tests were identical except they used different FPS measuring software in Win 7 vs Win 10. Ryzen @ 3.5 GHz.
Basically –
Windows 7 didn’t show any performance difference with SMT ON vs OFF.
Windows 7's minimum FPS is 10% higher than Windows 10's with SMT OFF, and 20-25% higher with SMT ON (!).
The OP did mention this was the game he had showing the biggest difference.
https://www.youtube.com/watch?v=U9DE83lMVio
I am pretty sure that if AMD had released the 4-core parts alongside the 8-core, none of this would've happened; the lack of an alternative is what fueled the gaming debate.
4 cores for 1080p
8 cores for 1440p/4K
My thoughts exactly…
There are some benchmarks being run on AM4 motherboards with UEFI/BIOS features enabled that allow 8-core Ryzen SKUs to be disabled down to 4 cores / 8 threads, and they show the half-enabled Ryzen (1800X) doing not so badly compared to the 7700K when both are clocked at 4.0 GHz.
So maybe there will have to be a Windows optimization that allows setting affinity by CCX unit, with games designed to keep any draw-call workloads/threads dispatched to a particular CCX unit on that CCX unit, and no dependent cross-thread draw-call workloads spread across both CCX units. The problem happens when a draw-call workload is bumped by the OS scheduler before the work on that thread is completed: threads/tasks residing on one CCX unit's cores get transferred (before workload completion) to the other CCX unit's cores, and an extra latency hop has to take place over the Infinity Fabric to handle the resulting cache coherency traffic. The thread's needed cache data/code becomes fragmented across the CCX boundary, requiring extra latency-inducing steps to access cached code/data.
So if any work (draw call or other) is dispatched to one CCX unit's cores, that work should stay there until completion, with no moving of that workload outside the CCX unit boundary while the task is ongoing. New draw-call work should be allowed to be initially dispatched to either CCX unit, but once assigned to a core on a particular CCX unit, that entire workload and thread task should stay on the same CCX unit until task completion.
There are even some people suggesting that the Ryzen CCX units be logically split NUMA like so the OS can treat each CCX unit like a 2P/2 socket system logically.
So yes, it's not that Windows is failing to do the job it was initially designed to do with respect to dispatching threads. It's that the current way Windows dispatches threads is not optimized for Ryzen's new CCX hardware construct, which has better latency within a CCX unit (intra-CCX) than across CCX units (inter-CCX).
P.S. AMD designed the CCX UNIT with modular scalable usage and fabrication in mind so future SKUs can be scaled up or down by the CCX unit. This modular design methodology allows for Better die/wafer yields to be had via smaller sized die production. It also allows for a very scalable lower cost way of adding processor power in increments to any future SKUs. The same modular design methodology will be applied to the Navi based GPU designs and even any future APU designs where CPU and GPU chiplets can be scaled up or down to produce products for all markets laptop to desktop on up to server and supercomputer.
Thanks for investigating this, Allyn and Ryan. Even if you’re proving a negative, it’s interesting stuff.
Neat to see all of the layers involved, factors, etc…
Even though Ryzen delivers on the promised 40% IPC boost (52% delivered) over their AM3-socket processors, I was one of the fanboys hoping they could match Skylake or Kaby Lake IPC.
I’m still pleased with Ryzen. I figure there are millions of gamers happily using Haswell, Ivy Bridge, even Sandy Bridge, so an R7 1700 or an R5 1600X should let me play most games with acceptable performance on the CPU side for years.
Allyn, do you think that either AMD or MS could possibly look into patching the scheduler to treat it as if it were a dual-Xeon part? I mean, it has two CCXs connected; could it work that way, as if it were an actual two-CPU part? I'm just curious what you think about that.
Delidded Ryzen 7s have 2 chips. A CCX IS a separate chip. Ryzen 5s will have 3 CPUs per chip, using the one-CPU-defective chips. I'd bet Ryzen 3s will have 2 CPUs per CCX. Eventually yields will rise. I remember when Phenom IIs were bought in cheap 2- and 3-core versions and the unused CPUs unlocked to get all 4. That might happen for Ryzens eventually.
Ryzen is one monolithic die. If you look closely at the delidded CPUs you can see that the die is continuous. What makes it look like 2 dies is the two solder patches are square and appear as two units.
You guys need to focus on cross-CCX thread switching; this is the same issue as the dual Jaguar modules in the consoles, where devs were told to never cross modules or latency goes through the roof. Now throw in what is effectively an L3 flush when this happens on Zen, plus an aggressive OS, and you have a recipe for poor gaming performance. I'm not giving AMD a pass because they should have seen this coming, but to say that the OS and the applications are not responsible at all is likely incorrect.
He is referring to this https://www.reddit.com/r/Amd/duplicates/5ybrxn/ryzen_7_is_actually_behaving_like_a_dual_4c8t/
kind of like Nvidia’s 970 3.5 fiasco but like 1000x more shitty for gamers.
I like the graphs and the work done, and forgive me, but what exactly is new here?
"Most assuredly the Windows scheduler had no business in Ryzen's issues." Still, just like everyone else, they can't really point a finger at what exactly is wrong there, which most assuredly means: not sure.
Bleh
Great write-up, glad to see some real testing for the problem rather than all the armchair nonsense being plastered all over the web.
I’m curious to see if the win7 vs win10 argument has any validity and whether or not there is any actual performance being left on the table.
that crap has been in dev for at least 4 years, is supposed to be W10 only, and they just realized NOW that there is something wrong…. with W10. Bunch of clowns.
Lol, this dropped off the AMD fanboi reddit pretty quick.
The Butthurt is Strong with this one – even though the product is really good esp. in the price / perf but unless it beats every CPU in existence today, the fanbois will cry… What a shame…
They’re at it again it seems. Microsoft themselves have acknowledged the problem, they should know…
https://www.guru3d.com/news-story/microsoft-confirms-windows-bug-is-holding-back-amd-ryzen.html
And conveniently avoiding to make comparisons with Win 7’s better implementation for the Zen architecture which sees some massive improvements in some games:
Quote:
“All of these were recorded at 3.5GHz, 2133MHz MEMCLK with R9 Nano:
Windows 10 – 1080 Ultra DX11:
8C/16T – 49.39fps (Min), 72.36fps (Avg)
8C/8T – 57.16fps (Min), 72.46fps (Avg)
Windows 7 – 1080 Ultra DX11:
8C/16T – 62.33fps (Min), 78.18fps (Avg)
8C/8T – 62.00fps (Min), 73.22fps (Avg)”
Just explain this^^, I’m waiting… And why not TEST IT YOURSELF instead?
Source: https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/page-8#post-38775732
Would also be good to see the same game on an Intel processor, R9 Nano, on both platforms..
It is possible that the Windows 7 scheduler wasn't designed to handle 16-thread processors and is basically keeping threads limited to one CCX by accident. It will be interesting to see some more testing.
Intel Paper on Hyperthreaded games:- https://software.intel.com/en-us/articles/multithreaded-game-programming-and-hyper-threading-technology
“False sharing can cause some serious performance degradation on both dual- or multi-processor and HT-enabled systems. False sharing happens when multiple threads own private data on the same cache block. For Pentium 4 processors and Xeon processors, a cache block is effectively 128-bytes. Determining whether false sharing is an issue is easy with a profile from the VTune Analyzer, and fixing the problem is as easy as padding out data structures. Perhaps an even better fix is to structure data in a cache-friendly manner, on or at 128-byte boundaries. Note that these recommendations are very complimentary to those for avoiding 64K-aliasing, so watching out for one pitfall actually helps you prevent two or more! See item [5] in the Additional Resources section for a more in-depth explanation of false sharing”
Other resources:-
http://www.iuma.ulpgc.es/~nunez/procesadoresILP/Pentium4/Pentium%204%20IA32%20Processor%20Genealogy%20f39-20%20HT%20Performance%20OS%20Issues.htm
http://www.agner.org/optimize/blog/read.php?i=6
https://mechanical-sympathy.blogspot.co.uk/2011/07/false-sharing.html
The workload you tested wasn't GPU-intensive like gaming; you might also want to check for DPC latency, as Nvidia drivers have been notorious for high latency requiring hotfixes!
While the scheduler seems to be working properly in your tests, I can't help but notice CPU load over all 16 threads in many online reviews. So the question is: why is that happening? Does the game code bypass the Windows scheduler?
http://deasproject.altervista.org/blog/ryzen-blender-rendering-windows-vs-linux/
THIS JUST IN: SHITTY AMD SHIT IS SHITTY!!
Wowwwwwwww, didn’t see that coming /s
Ryzen is here and you are (#゚Д゚)
you are always spoiling for a ლ(`ー´ლ)
and you now just ε=ε=ε=┌(;*´Д`)ノ around
Looking for a ლ(`ー´ლ)
because you where Σ(゜д゜;) and (((( ;゚Д゚))) that
Ryzen’s performance has made folks(^○^) (*^▽^*) (✿◠‿◠)
and now you (≧ロ≦) loudly and expressing exterme (╬ ಠ益ಠ)
because you where so Σ(゜д゜;) and (((( ;゚Д゚))) that
Ryzen’s performance was so \(◎o◎)/
and now you are really ヽ(o`皿′o)ノ and (≧ロ≦) loudly
and spoiling for a ლ(`ー´ლ)
fyi, i linked to this up above
https://www.reddit.com/r/Amd/duplicates/5ybrxn/ryzen_7_is_actually_behaving_like_a_dual_4c8t/
I'm curious: who really needs a Ryzen 1700X?
Blender peeps?
People maybe compiling some serious code?
Dedicated db workstations?
Scene release groups encoding?
Gamers that stream themselves while playing, maybe?
Benchmark fags?
Anything else I'm missing?
I know I've gotten by using an Atom netbook to do HTML/JavaScript and Photoshop. That netbook could play the original StarCraft just fine to get me by.
I have an i7 because I was future-proofing years ago. Still, looks like I should have just saved $30 and got the i5.
the logic to buy what you need for that moment seems to hold true for everything except maybe console refresh transitions.
Can't generalize.
I still use my Q6600 PC from 2007.
I went with a cheaper quad core VS a faster dual core.
In 2007 the $250 Q6600 was no match in games compared to a $1000 X6800.. But pretty quickly the Q6600 became the better performer.
Also at the time 3.2ghz was fine for gaming. So the Q6600 played all the games.
The exact same is true today. For $330 I would go with the R7 1700 vs. a $350 i7-7700K (and I did), as I plan to keep this PC for 5+ years.
Hopefully it will be another Q6600, and I will use it daily for a decade. (The Q6600 was hard to replace because it runs 99% of the games/apps I have beautifully.)
I don’t know why people latched onto it being an SMT problem. Both Intel and AMD have SMT and it seems to be functionally very similar. The difference is the separate core complexes that AMD uses. Intel obviously doesn’t use separate complexes, and they probably pay a latency penalty for that. The communication between cores in a many core Intel part will be slower than communication between cores in a tightly coupled CCX, but going off the CCX will incur increased latency. Intel also does not scale the memory bandwidth with core count either. With AMD’s design, they can scale the memory bandwidth with the number of cores, since it looks like each die adds another dual channel memory controller. It is unclear how the memory controllers are connected to the rest of the system though. The L3 caches and memory controllers may have their own ports on the fabric router to allow for cache coherency. The AMD system architecture seems much more scalable, although there will be some cost due to NUMA overhead. Also, AMD can use cheap small die parts rather than expensive monolithic die parts, which should make them much cheaper to make. This is looking like the situation with Opteron all over again. The Opteron processor brought on-die memory controllers and point to point processor links while Intel was still using a shared bus with the single memory controller on the chipset. Intel eventually went the same route. Intel has stayed with large monolithic die with large L3 caches to make up for the limited memory system for a long time because the profit margins on such things are enormous.
Now I am really curious as to how AMD’s fabric actually works. Looking at the die photos I have seen of Ryzen, it looks like it has a lot more un-core stuff than Intel 4-core parts. In fact, it may have more un-core area than Intel Xeon die photos I have seen. Hard to tell without doing precise measurements. Unless the die photos I have seen are actually Naples parts, I am wondering if AMD is really only making a single die variant. With possibly low yields on 14 nm, it may make some sense. They could just be selling the parts with defective links as the current Ryzen parts while stockpiling fully functional die for server parts.
I don’t know how AMD’s fabric works. It isn’t much of a stretch that they could be using configurable high speed links. Any current high speed interconnect is going to be very similar, if not the same, as the PCI express physical layer, so the links used for interprocessor communication could just be configurable as either PCI express or as an inter-processor links depending on what protocol is enabled.
I haven’t figured out how the connections would work yet. I suspect that the Ryzen die has a lot of high speed links which are not routed into the package on consumer level parts. I don’t know how many it would take. It may have 32 or more links for inter-processor communications in addition to those routed for IO on consumer parts. In server parts, those links may be configurable as PCI express or as interprocessor links.
Go read Charlie D's assessment of the AMD Infinity Fabric over at S/A, because AMD's not going to do any deep dives into its Infinity Fabric IP until both Zen/Naples and Radeon/Vega are to market. There are still some NDAs in effect in advance of Zen/Naples and Radeon/Vega actually being released, so AMD is all Johnny Tightlips until then!
There just doesn’t seem to be much info available yet; it seems like something more would have leaked out. The whole thing is kind of reminding me of the planned Alpha EV8 (21464) processor that was never made, and the original K8/Opteron was somewhat based on the same principles. In the proposed EV8 design, the caches, CPU cores, and memory were all connected to an on-die router that was to support routing between up to 512 processors. That was way back in about 2001, though. I don’t know if Jim Keller played a significant role in that design.
Anyway, the high IO bandwidth isn’t that surprising; such distributed systems have massive benefits in that regard. Even without taking the processor interconnect into account, each of the 4 die would probably still have the x16 PCI-E IO lanes of a normal Ryzen die, and that is 64 lanes right there. I have some ideas on how they may be connecting the links between processors, but that doesn’t give me any insight into how that connectivity is handled on die. It does sound like they are possibly running multiple protocols over PCI Express physical layers, though. No need to reinvent the wheel.
The Alpha EV8 (21464) was the first microprocessor designed with SMT. Thanks a lot for that, HP, you and Intel with your Itanium fiasco!
Look at the 21464’s specs and see where Intel’s SMT (HT) came from, with some minor GIMPING on Intel’s part:
“The microprocessor was an eight-issue superscalar design with out-of-order execution, four-way SMT and a deep pipeline. It fetches 16 instructions from a 64 KB two-way set-associative instruction cache. The branch predictor then selected the “good” instructions and entered them into a collapsing buffer. (This allowed for a fetch bandwidth of up to 16 instructions per cycle, depending on the taken branch density.) The front-end had significantly more stages than previous Alpha implementation and as a result, the 21464 had a significant minimum branch misprediction penalty of 14 cycles.[4] The microprocessor used an advanced branch prediction algorithm to minimize these costly penalties.”(1)
(1) “Alpha 21464”, https://en.wikipedia.org/wiki/Alpha_21464
> Both Intel and AMD have SMT and it seems to be functionally very similar.
Totally untrue. I can’t find the reference right now, but one site did compare games with SMT off for both AMD and Intel, and AMD got a big performance boost in some games with SMT disabled; the AMD deltas were much larger than Intel’s.
That is why people have been blaming the Windows 10 scheduler (especially since on Windows 7 Ryzen behaved much closer to the Intel behaviour).
It wouldn’t be surprising if Intel’s SMT implementation (which has been tweaked over many processor generations) is a bit better than AMD’s first-generation implementation. That doesn’t change the fact that the scheduling for SMT doesn’t need to be any different: the scheduler will still prefer to place one thread per physical core before loading a second thread onto any core. This whole article was about how SMT, and the scheduling for it, is not being handled any differently between the two architectures; SMT is not the problem for the scheduler. Also, Intel chips can perform worse with SMT enabled for some applications. We leave it disabled where I work, since the applications we use do not share cache well and perform better with it off.
How about game developers being the problem? Since Ryzen is a brand-new architecture from AMD, and it looks like almost all game developers have been working with Intel CPUs, maybe the games themselves are at fault. Have any games been tested for these scheduler issues?
If Windows 10 does not have a scheduling problem with Ryzen, then why is Microsoft preparing a patch for it?
The Windows 10 scheduler works just fine; it works as it was designed to work. It’s just that the current scheduler has not been updated/optimized to work efficiently across Ryzen’s CCX units for gaming workloads that are sensitive to latency.
The problem is that the Windows 10 scheduler needs an update, and that patch may help some! But all the extra telemetry workloads and ad-pushing workloads do not help matters when cached code/data gets fragmented across other cores’ caches, too far from the core (and its two processor threads) that needs it. WTF M$ and AMD! Get to coding!
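One workaround discussed for latency-sensitive games, while waiting on any patch, is pinning the process to a single CCX so its working set stays in one L3 slice. A minimal sketch of building such an affinity mask, assuming the hypothetical layout where CCX 0 owns logical CPUs 0–7 and CCX 1 owns 8–15 (the layout and function name are assumptions for illustration):

```python
# Sketch: build an affinity bitmask restricting a latency-sensitive
# process to one CCX, so its cached data stays in that CCX's L3 slice.
# Assumes CCX0 = logical CPUs 0-7, CCX1 = logical CPUs 8-15.
# On Windows a mask like this could be handed to SetProcessAffinityMask();
# on Linux the equivalent CPU set would go to os.sched_setaffinity().

LOGICAL_PER_CCX = 8

def ccx_affinity_mask(ccx):
    """Bitmask with one bit set per logical CPU in the given CCX."""
    base = ccx * LOGICAL_PER_CCX
    mask = 0
    for cpu in range(base, base + LOGICAL_PER_CCX):
        mask |= 1 << cpu
    return mask

print(hex(ccx_affinity_mask(0)))  # CCX0 -> 0xff
print(hex(ccx_affinity_mask(1)))  # CCX1 -> 0xff00
```

The trade-off is that the pinned process gives up half the chip’s cores in exchange for never paying the cross-CCX fabric latency.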
Allyn, I agree with all of your article but one aspect. The Win 10 scheduler is ALMOST OK. The L3 cache should be reported as 2x 8 MB rather than a unified 16 MB; treating it as a single 16 MB cache will have cores from one CCX looking for data in the second CCX’s L3, which carries a latency penalty.
Ryzen R7 is effectively a dual quad-core, a Core 2 Quad in AMD’s vision (except the Core 2 Quad was a quad-core processor, not octa-core, and core coherency was handled over the FSB instead of a fabric interconnect). Ryzen R7 should be seen as something like a dual-socket system.
Wait for AMD to properly describe their Infinity Fabric IP; they will not talk about that IP until the Zen/Naples and Radeon/Vega products are fully RTM. NDAs are still in effect!