WCCFTech found some rumors (scroll down near the bottom of the linked article) about “Rome,” the upcoming generation of AMD’s EPYC server processors. The main point is that buyers will be able to get up to 64 cores (128 threads) on a single packaged processor. This increase in core count will likely be due to the process node shrink, from 14nm down to GlobalFoundries’ 7nm. This is not the same as the upcoming second-generation Zen processors, which are built on 12nm and expected to ship in a few months.
Rome is probably not coming until 2019.
But when it does… up to 128 threads. Also, if I’m understanding WCCFTech’s post correctly, AMD will produce two different dies for this product line: one design with 12 cores per die (x4 for 48 cores per package) and another with 16 cores per die (x4 for 64 cores per package). This is interesting because AMD apparently expects to sell enough volume to warrant multiple chip designs, rather than just making a flagship and filling out the SKU stack by bin sorting, disabling the cores that require abnormally high voltage at a given clock and selling the result as lower-core-count parts. (That will happen too, as usual, but from two different intended designs instead of just the flagship.)
If it works out as AMD plans, this could be an opportunity to take prime market share from Intel and its Xeon processors. The second chip might let them get into second-tier servers with an even more cost-efficient part, because a 12-core die will bin better than a 16-core one and, being smaller, will yield more chips per wafer anyway.
Again, this is a common practice from a technical standpoint; the interesting part is that it could work out well for AMD from a strategic perspective. The timing and market might be right for EPYC in various classes of high-end servers.
Personally, I think the 12-core die is Zen 2 and the 16-core die is Zen 3.
The current Zeppelin die is less than 200 mm². Even doubling the CCX count would not double the die size; I would have to spend some time with a die photo to work up an estimate. Intel is doing up to 28 cores on 14 nm, so a 16-core die should be easily doable, especially on 12 nm or smaller. I don’t know if I believe 7 nm in 2019. The “7 nm” figure probably doesn’t refer to anything real anyway, so they could always have a process they are calling 7 nm in 2019. Also, it is possible that this will be Zen 2 based; that is supposed to arrive in 2019. Server/workstation/HPC processors tend to lag a little behind desktop processors though, so who knows. I wouldn’t expect them to jump to 7 nm right away for an enterprise part.
Anything 7nm will be really low clocked; everybody is having trouble clocking high below 14-12nm. I don’t know if it will be great for desktop unless you have really well-multithreaded apps; I would think 3.5 GHz would be a good overclock on these. So mostly I think it would be a server chip, with 12nm for all desktop chips for the next couple of years anyway. They could do the 12-core chip on 12nm though, and have a larger 12nm 12-core for desktop while also using it for Threadripper, still getting the high clocks, while keeping the old 8-core Zeppelin around in limited runs. Or get rid of it altogether and just up the cores in each product segment. On server they usually run at about half the desktop speed anyway, so clocks won’t matter.
For Zen 2, I wouldn’t really be surprised if they make 4, 6, and 8 core variants for maximum flexibility. If higher clock speed is harder to reach on 7 nm, then they could skew the process tech towards higher clocks for the 4-core variants, or skew the 4-core layout for higher clock speed (wider spacing between components or whatever else can be used to raise clocks). The current Zeppelin die is skewed more towards high density (a GPU-like process/layout), which is a good trade-off since most systems are power limited to lower clocks anyway.
The 16-core part might be clocked way down for thermals and cost?
And will this lead to single-CCX Ryzen 5/7 parts?
GloFo’s 7nm is reported to have dramatically improved power consumption over 14nm, so clocks could well be similar to present chips, with double the cores, for the same TDP.
The rumours of CCX designs with 6 and 8 cores are interesting. I was expecting a 3 or 4 CCX chip, but a larger CCX is better for most users.
Rome will still be a 4-die package, so this means 12/16-core Ryzen 7 and 24/32-core Threadripper. And a possibility for 6/8-core APUs.
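For what it’s worth, the arithmetic behind those package counts works out exactly as stated; here’s a quick sketch (assuming the rumored 6/8-core CCXs, 2 CCXs per die, and today’s die counts per product, none of which is confirmed):

```python
# Package core counts implied by the rumored 6- and 8-core CCXs,
# assuming 2 CCXs per die and the current die count per product line.
ccx_sizes = (6, 8)                 # rumored cores per CCX (unconfirmed)
ccx_per_die = 2                    # same as today's Zeppelin
dies_per_package = {"Ryzen 7": 1, "Threadripper": 2, "EPYC (Rome)": 4}

for product, dies in dies_per_package.items():
    small, large = (c * ccx_per_die * dies for c in ccx_sizes)
    print(f"{product}: {small}/{large} cores")

# Ryzen 7: 12/16 cores
# Threadripper: 24/32 cores
# EPYC (Rome): 48/64 cores
```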
A CCX with 6 or 8 cores is harder than just adding more CCXs…
Adding more cores means more core-to-core links: in a 4-core scenario, each core needs three links, one to each of the other 3 cores.
If the CCX has 6 cores then each core will need 5 links; if it has 8 cores, 7 links will be needed.
And that is just the core links. This is why most 8+ core designs have already moved to a ring or mesh: these core-to-core links become very complicated and bring major issues to the die design.
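To put numbers on that scaling (a quick sketch; a full point-to-point mesh needs n-1 links per core and n(n-1)/2 links total, versus n links for a ring):

```python
# Link counts for a fully connected core cluster vs a ring.
# Full mesh: each of n cores links to the other n-1, so n*(n-1)/2
# links total. A ring only ever needs n links, whatever n is.
for n in (4, 6, 8):
    mesh_total = n * (n - 1) // 2
    print(f"{n} cores: {n - 1} links per core, "
          f"{mesh_total} total (mesh) vs {n} (ring)")

# 4 cores: 3 links per core, 6 total (mesh) vs 4 (ring)
# 6 cores: 5 links per core, 15 total (mesh) vs 6 (ring)
# 8 cores: 7 links per core, 28 total (mesh) vs 8 (ring)
```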
On the other hand, if they just add another CCX and nothing more, the IF behaves like adding a PCIe card into the mix: no extra complications. From a design perspective almost nothing changes; it’s a matter of rearranging elements/parts to best fit the new addition, with die proportions chosen to get the most dies from a single wafer without waste.
While I admit a 6C or 8C CCX has more going for it than connecting more CCXs over IF, since the latter adds latency, seeing how things went with Ryzen and its 2x CCXs proves the drawbacks are worth it.
Need 16 cores? No problem: add 2 more CCXs and that’s it.
Personally, the only thing I miss from the current Ryzen lineup is a mobile-specific 8C/16T chip. AMD currently doesn’t have such a chip; for mobile the max you get is 4C/8T, and all the 8C/16T CPUs are 65W+ TDP and not designed for mobile in the first place…
“most 8+ core designs have already moved to a ring or mesh: these core-to-core links become very complicated and bring major issues to the die design.
On the other hand, if they just add another CCX and nothing more, the IF behaves like adding a PCIe card into the mix: no extra complications. From a design perspective almost nothing changes”
Exactly, well put, I agree.
It’s inherent to the whole Zen architecture that the CCX be 4 cores. Beyond that, IF is pretty damn flexible, as you say.
My layman’s guess is they doubled up on the Zeppelin die.
Its new equivalent would go to 4x 4-core CCXs, doubling the cores per socket while adding the least latency.
My guess is that that is most likely, but there are persistent rumours that the core count per CCX is what has changed.
We’ll see soon enough.
6/8-core APUs would kill as business machines/light workstations at Costco or Best Buy, especially if they can pack in more than 11 CUs. They have a Dell with an R7 1700 and a 1050 Ti at my local Costco; if they ditch the discrete GPU, that’s money that can go into better RAM if they design it right.
“Rome is probably not coming…”
Well, you know the saying: “Rome wasn’t built in a day.”
They also say it was founded by two brothers raised by a wolf… I say it’s like a nice version of Mexico City.
AAAUUUGH!!!!!
From WCCFTech… rumor-mill central. If you throw enough balls of crap at a wall, some are bound to stick. They take rumors from rumors from rumors and often say “from a credible source that wishes to remain anonymous,” basically meaning “if we are wrong, which we often are in no small portion, we do not want to be held accountable, legally or otherwise.”
Guess PCPer wants to follow in anything-but-clean footsteps as of late.
As for EPYC eating away at Xeon and i7/i9 numbers, they are very much already doing so. AMD, via Ryzen/TR, has given buyers far more value for every $ spent than Intel has in a very long while (Intel tends to give the minimum possible, wants premium $$$, and changes sockets as often as underwear for no good reason other than $$$$$$$$$$$$$).
AMD themselves have spoken of 48 and 64-core EPYC CPUs coming in the future. The thing we don’t know yet is what the configuration will be – more cores per CCX, more CCXes per die, or more dies per package.
The tricky part about making that blanket claim (however much evidence there may be that it’s the case) is that WCCFTech is such a major hub for leaks and scoops that the couple of guys running it have built up a Rolodex of legit industry sources and a reputation as a safe place to leak to, so it regularly attracts major, significant, legitimate leaks and scoops.
YET at the same time, they are also blatantly and totally upfront about the fact that they’ll post absolutely anything they find at all plausible if it’ll draw clicks, whether they believe it to be true or not (or sometimes even when they outright know it isn’t). So for all the legit leaks they get and post, there are at least as many items in the big pile of BS and rumors surrounding them.
This means you can neither disregard everything they post on the grounds that it’s gotta be junk because they post a whole ton of junk, nor bank on any of it being legit because they also have a ton of legit sources and post a ton of legit leaks and news stories.
Basically, nothing on WCCFTech should be outright dismissed just because it’s on WCCFTech; instead, take it with a large grain of salt, with the size of that grain depending on the plausibility of whatever it is you happen to be reading.
What about a new CCX unit with 6 CPU cores per CCX, so there remain only 2 CCX units per die on the 12-core Rome variant? Maybe a 6-core CCX would have a little higher latency between the 6 cores on the same CCX, but that may be better than the higher IF latency of inter-CCX communication on the same die, the even higher latency of inter-die communication on the same MCM, or cross-socket IF latency.
The L3 cache (8MB total) on a current Zen CCX is subdivided into a 2MB slice per core, with every core on the CCX able to access the other cores’ 2MB L3 slices at the same average latency. So would 2 more cores cause the average intra-CCX L3 latency to go up for any core that needs more than its allotted 2MB of L3?
I’ll bet that any inter-CCX communication over the IF will have more latency than even the higher average intra-CCX latency a 6-core CCX would incur when a core needs an L3 transfer between those 2MB-per-core partitions. So picture a 6-core CCX with maybe even more than 10MB of L3, where each core rarely has to go outside its allotted L3 slice. Having 6 cores per CCX and giving each core more than 2MB of L3 would cut down on cross-slice traffic within the CCX, and would also reduce the overall inter-CCX, inter-die, and inter-processor cache-coherency traffic on the Infinity Fabric for any Rome SKUs.
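To put rough numbers on the slice arithmetic (a sketch; only the first row is a shipping part, the rest are the hypothetical configurations floated in this thread):

```python
# L3 capacity per CCX under different slice layouts. Only the first
# row reflects a shipping part; the others are hypothetical configs
# discussed in this thread, not confirmed designs.
configs = [
    ("current Zen CCX", 4, 2),                  # cores, MB of L3 per core
    ("hypothetical 6-core CCX", 6, 2),
    ("hypothetical 6-core, 4 MB slices", 6, 4),
]
for name, cores, slice_mb in configs:
    print(f"{name}: {cores} x {slice_mb} MB = {cores * slice_mb} MB L3 per CCX")

# current Zen CCX: 4 x 2 MB = 8 MB L3 per CCX
# hypothetical 6-core CCX: 6 x 2 MB = 12 MB L3 per CCX
# hypothetical 6-core, 4 MB slices: 6 x 4 MB = 24 MB L3 per CCX
```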
A larger L3 allotment per CCX would also reduce memory-transfer pressure if the 12-core dies on that Rome variant still have to share only 2 memory channels per die. Maybe AMD can add one more memory channel per die to relieve that, providing 12 memory channels per socket instead of the 8 per socket on current EPYC SKUs.
Looking at how big the IHS is on EPYC, it looks like AMD could fit 6 dies under it instead of just 4, if the 7nm node saves enough die area compared to 14nm. The OS just has to be made aware of the topology and support the NUMA/UMA modes, and software has to be able to request the proper core affinity from the OS so threads are not transferred across CCX boundaries while they are still executing workloads. I’d also be sure to fatten up the L3, and even the L2 a bit, at 7nm, since larger caches always help take the pressure off higher-latency memory accesses.
Expanding the CCX count isn’t likely for a whole host of reasons, including the increased latency obviously, but far more importantly the vastly increased difficulty of scheduling threads to minimize that latency. Coding for a 2-CCX design takes some work (see all the Ryzen-optimized game patches in the post-launch period, along with pro software, and especially all the server code currently being optimized by clients expecting the changes to be for the long haul), but is relatively easy all things considered. 3 CCXs ups the difficulty DRAMATICALLY, and with each CCX being tied into a memory controller, it would likely require 3 memory channels per die. And those are just the tip of the iceberg of reasons it doesn’t make sense. On the other hand, expanding the size of the CCX by 2 and 4 cores respectively [both even numbers, maintaining the core symmetry the CCX design requires for its cache layout] for the 2 rumored dies makes more and more sense the deeper you think about it.
For EPYC they don’t care much; a two-socket system already has 16 CCXs.
For the desktop, though, a CCX with more cores is a lot more attractive than three or four CCXs.
That’s totally and completely different, actually. The effect of the CCX count on a die and the total core count from however many dies they connect are unrelated. The latter can be as high as you want, as long as you can fit all the Infinity Fabric 256-bit bidirectional crossbar buses on the package, connecting every die to every other die. They’d have to totally redesign that Infinity Fabric bus for 3-CCX dies as well (in addition to a million billion other changes), likely to a 384-bit design, meaning the off-die-facing IF bus on the chip would need expanding too, though the 2x intra-die IF buses would need to stay 256-bit. This would all be such a cluster**** nightmare that I will literally eat my shorts if this is the route they go. It’s a massive amount of work for literally no reason. Expanding the 2x CCXs has all the same benefits, but FAR superior performance versus a 3x design, on top of being infinitely simpler to implement.
The individual cores in each CCX do not connect to the crossbar directly. AMD was using a crossbar with many ports way back in 2003 with Opteron: it had 3 HyperTransport links, 2 cores that connected directly to the crossbar, and the memory controller. That makes 6 ports in a processor from 2003 on a 130 nm process; these upcoming processors will be on 12 nm. With Zen, the memory controller just connects to the crossbar switch, not directly to a CCX, and each CCX has only a single connection to the crossbar. The individual cores within a CCX essentially communicate through the shared L3 cache and L2-to-L2 snoops. The crossbar already has a huge number of ports, although the ones handling x16 links only need about half the bandwidth of the local memory controller, so they may not all be full width. If all of the links have separate ports, then it could be 12 or more in the current design: 4 x16 IO, 4 x16 interprocessor, 1 on-die IO, 1 memory controller, and 2 CCXs.
The per-CCX core-to-core L3 latency increase from a larger CCX is minimal compared to the extra steps required over the Infinity Fabric for inter-CCX cache-coherency traffic, let alone inter-die or inter-socket IF coherency traffic. So at 7nm AMD could create a 6-core CCX and keep the Infinity Fabric and the rest of the IF-related assets mostly the same as they currently exist in the first-generation Naples/EPYC SKUs. AMD would just need to double each core’s L3 allotment in any 6-core CCX redesign, from 2MB to 4MB per core, for a total of 24MB of L3 per CCX (48MB per Rome/Zeppelin die). That would cut down on a core exhausting its own L3 slice and having to use another core’s allotment, which keeps inter-core cache-coherency traffic on the same CCX to a minimum, as well as any cross-CCX coherency traffic over the Infinity Fabric.
So as far as redesigning with the least amount of change to the previous generation’s IF topology goes, a larger CCX of 6 or 8 cores would be the least disruptive, and doubling the per-core L3 slice would reduce cross-core cache-coherency traffic along with the really latency-costly main-memory accesses. AMD could also increase the per-core L2 at 7nm, which would improve performance as well.
AMD could also put out an MCM processor with 6 dies per MCM and create a new motherboard design to go with that SKU. That would let AMD increase the core count without increasing the socket count on some future EPYC designs, keeping motherboard cost down by retaining only 2 sockets while raising the core count and even the per-socket memory channel count. Custom designs for specific cloud-services customers are not uncommon in the server market, and if enough cloud customers signed on to a project, AMD could create some customized EPYC SKUs/motherboards. The big in-memory database and AI inferencing business is only going to get larger, and AMD needs to be working with all its server/workstation customers on an upgrade path now that it is back in the server market big time with its EPYC designs.
AMD should also look at getting some on-MCM HBM2 stacks tied into the CPU dies and using them like a last-level/L4 cache. That would reduce DIMM-based DRAM traffic per socket and cut the latency hits from slower off-MCM memory accesses. So maybe a two-tier DRAM arrangement, with HBM2/HBM3 at the top of the pyramid and the slower DIMM-based DRAM below, all the way down to page swap and hot-file storage on the SSD and less-active staged file storage on the hard drive.
That is not how the CCXs and memory controllers are connected. A memory channel does not connect directly to a CCX. Some of the EPYC processors have completely disabled CCXs, and this does not disable a memory controller. The memory controller, with 2 channels, connects to a crossbar switch. It operates at memory clock, but the memory operates at double data rate, so it has to be twice the width (256-bit) to run at full bandwidth. AMD has been using a crossbar switch since the original Opteron in 2003; in the Opteron processors, the memory controller, each processor core, and each HyperTransport link had a separate connection to the crossbar. With Zen, it doesn’t look like each core even connects directly to the crossbar. The cores within a CCX connect via the L3, essentially, and then the CCX has a single connection to the crossbar. The number of CCXs does not depend on the number of memory channels at all. It is unclear how the external connections attach to the crossbar: there are 4 x16 IO links, 4 x16 interprocessor links, and also some on-die IO that needs connecting. That is possibly a lot of ports already: 4 external IO, 4 interprocessor, at least one on-die IO, 1 memory controller (128-bit), and 2 CCXs, which is 12 by my count. Adding 2 more ports for extra CCXs doesn’t sound like that big of an issue; there is also the possibility that they are already there, just unused. Adding more cores to a CCX would be a much more major redesign.
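Tallying those ports up (a sketch based purely on the links enumerated above; the real internal topology isn’t public):

```python
# Adding up the crossbar ports enumerated above for one Zeppelin die.
ports = {
    "x16 external IO links": 4,
    "x16 interprocessor links": 4,
    "on-die IO": 1,
    "memory controller (128-bit)": 1,
    "CCX connections": 2,
}
print(sum(ports.values()), "ports")  # 12 ports
```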
Someone over at Reddit has an embedded link to the ISSCC 2018 slides for Zeppelin. They even include memory-latency slides with the different times (ns) it takes to access local memory, memory on the same socket but a different die, and memory across to the other socket via the unified memory controllers on those dies. The Infinity Fabric is what every other uncore functional block sits atop, including the unified memory controllers. AMD even has CAKE (Coherent AMD SocKet Extender) baked in. You can go over to Reddit and view the slides:
“[News]ISSCC 2018 slides for Zeppelin: an SoC for Multi-chip Architectures”
https://www.reddit.com/r/Amd/comments/7z6e26/isscc_2018_slides_for_zeppelin_an_soc_for/
AMD still has one big problem – Intel hardware runs 98% of server farms/the cloud, and won’t easily be ousted. EPYC isn’t cheaper by enough to convince people to abandon Intel’s trusted and proven platforms. I would be very much surprised if AMD got even 10% of that market in the next decade, even with the Spectre/Meltdown fiasco.
I’m going to guess you aren’t super familiar with the server market? EPYC is crazy appealing; it’s the lack of openly available stock and servers that’s the problem. AMD is doing a long rollout focused first on satisfying its primary major partners and their relatively massive-scale installations (Microsoft, Baidu, Amazon, etc.), and that is what’s causing the shortage.
Why is EPYC so damn appealing, you ask? Feature-set segmentation. A single-socket EPYC server has more I/O and memory capacity than all but the most expensive DUAL-socket Xeon-SP platforms, plus a memory-bandwidth edge (EPYC = 8-channel, Xeon-SP = 6), because all of Intel’s single-socket offerings are incredibly gimped. And even against Intel’s best, a Xeon Platinum, chip to chip, EPYC still has an outright I/O-lane, memory-capacity, and bandwidth advantage. Basically, for any memory-heavy workload, any EPYC SKU will be a fraction of the total cost of a Xeon-SP that’s been unlocked enough (at which point you’ve also paid for a bunch of cores and features you don’t need). Also, EPYC’s MCM design plus AMD’s memory-encryption features make it a virtualization MONSTER: each VM runs in its own encrypted sandbox whose data the other VMs cannot access, and vice versa. This is reportedly why Microsoft is moving its Azure VM servers over to EPYC, and I believe Amazon is following suit with AWS. These are just a few of the reasons it’s so desired in many markets (not that it does everything better than Xeon-SP, because that’s totally not true either! But there is definitely a ton of unmet demand atm).
Thanks. It’s anecdotal posts like yours I find most credible, and that’s much what others have said.
I think something similar could be said of the Vega tie-in with EPYC for GPU compute; pilot projects alone have saturated the HBM memory supply chain for Vega.
You list EPYC’s strengths. Reading between the lines, I suspect this is a major strength too: corporates much prefer a good all-rounder to standardise on if they can.
That information is not anecdotal, and you definitely have an agenda. Keep up the same level of posts and folks can easily see that you have a paymaster and a script to follow!
The FUD is strong with you, but your gray matter is severely lacking. Maybe you would be better off on an ESPN sports blog, where things are not too complicated for that little single gray cell floating in a rather large ocean of lipids encased in a thick layer of bone.
EPYC/SP3 motherboards support 8 memory channels per socket and 128 PCIe lanes on both the 1P and 2P EPYC boards. So right there Intel is behind, and AMD’s EPYC product offerings are not as segmented as Intel’s! Cue Linus in the Asian rain, sitting on that curb, face-palming while pondering Intel’s product-segmentation schemes ($$$$$), with extra $$$$ required for Intel’s RAID keys, extra $$$$ required for more PCIe lanes, etc.
Oh, and enterprises do not have to worry about Meltdown on EPYC like they do on Intel’s kit! Enjoy your Fuckwit Trampoline, and don’t bounce too high or you’ll hit the ceiling and that one gray cell will get seasick!
Praise Scott Michaud and his wonderful article. [/`.`]/
It is almost certainly extra CCXs on a single chip rather than larger CCXs. If they are talking about separate dies for 12 and 16 cores, then it doesn’t really make sense for them to be made of anything but 4-core CCXs. If it were larger CCXs, the 12-core part would be salvage from 8-core-CCX production, not a different die.
They could go with larger CCXs for Zen 2, the next-gen part, not the Zen+ that releases in April, which seems to be mostly a die shrink. Zen 2 going with larger CCXs also seems a bit unlikely, though not impossible. They are trying to get developers to optimize their software for the 4-core CCX, and producing an 8-core CCX would be counterproductive to some extent. The CCX architecture AMD has designed has some advantages over the more monolithic design Intel is using.
Interconnect consumes a huge amount of power in modern processors. The ring bus Intel uses is very wide (256-bit?) and drives data long distances across the chip at core clock, which takes a huge amount of power. The larger Intel chips had multiple ring buses to help reduce latency, because the ring-bus architecture doesn’t scale well to larger numbers of cores. The mesh network on the new Xeon processors can provide low latency for a larger number of cores, but it seems to take even more power than the ring bus did. They also shrunk the L3 and increased the L2 caches to make the cores less dependent on the L3 anyway.
With AMD’s architecture, you get low latency and low power consumption as long as you stay within the CCX, which isn’t that big a limitation with a single CCX supporting 8 threads and 8 MB of L3. Some chips go as low as 2 cores per CCX, but each core gets significantly more L3 as cores are disabled. Larger numbers of cores will increase memory latency if there is a unified last-level cache; Intel has attempted to work around that with their mesh design, but it seems to cost a lot of power. Sharing data across CCXs will be slower, but since it happens at a lower clock, it will also be lower power. If software is optimized to take advantage of the lower latency and lower power consumption available within a CCX, then low latency is not required between CCXs. With the number of cores AMD is scaling up to, I doubt there is a way to provide a unified cache without burning a ridiculous amount of power. Moving data is not free, especially at core clock.
Larger CCXs of 6/8 cores are easy to design without having to redesign as much of the uncore on the Zeppelin die or the off-die IF interconnects and motherboard socket. A 6/8-core CCX is very doable, and at 7nm there will even be room for larger per-core L3 slices, so less stress is placed on the Infinity Fabric by excessive cache-coherency and memory traffic. AMD can even increase the L2 for each core and get performance/latency improvements on-core, between cores, and across the socket, because larger caches mean less IF traffic, with more of the needed data likely to be found in the larger L3/L2 rather than out in system memory at a higher latency cost.
AMD can also re-lay-out the Zen 2 cores and dies to save space with higher-density design libraries; on Carrizo at 28nm, AMD managed to shrink the CPU core area by about 30% while staying on the 28nm node, without a die shrink. So for the server market, which wants more cores running at lower clocks for maximum efficiency, more cores per CCX with no need to update the EPYC/SP3 motherboards would be popular with the cloud-services providers.
My current suspicion is that they may make 4, 6, and 8 core CCX variants for maximum flexibility. They are already using high-density design libraries, so that is not new, and they are unlikely to go through another complete redesign. Increasing the L3 size seems unlikely too, but that is highly dependent on how it fits into the rest of the system architecture. Intel went with a smaller L3 per core and increased the L2 from 256 KB to 1 MB; AMD is currently at 512 KB of L2 with Zen and a relatively large L3 slice per core. Cache design is one of the most complex parts of a modern CPU, so drastic changes are probably unlikely. The 8 MB per CCX is already quite large, but doubling the core count will put more pressure on it. Increasing the L2 size may be a much larger design change, and increasing the L3 may not be necessary. If they upgrade the interprocessor links to PCI Express 4.0 speeds or higher, then the off-die latency will be much lower. I don’t know if they will have DDR5 support by then, but that could increase speed significantly. Also note that larger caches do not really reduce cache-coherency traffic.
64 cores, 128 threads!! Awesome computer chess CPU! I’ll buy one!
I wonder how they are handling the Spectre variants. I have been wondering if they could add an address-space identifier (ASID) to the branch-prediction table and keep multiple tables available. That would eliminate the threat across processes and between kernel and user space. There is really only one kernel address space, so only one table would be needed for kernel-space addresses. The branch predictor must be very fast to access, though, so I don’t know if there would be time to pick a table and then look up an entry in it. Also, there may not have been time to add a complicated design change; even a single address-space identifier would allow smarter clearing of branch-prediction tables and would be a small change.
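Here is the tagging idea as a toy software model (purely hypothetical; real predictors are small hardware structures indexed by hashed PC bits, and this says nothing about how actual AMD silicon mitigates Spectre):

```python
# Toy model of an ASID-tagged branch predictor: prediction state is
# kept per address-space identifier, so one process can never train
# an entry that another process (or the kernel) will consume.
# Purely illustrative; real predictors are not dicts.

class TaggedBranchPredictor:
    def __init__(self):
        self.tables = {}  # one prediction table per ASID

    def train(self, asid, pc, outcome):
        self.tables.setdefault(asid, {})[pc] = outcome

    def predict(self, asid, pc):
        # Lookups only ever touch the current ASID's table, so
        # cross-address-space poisoning is impossible by design.
        return self.tables.get(asid, {}).get(pc, "not-taken")

bp = TaggedBranchPredictor()
bp.train(asid=1, pc=0x400123, outcome="taken")
print(bp.predict(asid=1, pc=0x400123))  # taken
print(bp.predict(asid=2, pc=0x400123))  # not-taken: ASID 2 cannot see
                                        # (or poison) ASID 1's state
```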
The IO chip is too big for I/O and switching logic only. I bet on a huge pool of L4. A big pool of faster-than-DIMMs memory would definitely help keep the cores busy in many server workload scenarios.
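As a back-of-the-envelope illustration of why an L4 would help, here is the standard average-memory-access-time formula with made-up placeholder latencies and hit rates (none of these numbers are real Rome figures):

```python
# Average memory access time (AMAT) for misses that check a
# hypothetical on-package L4 before going to DRAM. All latencies and
# hit rates below are placeholders, not measured Rome numbers.
def amat(l4_hit_rate, l4_ns, dram_ns):
    return l4_hit_rate * l4_ns + (1 - l4_hit_rate) * dram_ns

DRAM_NS = 90.0  # placeholder local-DRAM latency
L4_NS = 40.0    # placeholder on-package L4 latency

for hit in (0.0, 0.5, 0.8):
    print(f"L4 hit rate {hit:.0%}: {amat(hit, L4_NS, DRAM_NS):.0f} ns")

# L4 hit rate 0%: 90 ns
# L4 hit rate 50%: 65 ns
# L4 hit rate 80%: 50 ns
```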