Core and Interconnect
At IDF we finally learned some more about the Skylake core architecture powering the 6th generation processors from Intel.
The Skylake architecture is Intel’s first to get a full release on the desktop in more than two years. While that might not seem like a long time in the grand scheme of technology, for our readers and viewers it is a noticeable shift from the recent cadence Intel has created with its tick-tock release model. Yes, Broadwell was released last year and was a solid product, but Intel focused almost exclusively on mobile platforms (notebooks and tablets) with it. Skylake will become ubiquitous much more quickly than even Haswell did.
Skylake represents Intel’s most scalable architecture to date. I don’t mean only frequency scaling, though that is an important part of this design, but rather scaling across market segments. Thanks to brilliant engineering and design from Intel’s Israeli group, Intel will be launching Skylake designs ranging from 4.5 watt TDP Core M solutions all the way up to the 91 watt desktop processors we have already reviewed in the Core i7-6700K. That’s a range we really haven’t seen before; in the past Intel has depended on the Atom architecture to cover the lowest power platforms. While I don’t know for sure whether Atom is finally trending towards the dodo once Skylake’s reign is fully implemented, it does make me wonder how much life is left there.
Scalability also refers to the package size – something that ensures the designs the engineers created can actually be built and run in the platform segments they are targeting. Desktop designs for LGA platforms (the DIY market) sit on a 1400 mm2 package for the 91 watt TDP implementation, and Intel scales all the way down to 330 mm2 in a BGA1515 package for the 4.5 watt TDP designs. Only with a total package size like that can you hope to get Skylake into a form factor like the Compute Stick – which is exactly what Intel is doing. And note that the smaller packages must integrate the platform I/O chip as well, something the H- and S-series CPUs can depend on the motherboard to provide.
Finally, scalability also includes performance scaling. Clearly the 4.5 watt part will not offer the same performance as the 91 watt Core i7-6700K, nor does it target the same goals. The screen resolution, attached accessories and target applications allow Intel to be selective about how much power is required for each series of Skylake CPUs.
Core Microarchitecture
The fundamental design theory in Skylake is very similar to what exists today in Broadwell and Haswell, with a handful of significant changes and hundreds of minor ones that make Skylake a large step ahead of previous designs.
This slide from Julius Mandelblat, Intel Senior Principal Engineer, shows a high-level overview of the entirety of the consumer integration of Skylake. You can see that Intel’s goals included a bigger and wider core design, higher frequency, improved ring architecture and fabric design, and more options for eDRAM integration. Readers of PC Perspective will already know that Skylake supports both DDR3L and DDR4 memory technologies, but the inclusion of the camera ISP is new information for us.
The Skylake core has had minor changes and nip-tucks done across the board that add up to a significant gen-on-gen IPC improvement. These include things you might normally expect: branch prediction improvements and buffer capacity increases, faster prefetch capability, and deeper out-of-order buffers for better instruction parallelism. The execution units themselves have also been improved with lower latencies, more units and better power efficiency when not in use. Load and store bandwidth has been increased in the core with deeper store and fill buffers, better page miss handling and faster L2 cache miss handling. Even HyperThreading is slightly improved with wider retirement.
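To make the branch prediction point concrete, here is a minimal C microbenchmark sketch of our own (not an Intel test) that times the same conditional loop over random and then sorted data; the gap between the two runs is largely the cost of mispredicted branches, which is exactly what a better predictor shrinks. Compile with a modest optimization level (e.g. -O1), since an aggressive compiler may convert the branch to a conditional move and flatten the difference.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sum only elements >= 128. With random data the branch is unpredictable;
   with sorted data the predictor locks on and the loop runs much faster. */
static long long conditional_sum(const int *data, int n) {
    long long sum = 0;
    for (int i = 0; i < n; i++)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int *data = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = conditional_sum(data, N);   /* unpredictable branches */
    clock_t t1 = clock();

    qsort(data, N, sizeof(int), cmp_int);

    clock_t t2 = clock();
    long long s2 = conditional_sum(data, N);   /* highly predictable branches */
    clock_t t3 = clock();

    printf("random: %lld in %.3f s, sorted: %lld in %.3f s\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(data);
    return 0;
}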
This table showcases the slight, iterative improvements from Sandy Bridge to Haswell and now to Skylake. Some of these changes are more substantial than you might expect from previous steps: in-flight stores are increased by 33% and scheduler entries are up by more than 60%. Individually these changes might not mean much, but combined they show improved parallelism for modern applications and operating systems.
The core architecture also includes power optimizations, such as resource configuration that can gate off the power-hungry AVX2 hardware when it is not in use. Resources that are not being used have, in general, been scaled down somewhat. Scenario-based power management, useful for media playback workloads, allows for better mobile platforms; idle power is reduced and dynamic capacitance improves in the C1 state (for workloads with low performance requirements).
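For readers curious what that AVX2 hardware actually runs, here is a trivial intrinsics sketch of our own; the power gating itself is transparent to software, and nothing here is an Intel power-management API. Build with -mavx2 on gcc or clang.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int c[8];

    /* 256-bit integer adds execute on the wide AVX2 units; per Intel, that
       hardware can be power-gated whenever no such instructions are in flight. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi32(va, vb);   /* eight 32-bit adds at once */
    _mm256_storeu_si256((__m256i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);                 /* prints "9 9 9 9 9 9 9 9" */
    printf("\n");
    return 0;
}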
Interconnect and Memory Feature Improvements
Maybe the most impressive changes in Skylake come in the cache and memory architecture. Throughput on the LLC (last level cache) has been doubled when handling misses. The fabric, part of the ring bus design, has double the available internal bandwidth for moving data from agent to agent without increasing power, and with only a 50% increase in transistor usage. Memory QoS has been improved to aid the implementation of the new image signal processor (ISP) and higher resolution displays.
This fabric performance improvement should not be overlooked. With a move to DDR4 memory and changes to the eDRAM, this improvement is directly visible in synthetic testing and could be a way to gain some impressive performance in very specific workloads.
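As a rough idea of what that synthetic testing looks like, a STREAM-style triad is the usual tool. The sketch below is our own illustration with arbitrary array sizes, sized well beyond the LLC so the traffic flows across the ring fabric and out to DRAM; compile with -O2.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i]. Sustained GB/s exercises the
   LLC/ring fabric and the DRAM path. */
#define N (64 * 1024 * 1024 / sizeof(double))   /* 64 MB per array */
#define REPS 10

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double s = 3.0;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + s * c[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Three arrays touched (two reads + one write) per element per rep. */
    double gb = (double)REPS * 3.0 * N * sizeof(double) / 1e9;
    printf("triad: %.2f GB/s (checksum %.1f)\n", gb / secs, a[0]);
    free(a); free(b); free(c);
    return 0;
}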
eDRAM performance and usability have been improved as well – it is seen by all memory accesses and is fully coherent. It can cache any data in the processor, and there is no longer a need to flush it for maintenance purposes. It can be utilized by I/O devices and the display engine in order to take advantage of low power display refresh capabilities.
eDRAM Integration on Broadwell
In previous iterations of the eDRAM, a portion of the LLC (25%) was used to hold the eDRAM access tags, and the eDRAM wasn’t able to communicate with the rest of the system directly.
eDRAM Integration on Skylake
Skylake moves the eDRAM controller into the system agent, freeing up 512KB of LLC capacity while also giving other parts of the processor easier access to the data in the eDRAM. That memory can now interface with the main system DRAM directly and enable display refreshes without waking other portions of the processor that might be powered down during idle states.
Unfortunately, even though we are told there are more SKUs coming with the eDRAM integration, Intel has no plans to offer a consumer LGA part using it for compute workloads. As with Haswell, I am disappointed by that decision.
I can’t read the second page, it says:
You are not authorized to access this page.
Fixed that for you, sorry about that.
No problem. I just wanted to let you know.
OK, the reorder buffer is a little larger, from 192 to 224 entries, as are some of the other metrics, but what about instruction decoder counts and execution pipeline counts? The ring bus improvements will help some for heavy loads, but I’ll bet they will help more for SKUs with maybe 6 or 8 cores and less dramatically for 4 cores or fewer.
And what are the improvements in the GPGPU abilities of Intel’s GPU EUs compared to AMD’s ACEs or Nvidia’s CUs? Does Intel have asynchronous GPU resources comparable to AMD’s or Nvidia’s? I’d like to see a more direct comparison and contrast among Intel’s, AMD’s, and Nvidia’s GPU cores/EUs, including their use for GPGPU workloads as well as graphics workloads. GPUs that can only be fed kernels from the CPU are not going to be competitive going forward for GPGPU and graphics; it’s the GPUs that are able to run and dispatch their own kernels, while also being able to send workloads back to the CPU, that are going to be more useful, especially where latencies are concerned, among many other factors.
In non-L4 Broadwell chips, there’s a full 2MB of LLC, right? The Wikipedia page doesn’t seem to touch on this point.
If that’s so, then in Broadwell, only chips with L4 paid the LLC cost for the L4 tags. With Skylake, each core gets its full 2MB of LLC, but all chips have to pay the cost of the L4 tags–not just the chips that have it.
Great, not only do I have to pay for that 40% of the chip doing graphics I don’t want, I have to pay for the L4 tag which I won’t be using.
You do not pay for it if you are buying a product without eDRAM.
No, the L4 tags are on all chips. Last generation the L4 tags were a configurable portion of the LLC – so you lost LLC in the L4 variants of the processor, but not otherwise. Now you pay for the L4 tags on every chip, but the L4-equipped chips get to keep their whole LLC.
Can you see the L4 tags on the system agent in the released die shot? I can’t.
It’s likely that they have a different system agent (with L4 tags and eDRAM memory controller) for the versions that need it.
Actually, yeah, I can see it. A quick Google search linked to a picture at WCCFtech:
http://cdn.wccftech.com/wp-content/uploads/2015/08/Intel-Core-i7-6700K-Block-Diagram.png
Look in the system agent where it says “& I/O controllers” – the block that has “& I/O controll” in it, with the “ers” outside of the block. That’s the L4 tags.
They almost certainly had specialized hardware embedded in the L3 to support storing tag data there. This hardware now moves to the memory side, which probably makes the L3 cache and the L4 access hardware simpler overall. The previous L4 (eDRAM) just acted as a victim buffer for the L3; the new L4 eDRAM cache probably acts as a simpler cache. Given how large it is, I wouldn’t think it would need to be exclusive. Anyone know whether Intel’s L3 cache is physically or virtually addressed? The eDRAM cache can be simplified significantly since it is on the memory side.
I was wondering if they made the block size larger. When it was a victim buffer for the L3, I would think that would have forced them to use a cache line the same size as the L3 cache line. Since it sits between the system agent and the memory controller now, they can use any line size they want, although it should be some multiple of the L3 line size. For graphics workloads, it is probably best to go with a larger size. At 14 nm, the size of the L4 tags probably isn’t that important. I don’t really see why people complain about wasting die space; when it comes to Intel chips, even if they were smaller, Intel wouldn’t have much reason to charge less. Prices are not strictly based on die size – prices are what people will pay, and I think Intel makes a healthy margin considering their profits. Going forward, the on-die GPU may actually be useful even if you are running a dedicated card.
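To put rough numbers on the line-size point, here is a quick back-of-the-envelope calculator of our own. The tag width is a pure guess for illustration, not Intel’s actual design, but it shows how doubling the line size halves the tag count and therefore the tag storage.

#include <stdio.h>

/* Rough tag-storage estimate for an eDRAM cache: one tag per line, so a
   larger line means fewer tags. The 24-bit tag+state width is assumed. */
int main(void) {
    const double tag_bits = 24.0;                    /* assumed, illustrative */
    long long cache_bytes[] = {64LL << 20, 128LL << 20};
    int line_bytes[] = {64, 128, 256};

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++) {
            long long lines = cache_bytes[i] / line_bytes[j];
            double tag_kb = lines * tag_bits / 8.0 / 1024.0;
            printf("%3lld MB eDRAM, %3d B lines: %8lld tags, ~%6.0f KB of tag storage\n",
                   cache_bytes[i] >> 20, line_bytes[j], lines, tag_kb);
        }
    return 0;
}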
This is basically a regurgitation of the slides. Thanks.
“There are some caveats of course – this only works with Windows 10 today as it requires some unique communication between the processor and OS.”
Infuriating.
Thanks for the early writeup.
“Another example is for higher resolution displays. In the move from 2560×1440 to 3840×2160 there is a 1.6x pixel increase but Intel was able handle that change with only a 1.2x in power.”
How do you get 1.6x?
4K = 8,294,400 pixels
1440p = 3,686,400 pixels
That’s 2.25x.
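The commenter’s arithmetic checks out:

#include <stdio.h>

int main(void) {
    long uhd = 3840L * 2160L;   /* 8,294,400 pixels */
    long qhd = 2560L * 1440L;   /* 3,686,400 pixels */
    printf("%.2fx\n", (double)uhd / qhd);   /* prints 2.25x */
    return 0;
}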
I’m not seeing a lot nor hearing a lot that tells me Skylake has serious improvements for the server side of things. That feels like a marketing stretch.
I would think most of these improvements would make a bigger difference on server applications than on consumer-level applications. Mispredicted branches are much more common in server code than consumer code. Increasing the out-of-order window along with a lot of other buffers should also benefit server code more, especially with hyper-threading: a lot of those resources are split in half when two threads are competing, so increasing them may help quite a bit. Anyway, servers seem to mostly be mentioned in regard to the chip’s ability to scale all the way from very low power up to high power and performance.
Intel has been designing mainly for the server room for decades, and then deriving its consumer parts from the server designs. Intel has been adding some specialized consumer IP and on-die functional blocks to its consumer SKUs, but the base microarchitecture is usually the same top to bottom across its server SKUs and PC/laptop SKUs. What Intel lacks is its own RISC design, or at least a RISC design with any market share; Intel had the i960 and newer variants, but discontinued its RISC product line(1).
Intel is too late for that market now that the ARM-based makers have the lead, and it’s not that Intel could not spend the funds and revive its RISC product line; it’s that the cost of developing a software ecosystem around a custom Intel RISC SKU would be too costly for even Intel to shoulder. The ARM-based software ecosystem took decades to develop and is still being refined, and those development costs are spread across an entire group of companies that make up the ARM ecosystem market. That includes the ARM hardware market as well, with some companies spending billions designing their own custom microarchitectures engineered to run the ARMv8-A ISA, as well as other ARM ISAs.
Mature software ecosystems can cost trillions over the years to develop and maintain. Intel is trying to break into a market that already has a ready-made ISA (ARM-based), one that actually came up from the very same devices market that all the marketing mavens are now calling the IoT market.
Intel is too far behind the curve for that market (mobile), and its current financials surrounding its contra-revenue losses have even been hidden by combining a money-losing division into a more profitable division to mask the losses that still continue.
Intel had better start paying close attention to the HSA designs of not only AMD but of the entire HSA Foundation membership, made up of many of the ARM market’s big and small players. The continued movement towards doing more general purpose calculation on the GPU’s cores could put Intel at a serious disadvantage as the GPUs of both AMD and Nvidia acquire more CPU-like abilities, and that includes the PowerVR mobile-only GPUs and the ARM GPUs likewise! When Intel finds that its GPUs can only run the kernels that its CPUs dispatch to them, while AMD’s and Nvidia’s GPUs can perform context switching and decision making on their kernels without any CPU feeding them, and can even dispatch work back to the CPU for it to continue processing the results, Intel will be in serious trouble. There are loads of graphics, physics and ray tracing workloads that even Intel’s Xeon processors take hours to compute compared to the work that AMD’s ACEs and Nvidia’s CUs can do in minutes on those massively parallel vector units, and ditto for the mobile processors and their more HSA-aware GPU hardware.
CPU-only compute is not going to stay competitive even for general purpose workloads going forward as more work is offloaded to the GPU! Just look at LibreOffice 5.0 and OpenCL, and the work that can be offloaded to the GPU – things are quite a bit faster with the GPU accelerating the calculations. The HSA-aware software is catching up with the HSA-aware hardware, and those hours-long workloads are only taking minutes on the GPU.
(1)
https://en.wikipedia.org/wiki/Intel_i960