Intel Speed Shift Technology
Power Reduction and Feature Changes
Because Skylake is built on the same 14nm process technology as Broadwell, the team needed to dive into microarchitecture and design innovations to make any improvements. Intel claims that every part of the interconnect, including IPs, I/O, PLLs and more, had power reduction work done on it individually. In particular, video playback and multimedia (capture, RealSense cameras) were targeted for more substantial and more direct power improvements.
Another example is for higher resolution displays. In the move from 2560×1440 to 3840×2160 there is a 1.6x pixel increase but Intel was able to handle that change with only a 1.2x increase in power.
Easily the most interesting new feature in terms of power is called Intel Speed Shift Technology. This feature moves much of the control of P-states (performance states) from the operating system to the processor itself. P-states are what tell the CPU to move between frequencies in order to balance performance and power consumption. In previous designs, Windows and other operating systems would perform the actual state changes. With Speed Shift, the processor changes P-states directly, and Intel claims a 30x improvement in the speed of that transition.
Why is this useful? First, the faster transition should result in added “snappiness” in areas where the frequency needs to increase quickly as a result of user interaction or application need, lowering the apparent latency of some actions. It also gives the Skylake processors the ability to manage low residency workloads better. Take video recording as a good example of this type of workload. Traditionally, a CPU would increase frequency to get through a set of work as quickly as possible in order to return to idle as fast as possible. For applications that run consistent, repeated, but non-demanding workloads, it might be more efficient to keep the CPU at a slightly higher frequency the entire time rather than spiking up and down repeatedly. Intel Speed Shift gives Skylake that capability.
There are some caveats of course – this only works with Windows 10 today as it requires some unique communication between the processor and OS. For older operating systems like Windows 8, or other operating systems like Linux, Speed Shift won’t work out of the box. Intel says it has started engaging with the open source community to integrate support, which is great, but until then you’ll essentially be reverting to legacy P-state controls on Skylake hardware. Also, Intel's engineers told us that Speed Shift works within a "window" of OS-based performance states, so it seems that Skylake does not have complete autonomy when it comes to selecting core frequencies.
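Under the hood, Speed Shift corresponds to what Intel's documentation calls Hardware-Controlled Performance States (HWP), and support for it is advertised through CPUID leaf 0x6. The following is only a minimal detection sketch, assuming GCC or Clang on x86; the bit positions follow the public CPUID.06H:EAX documentation and are worth double-checking against the Software Developer's Manual.

    /* Sketch: detect the hardware P-state (HWP) interface that backs Speed Shift.
     * Assumes GCC/Clang on x86 (<cpuid.h>). A set bit only means the hardware
     * offers the interface; the OS still has to opt in to use it. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x6, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 0x6 not supported\n");
            return 1;
        }
        printf("HWP (Speed Shift) supported:        %s\n", (eax & (1u << 7))  ? "yes" : "no");
        printf("HWP energy/performance preference:  %s\n", (eax & (1u << 10)) ? "yes" : "no");
        return 0;
    }

On an OS without Speed Shift support the capability bit will still read as set; it simply means the legacy P-state controls are what actually get used.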
Duty Cycle Control makes a return with Skylake as well, allowing the cores to emulate clock speeds lower than the lowest P-state offers by enabling and disabling the core as necessary. This is something of an inverse of the Speed Shift Technology noted above, but it can be useful for reducing leakage in very low power envelope designs.
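For readers who want to poke at the related, software-visible mechanism: the legacy clock-modulation (T-state) interface lives in the IA32_CLOCK_MODULATION MSR (0x19A) and requests duty cycling explicitly, whereas Skylake's Duty Cycle Control is applied autonomously by the hardware. A minimal Linux sketch for reading that MSR (an illustration only; it assumes the msr kernel module is loaded and root privileges):

    /* Sketch: read IA32_CLOCK_MODULATION (MSR 0x19A) via Linux's msr driver.
     * This is the legacy, OS-visible duty-cycle (T-state) interface, shown only
     * to illustrate the idea; Skylake's Duty Cycle Control is hardware-managed. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t val;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        if (pread(fd, &val, sizeof(val), 0x19A) != sizeof(val)) {
            perror("pread MSR 0x19A");
            close(fd);
            return 1;
        }
        printf("IA32_CLOCK_MODULATION = 0x%llx (on-demand modulation %s)\n",
               (unsigned long long)val, (val & (1u << 4)) ? "enabled" : "disabled");
        close(fd);
        return 0;
    }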
Skylake also extends Speed Step Technology to new domains, including the system agent, DDR and eDRAM I/O. Clocking those down during low bandwidth workloads frees up power budget that can be used to increase clock speeds on the cores and graphics.
An Embedded ISP
For the first time in a mainstream processor, Intel has integrated another co-processor for image signal processing (ISP).
This integration is a complete imaging and camera solution with full hardware and software integration. With support for up to 4 cameras (but only two at the same time), it’s clear that Intel is dedicated to the idea of RealSense technology for future devices and user interaction. Camera sensors up to 13MP are supported across the board and Intel is working with the ecosystem to improve time to market and lower engineering costs.
The ISP takes the input that flows from the CSI (camera sensor interface) through the camera control chipset and helps enable advanced imaging technologies like face detection, multi-stream capture, HDR and low light capture, burst, and more. And of course, embedding it in the chip means lower power and higher efficiency. Intel promises that the ISP will enable “zero shutter lag” and 4K/30Hz capture capability.
Intel Software Guard Extensions (SGX)
Intel SGX is a new set of extensions to the Intel architecture that enable secure computing environments called enclaves. These instructions allow an application to set up its own application-level trusted execution environment, or TEE. This lets any application keep a secret, whether that be code, data or both.
The benefits of this are obvious: protection from software attacks, including attacks from kernel-level software. SGX also protects against hardware attacks by keeping enclave contents in protected DRAM. The protected “secret” data, whatever the application has deemed that to be, cannot even be viewed using processor debug tools like ITP.
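Support for the SGX instruction set is likewise discoverable in software before an application attempts to build an enclave; it is reported in CPUID leaf 7. A minimal sketch under the same GCC/Clang-on-x86 assumption, noting that a set bit does not by itself mean enclaves can be launched, since firmware must also enable the feature:

    /* Sketch: check for SGX support via CPUID.(EAX=07H,ECX=0):EBX bit 2.
     * A set bit only means the instructions exist; BIOS/firmware still has to
     * enable SGX before enclaves can actually be created. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            printf("CPUID leaf 7 not supported\n");
            return 1;
        }
        printf("SGX instructions present: %s\n", (ebx & (1u << 2)) ? "yes" : "no");
        return 0;
    }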
Intel Skylake Graphics Technology
For more information on the graphics implementation in Skylake, be sure to check out my earlier dive into it.
Closing Thoughts
Honestly, there is still a lot we don’t know about Skylake. Even basic information like the total width of the execution cores seems to be a closely held bit of information that Intel is unwilling to openly discuss. Still, it is clear from what we did learn this week at IDF that Skylake is more than a simple refresh of the Intel Core architecture, and the engineers responsible for it have done an amazing job of balancing power, form factor and performance in ways unseen in any design from any other vendor.
If you want to find out about the SKUs and product-specific information for Skylake, check back with me here in very early September.
I can’t read the second page, it says:
You are not authorized to access this page.
Fixed that for you, sorry about that
No problem. I just wanted to let you know.
OK, the reorder buffer is a little larger, from 192 to 224 entries, as are some of the other structures, but what about instruction decoder counts and execution pipeline counts? The ring bus improvements will help some for heavy loads, but I’ll bet they will help more for SKUs with maybe 6 or 8 cores and less dramatically for 4 cores or fewer.
And what are the improvements in the GPGPU abilities of Intel’s GPU EUs compared to AMD’s ACEs or Nvidia’s CUs? Does Intel have comparable asynchronous GPU resources compared to AMD or Nvidia? I’d like to see a more direct comparison and contrast among Intel’s, AMD’s, and Nvidia’s GPU cores/EUs, and that includes their use for GPGPU workloads as well as graphics workloads. GPUs that can only be fed kernels from the CPU are not going to be competitive going forward for GPGPU and graphics; it’s the GPUs that are able to run and dispatch their own kernels, while also being able to send workloads back to the CPU, that are going to be more useful, especially where latencies are concerned, among many other factors.
In non-L4 Broadwell chips, there’s a full 2MB of LLC per core, right? The Wikipedia page doesn’t seem to touch on this point.
If that’s so, then in Broadwell, only chips with L4 paid the LLC cost for the L4 tags. With Skylake, each core gets its full 2MB of LLC, but all chips have to pay the cost of the L4 tags–not just the chips that have it.
Great, not only do I have to pay for that 40% of the chip doing graphics I don’t want, I have to pay for the L4 tag which I won’t be using.
You do not pay for it if you are buying a product without eDRAM.
No, the L4 tags are on all chips. Last generation the L4 tags were a configurable portion of the LLC–so you lost LLC in the L4 variants of the processor, but not otherwise.
Now, you pay for the L4 tags on every chip–but the L4-equipped chips get to keep their whole LLC.
Can you see the L4 tags on the system agent in the released die shot? I can’t.
It’s likely that they have a different system agent (with L4 tags and eDRAM memory controller) for the versions that need it.
Actually, yeah, I can see it. A quick Google search linked to a picture at WCCFtech:
http://cdn.wccftech.com/wp-content/uploads/2015/08/Intel-Core-i7-6700K-Block-Diagram.png
Look in the system agent where it says “& I/O controllers”. The block that has “& I/O controll” in it. The “ers” is outside of the block. That’s L4 tags.
They almost certainly had specialized hardware embedded in the L3 to support storing tag data there. This hardware now moves to the memory side, which probably makes the L3 cache and the L4 access hardware simpler overall. The previous L4 (eDRAM) just acted as a victim buffer for the L3. The new L4 eDRAM cache probably just acts as a simpler cache. With how large it is, I wouldn’t think it would need to be exclusive. Anyone know whether Intel’s L3 cache is physically or virtually addressed? The eDRAM cache can be simplified significantly since it is on the memory side.
I was wondering if they made the block size larger. When it was a victim buffer for the L3, I would think that would have forced them to use a cache line the same size as the L3 cache line. Since it is between the system agent and the memory controller now, they can use any size line that they want, although it should be some multiple of the L3 line size. For graphics workloads, it is probably best to go with a larger size. At 14 nm, the size of the L4 tags probably isn’t that important. I don’t really see why people complain about wasting die space. When it comes to Intel chips, even if they were smaller, Intel wouldn’t have much of a reason to charge less. Prices are not strictly based on die size; prices are what people will pay. I think Intel makes a healthy margin considering their profits. Going forward, the on-die GPU may actually be useful even if you are running a dedicated card.
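To put rough numbers on that intuition: the constants below are illustrative assumptions only, not published Intel figures (a 128MB eDRAM array as on Broadwell-C parts and roughly 4 bytes of tag and state per cached line), but they show how much the choice of line size swings the tag storage requirement.

    /* Back-of-the-envelope tag-storage estimate. All constants are assumptions
     * for illustration, not published Skylake/Broadwell numbers. */
    #include <stdio.h>

    int main(void)
    {
        const double edram_bytes  = 128.0 * 1024 * 1024;   /* assumed 128MB eDRAM      */
        const double tag_bytes    = 4.0;                    /* assumed tag+state per line */
        const int    line_sizes[] = { 64, 256, 1024 };      /* candidate line sizes      */

        for (int i = 0; i < 3; i++) {
            double lines = edram_bytes / line_sizes[i];
            printf("%4dB lines: %.0f lines, ~%.2f MB of tags\n",
                   line_sizes[i], lines, lines * tag_bytes / (1024 * 1024));
        }
        return 0;
    }

Under those assumptions, 64-byte lines would need several MB of tags, on the order of a core's LLC slice, which is one argument for tracking the eDRAM at a coarser granularity now that it sits on the memory side.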
This is basically a regurgitation of the slides. Thanks.
“There are some caveats of course – this only works with Windows 10 today as it requires some unique communication between the processor and OS.”
Infuriating.
Thanks for the early writeup.
“Another example is for higher resolution displays. In the move from 2560×1440 to 3840×2160 there is a 1.6x pixel increase but Intel was able to handle that change with only a 1.2x increase in power.”
How do you get 1.6x?
4K = 8,294,400 pixels
1440p = 3,686,400 pixels
That’s 2.25x
I’m not seeing a lot nor hearing a lot that tells me Skylake has serious improvements for the server side of things. That feels like a marketing stretch.
I would think most of these improvements would make a bigger difference on server applications than on consumer level applications. Mis-predicted branches are much more common in server code than consumer code. Increasing the out-of-order window along with a lot of other buffers should also benefit server code more, especially with hyper-threading. A lot of those resources are split in half when two threads are competing, so increasing those resources may help quite a bit. Anyway, servers seem to mostly be mentioned in regard to the chip’s ability to scale all the way from very low power up to high power and performance.
Intel has been designing mainly for the server room for decades, and then it derives its consumer parts from the server designs. Intel has been adding some specialized consumer-oriented IP and on-die functional blocks to its consumer SKUs, but the base microarchitecture is usually the same top to bottom across its server SKUs and PC/laptop SKUs. What Intel lacks is its own RISC design, or at least a RISC design that has any market share; Intel had the i960 and newer variants, but discontinued its RISC product line (1).
Intel is too late for that market now that the ARM-based makers have the lead, and it’s not that Intel could not spend the funds and revive its RISC product line; it’s that the cost of developing a software ecosystem around a custom Intel RISC SKU would be too much for even Intel to shoulder. The ARM-based software ecosystem took decades to develop, and is still being refined, and those development costs are spread across an entire group of companies that make up the ARM ecosystem market. That includes the ARM hardware market as well, with some companies spending billions designing their own custom microarchitectures that are engineered to run the ARMv8-A ISA, as well as other ARM ISAs.
Mature software ecosystems can cost trillions over the years to develop and maintain. Intel is trying to break into a market that already has a ready-made ISA (ARM-based), one that actually came up from the very same devices market that all the marketing mavens are now calling the IoT market.
Intel is too far behind the curve for that market (mobile), and its current financials surrounding its contra-revenue losses have even been hidden by combining a money-losing division into a more profitable division to mask the losses that still continue.
Intel had better start paying close attention to the HSA designs of not only AMD, but those of the entire HSA Foundation’s members, made up of many of the ARM market’s big and small players. That continued movement towards doing more general purpose calculations on the GPU’s cores could put Intel at a serious disadvantage as the GPUs of both AMD and Nvidia acquire more CPU-like abilities, and that includes the PowerVR mobile-only GPUs and the ARM GPUs likewise! When Intel finds that its GPUs are only able to run the kernels that its CPUs dispatch to them, while AMD’s and Nvidia’s GPUs can perform context switching and decision making on their kernels without any CPU feeding them, and can even dispatch work back to the CPU for the CPU to continue processing the results, Intel will be in serious trouble. There are loads of graphics and physics workloads, as well as ray tracing workloads, that even Intel’s Xeon processors take hours to compute compared to the work that AMD’s ACEs and Nvidia’s CUs can do in minutes, not hours, on those massively parallel vector units; ditto for the mobile processors and their more HSA-aware GPU hardware.
CPU-only compute is not going to compete even for general purpose compute workloads going forward as more work is offloaded to the GPU! Just look at LibreOffice 5.0 and OpenCL, and the work that can be offloaded to the GPU; things are quite a bit faster with the GPU accelerating the calculations (see the OpenCL sketch after the footnote below). The HSA-aware software is catching up with the HSA-aware hardware, and those hours-long workloads are only taking minutes on the GPU.
(1)
https://en.wikipedia.org/wiki/Intel_i960
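To make the LibreOffice/OpenCL point above concrete, here is a minimal, illustrative host-side sketch of offloading a trivial calculation to a GPU with the standard OpenCL C API. It is not LibreOffice’s actual code, error handling is pared down, and the kernel is deliberately trivial; it only shows the shape of the CPU-dispatches-kernel-to-GPU offload the comment describes.

    /* Sketch: offload a trivial vector addition to the first OpenCL GPU device.
     * Error handling and resource cleanup are kept minimal for brevity.
     * Build with: cc vadd.c -lOpenCL */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void vadd(__global const float *a, __global const float *b, __global float *c) {\n"
        "    size_t i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main(void)
    {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        cl_platform_id plat; cl_device_id dev; cl_uint n; cl_int err;
        if (clGetPlatformIDs(1, &plat, &n) != CL_SUCCESS) return 1;
        if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, &n) != CL_SUCCESS) return 1;

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", &err);

        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, &err);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, &err);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);

        clSetKernelArg(k, 0, sizeof(da), &da);
        clSetKernelArg(k, 1, sizeof(db), &db);
        clSetKernelArg(k, 2, sizeof(dc), &dc);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

        printf("c[10] = %f (expected 30.0)\n", c[10]);
        return 0;
    }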