There is an insidious latency gap lurking in your computer between your DRAM and your CPU's L3 cache. The size of that gap depends on your processor, as not all L3 caches are created equal, but regardless there are wasted CPU cycles which could be reclaimed. Piecemakers Technology, the Industrial Technology Research Institute of Taiwan, and Intel are on the case, with a project to design something to fit into that niche between the CPU and DRAM. Their prototype Last Level Cache is a chip with 17ns latency, which would let the L3 cache be filled more efficiently before passing data on up to the next level in the CPU. The Register likens it to the way Intel has fit XPoint between the speeds of SSDs and DRAM. It will be interesting to see how this finds its way onto the market.
"Jim Handy of Objective Analysis writes about this: "Furthermore, there's a much larger latency gap between the processor's internal Level 3 cache and the system DRAM than there is between any adjacent cache levels.""
Here is some more Tech News from around the web:
- Get this: Tech industry thinks journos are too mean. TOO MEAN?! @ The Register
- Google Releases an AI Tool For Publishers To Spot and Weed Out Toxic Comments @ Slashdot
- Nintendo Switch impressions: Out of the box and into our hands @ Ars Technica
- Galaxy S8+ specs revealed, 10nm Exynos 9 processor confirmed @ The Inquirer
- Ah, the Raspberry Pi 3. So much love. So much power … So turn it into a Windows thin client @ The Register
“there are wasted CPU cycles which could be reclaimed”
That's the job of the L2 and L1 caches above: to hide any latency in the DRAM-to-L3 transfers so the CPU cores are not starved for instructions to work on. CPUs even have out-of-order execution and reorder buffers to keep the core busy while any high-latency transfers happen in the background from DRAM into L3 and up into the L2 and L1 caches. There is even SMT to keep the core's execution pipelines more fully utilized, so no wasteful NOPs have to be injected into the pipelines' execution stages for lack of useful instructions to work on because of latency at any cache level.
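As an illustration of that latency-hiding point (my own sketch, with arbitrary sizes), a chain of dependent loads pays the full miss latency on every step, while independent loads to the same random locations let an out-of-order core keep several misses in flight at once:

```c
/* Sketch of the latency-hiding point above (my own illustration): a chain
 * of dependent loads exposes the full miss latency on every step, while
 * independent loads to the same random locations let an out-of-order core
 * keep several misses in flight. Sizes and counts are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 25)  /* 32M entries, roughly 256 MB of size_t */

static double ms_since(struct timespec t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    size_t *a = malloc((size_t)N * sizeof *a);

    /* Random permutation: a[i] holds the next index to visit after i. */
    for (size_t i = 0; i < N; i++) a[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = a[i]; a[i] = a[j]; a[j] = t;
    }

    struct timespec t0;
    volatile size_t sink;

    /* Dependent: each load's address comes from the previous load. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = a[p];
    sink = p;
    printf("dependent chain:   %.0f ms\n", ms_since(t0));

    /* Independent: addresses are known up front, so misses can overlap. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += a[a[i]];
    sink = sum;
    printf("independent loads: %.0f ms\n", ms_since(t0));

    free(a);
    return 0;
}
```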
That is a nice savings when you consider the relatively high clocks most CPUs run at: going from 30ns down to 17ns is definitely going to allow the L3 to be filled and serviced faster, and make main memory accesses less likely to cause upstream problems for any CPU cores sharing the L3 cache. It will greatly help CPUs with smaller L2/L3 caches to begin with, as those CPUs will be making more DRAM-to-L3 transfer requests through the memory controller. It will also help with dynamically changing workloads where non-localized code calls occur more often and require more DRAM transfers into the L3 cache. Multi-core CPUs that share an L3 can really stress its ability to keep up when there is excessive latency on DRAM-to-L3 transfers, so 17ns as opposed to 30ns represents a great savings.
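As a back-of-envelope check on those numbers, assuming a 4 GHz core clock (my assumption, not anything from the article), the difference works out to roughly 120 versus 68 core cycles per access:

```c
/* Back-of-envelope: what 30 ns versus 17 ns means in core clocks.
 * The 4.0 GHz clock is an assumed example, not from the article. */
#include <stdio.h>

int main(void)
{
    const double ghz = 4.0;                     /* assumed core clock */
    const double latency_ns[] = { 30.0, 17.0 };

    for (int i = 0; i < 2; i++)
        printf("%4.0f ns  ->  about %3.0f core cycles at %.1f GHz\n",
               latency_ns[i], latency_ns[i] * ghz, ghz);
    return 0;
}
```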
Cache efficiency is highly dependent on the code that is running. Server code tends to be much less cacheable than most consumer code, so a reduction in memory latency could deliver a good performance boost. There is already a lot of machinery in place to deal with latency, though, so the benefit could be limited to certain types of applications.
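A quick way to see how strongly the code determines cache behaviour is to sum the same matrix in row order and then in column order; the sketch below is only an illustration with an arbitrarily chosen 256 MB working set:

```c
/* Illustration of how much the access pattern matters (my own sketch,
 * arbitrary 256 MB working set): the same sum done in row order walks
 * memory sequentially and caches well, while column order jumps a full
 * row each step and misses far more often. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 8192  /* 8192 x 8192 ints = 256 MB, well past any cache */

static double sum_ms(const int *m, int by_rows)
{
    struct timespec t0, t1;
    long long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += by_rows ? m[i * N + j] : m[j * N + i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (sum == 42)
        puts("unlikely");  /* keep the compiler from dropping the loop */
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    int *m = malloc((size_t)N * N * sizeof *m);
    for (size_t i = 0; i < (size_t)N * N; i++)
        m[i] = (int)i;

    printf("row-major walk:    %.0f ms\n", sum_ms(m, 1));
    printf("column-major walk: %.0f ms\n", sum_ms(m, 0));
    free(m);
    return 0;
}
```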
I have wondered if AMD would make a special cache chip for server processors on silicon interposers. With DRAM chips, you never actually read directly from the array. A row of the array is latched into buffers which are essentially SRAM. Access to an open page is really fast, but once you need to close it to access another page, you incur a lot of latency: the current page has to be written back out, and a different page opened and latched into the buffer. The latency can be reduced significantly by adding more pages, since it is similar to adding SRAM cache.
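A toy model of that behaviour (made-up timings, heavily simplified bookkeeping) shows how giving the memory more latched pages pulls the average latency down once several access streams start interleaving:

```c
/* Toy model of the row-buffer behaviour described above (my own sketch,
 * made-up timings): several interleaved access streams, each making a
 * handful of accesses within a row before moving on. The open rows are
 * tracked as a small set with round-robin eviction, which is a big
 * simplification of real DRAM bank behaviour. */
#include <stdio.h>
#include <stdlib.h>

#define STREAMS   4        /* interleaved access streams, e.g. cores */
#define PER_ROW   8        /* accesses within a row before moving on */
#define ACCESSES  1000000
#define T_HIT     15       /* ns, column access to an already-open row */
#define T_MISS    45       /* ns, precharge + activate + column access */

static double avg_ns(int buffers)
{
    int open_row[64];      /* rows currently latched ("open pages") */
    int victim = 0;
    for (int b = 0; b < buffers; b++) open_row[b] = -1;

    int row[STREAMS], left[STREAMS];
    for (int s = 0; s < STREAMS; s++) { row[s] = rand(); left[s] = PER_ROW; }

    long total = 0;
    for (int i = 0; i < ACCESSES; i++) {
        int s = rand() % STREAMS;

        int hit = 0;
        for (int b = 0; b < buffers; b++)
            if (open_row[b] == row[s]) { hit = 1; break; }

        if (hit) {
            total += T_HIT;
        } else {                         /* close an old row, open this one */
            total += T_MISS;
            open_row[victim] = row[s];
            victim = (victim + 1) % buffers;
        }

        if (--left[s] == 0) {            /* this stream moves to a new row */
            row[s] = rand();
            left[s] = PER_ROW;
        }
    }
    return (double)total / ACCESSES;
}

int main(void)
{
    int configs[] = { 1, 2, 4, 8 };
    for (int i = 0; i < 4; i++)
        printf("%d open row(s): ~%.1f ns average per access\n",
               configs[i], avg_ns(configs[i]));
    return 0;
}
```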
Stacked memory on an interposer looks like a good opportunity to build a fast cache chip. The different chips in the stack can be made on processes optimized for them: the DRAM can be made on a process optimized for DRAM, the SRAM on one optimized for SRAM, and so could the interface die. It would be good to have a large, off-die L4 cache; they could use less L3 then. Large last-level caches take a lot of die space and probably reduce yields significantly, but with a separate cache chip they could just make a smaller die that achieves better yields. Even a normal HBM stack acting as a cache could be a big win. The access latency wouldn't be as low as SRAM, but it could fill cache lines very fast with the huge amount of bandwidth. I believe HBM2 has a larger number of banks or virtual channels (it has been a while since I read about it), so it should have reduced latency compared to HBM1. I don't know where the product talked about in this article fits, but HBM-based server chips seem to be taking a while, so they may have an opportunity.
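For rough intuition on that bandwidth-versus-latency trade-off, here is a back-of-envelope calculation with assumed figures rather than vendor specs; a single miss is dominated by access latency, but streaming a burst of lines amortizes it across the wide interface:

```c
/* Back-of-envelope numbers for the HBM-as-cache idea above. The
 * bandwidth and latency figures are rough assumptions, not vendor
 * specs: a single miss is dominated by access latency, but the wide
 * interface refills many lines quickly once a burst is streaming. */
#include <stdio.h>

int main(void)
{
    const double bw_gb_s = 256.0;  /* assumed per-stack bandwidth; GB/s == bytes/ns */
    const double lat_ns  = 100.0;  /* assumed random access latency */
    const double line    = 64.0;   /* cache line size in bytes */

    double xfer_ns = line / bw_gb_s;
    printf("64 B line transfer:            ~%.2f ns\n", xfer_ns);
    printf("one miss (latency + transfer): ~%.1f ns\n", lat_ns + xfer_ns);
    printf("8 lines streamed back-to-back: ~%.1f ns\n", lat_ns + 8 * xfer_ns);
    return 0;
}
```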
That sounds very much like the L4 eDRAM cache on some Broadwell and Skylake CPUs.
How about replace DRAM with
How about replacing DRAM with this?
Is it possible?
In theory, sure…
In practice, cost restrictions matter.