Nehalem Architecture (cont’d)
New cache structure, new L3 cache
The Intel Smart Cache makes a return with the Nehalem core, but this time in a three-level cache hierarchy. The first-level caches are a 32 KB instruction cache and a 32 KB data cache, and the L2 cache is a completely new design compared to the Core 2 CPUs out today. Each core receives 256 KB of unified, 8-way associative L2 cache that is both low latency (about 10 cycles load-to-use) and scales well, keeping extra load off the L3 cache.
The L3 cache layer is completely new to Intel, though AMD’s Barcelona chip introduced a similar design late in 2007. This L3 is an inclusive cache that scales with the number of cores on the processor – quad-core processors will have as much as 8 MB of 16-way associative L3. Any perceived latency on the L3 will depend on the frequency ratio between the core and uncore sections of the CPU – something we haven’t gotten enough information on yet.
Bring out yer’ dead! (front-side bus)
One of the features that Intel HAS been talking about for a while is the move away from the front-side bus architecture to something called the Intel QuickPath Interconnect (QPI). Previously known only as CSI, the Common System Interface, QuickPath is Intel’s answer to AMD’s HyperTransport technology and it performs a very similar function.
Starting with Nehalem and moving forward, Intel’s processors will feature a direct connect architecture that is point-to-point and will transmit data from socket to socket as well as from the CPU to the chipset, all while scaling nicely as the number of CPUs and QPI links goes up. Part of the reason QPI was needed on Nehalem is the new integrated memory controller on the processor. As AMD showed many years ago, an IMC allows for higher peak memory bandwidth and lower memory latency, though Intel is taking it another step up by offering a three-channel DDR3 memory controller on each CPU. QPI is also a requirement for efficient chip-to-chip communication, where one CPU might need to access data stored in memory attached to the other processor’s memory controller.
The QPI design supports 6.4 gigatransfers per second, or 12.8 GB/s of bandwidth in each direction, for 25.6 GB/s of total bandwidth between two points. Future versions of QPI will scale up to faster speeds as well. You can also tell in the above four-CPU diagram that QPI will scale well with as many as four CPUs – each processor in this case would require four total QPI connections and would be only one hop from any other CPU’s memory.
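For anyone who wants to check the math on those QPI numbers, here is a quick back-of-the-envelope sketch. It assumes each transfer carries 2 bytes (16 bits) of data per direction, which is the width that makes 6.4 GT/s work out to 12.8 GB/s. The `links_per_cpu` helper is a hypothetical illustration that assumes a fully connected layout with one extra link per CPU going to the chipset.

```python
# Back-of-the-envelope QPI bandwidth, using the figures quoted in the article.
# Assumption: each transfer carries 2 bytes (16 bits) of data per direction.

TRANSFERS_PER_SECOND = 6.4e9   # 6.4 GT/s, from the article
BYTES_PER_TRANSFER = 2         # assumed data width per direction


def qpi_bandwidth_gb_s(transfers_per_second, bytes_per_transfer):
    """Return (per-direction, both-directions) bandwidth in GB/s."""
    per_direction = transfers_per_second * bytes_per_transfer / 1e9
    return per_direction, 2 * per_direction


def links_per_cpu(sockets, chipset_links=1):
    """Links each CPU needs to reach every other socket in one hop.

    Hypothetical helper: assumes a fully connected layout plus
    `chipset_links` extra links per CPU for the chipset.
    """
    return (sockets - 1) + chipset_links


per_dir, total = qpi_bandwidth_gb_s(TRANSFERS_PER_SECOND, BYTES_PER_TRANSFER)
print(f"{per_dir:.1f} GB/s per direction, {total:.1f} GB/s total")
print(f"{links_per_cpu(4)} QPI links per CPU in a four-socket system")
```

Under those assumptions the four-socket case comes out to three links to the other CPUs plus one to the chipset, which lines up with the four connections per processor mentioned above.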
An Integrated Memory Controller, with three channels!
The Intel Nehalem Integrated Memory Controller (IMC) is actually pretty scalable in its own right. Besides offering extremely high bandwidth and low latency, the number of memory channels can be varied, both buffered and unbuffered memory are supported, and memory speeds can be adjusted, all based on the market the processor is targeted at. Low-cost cores with only dual-channel memory should cost considerably less than top-end three-channel systems.
At launch, the DDR3 memory controller on Nehalem will only OFFICIALLY support DDR3-1066 memory speeds. While that is pretty lame, I was told on numerous occasions that the memory controller will run at speeds of DDR3-1600 to DDR3-2000, but official support stops at the JEDEC specification. The IMC in Nehalem will also force Intel to adopt a NUMA (non-uniform memory access) design, since memory will be attached to different places (not just the north bridge) for the first time in Intel’s desktop processors.
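To put those memory speeds in perspective, here is a rough sketch of theoretical peak bandwidth for two and three channels. It assumes each DDR3 channel is 64 bits (8 bytes) wide and treats the DDR3 rating as millions of transfers per second; these are paper numbers, not what a real system will sustain.

```python
# Theoretical peak DRAM bandwidth: transfers/s x bytes per transfer x channels.
# Assumption: each DDR3 channel is 64 bits (8 bytes) wide.

BYTES_PER_CHANNEL_TRANSFER = 8


def peak_bandwidth_gb_s(mega_transfers_per_second, channels):
    """Peak bandwidth in GB/s for a DDR3-<rating> setup with N channels."""
    bytes_per_second = (mega_transfers_per_second * 1e6
                        * BYTES_PER_CHANNEL_TRANSFER * channels)
    return bytes_per_second / 1e9


# Speeds mentioned in the article: the official DDR3-1066 and the
# unofficial DDR3-1600 to DDR3-2000 range.
for rating in (1066, 1600, 2000):
    for channels in (2, 3):
        gb_s = peak_bandwidth_gb_s(rating, channels)
        print(f"DDR3-{rating} x {channels} channels: {gb_s:.1f} GB/s peak")
```

Even at the official DDR3-1066 speed, the third channel adds half again as much peak bandwidth over a dual-channel setup.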
New Core Power Controls
The Nehalem core also has a new trick in its bag that enables it to lower the power consumption of a core to nearly 0 watts – something that wasn’t possible on previous designs. You can see in the image above what the total power consumption of a core was typically made up of with the Core 2 series of processors – clocks and logic are the majority of it, yes, but a third or more is related to transistor leakage, which couldn’t be turned off in prior designs.
How is this changed with Nehalem? Well, with the independent power controller in the PCU (power control unit) and the separate power plane that each core rests on, the power consumption of each core is completely independent from the others. You can see in this diagram that though Core 3 is loaded the entire time, both Core 2 and Core 0 are able to power down to practically 0 watts when their workload is complete.
Turbo Mode: free performance?
Perhaps the most interesting bit of news out of Intel’s Nehalem was something called Turbo Mode – a feature directly enabled by the PCU we discussed on the previous page. With modern processors, the debate has raged whether users are better off getting a quad-core CPU at a lower frequency or a dual-core CPU at a higher frequency. Intel is hoping that with Turbo Mode users will get the best of both worlds.
The idea is pretty straightforward: if four cores run at a combined power consumption (and heat dissipation) of X, and only two of them are loaded while the other two sit idle, you have additional power headroom to overclock the working cores to a higher frequency.
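Here is a toy sketch of that headroom idea. To be clear, this is not Intel’s algorithm: the function name, the 100 MHz step and the three-step cap are all made up for illustration, and the model assumes an idle core draws no power and that a core’s power grows linearly with its clock (voltage changes are ignored).

```python
# Toy model of the Turbo Mode idea: a fixed power budget shared by active cores.
# Every name and number here is hypothetical; this is not Intel's algorithm.

def turbo_frequency_mhz(base_mhz, total_cores, active_cores,
                        step_mhz=100, max_steps=3):
    """Pick the highest clock for the active cores that fits the budget.

    Simplifying assumptions: an idle core draws no power, and a core's
    power grows linearly with its clock, so "power" is tracked in MHz.
    """
    if not 1 <= active_cores <= total_cores:
        raise ValueError("active_cores must be between 1 and total_cores")

    budget = total_cores * base_mhz   # budget X: every core at base clock
    freq = base_mhz
    for _ in range(max_steps):        # cap how far the clock may rise
        candidate = freq + step_mhz
        if active_cores * candidate > budget:
            break                     # next step would exceed the budget
        freq = candidate
    return freq


for active in (4, 2, 1):
    mhz = turbo_frequency_mhz(3200, total_cores=4, active_cores=active)
    print(f"{active} active core(s): {mhz} MHz")
```

With all four cores loaded there is no headroom in this model, so the clock stays at base; with cores idle, the step cap is what ends up limiting the boost rather than the power budget.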
For enthusiasts and gamers this should be an exciting turn of events. While Intel wasn’t very specific at this point, I imagine we’ll see bumps of 200-300 MHz going from the full quad-core clock rate to a dual-core or single-core clock rate (based on how many cores are idle at the time). This means if you purchase a 3.2 GHz Core i7 Nehalem-based processor, you will likely see clock rates as high as 3.5 GHz when running single-threaded or just dual-threaded applications. Gamers should also take note of this!
Intel claims that with the power of the PCU inside the chip, the Nehalem core is aware of its surroundings and conditions. If your system is running very cool, say you have water cooling for example, the chip will recognize that it is well under its own TDP and push the clocks even faster. This is possible even while loading all four cores, as the above diagram shows. The on-board micro-controller tunes voltages based on a given frequency, operating conditions and specific silicon characteristics. In some ways it appears that the Nehalem core will be self-aware enough to find out how far it can be pushed without burning up.