Core Enhancements and Cache System
The execution engine of the processor is where the meat of the action takes place; its various stages are responsible for scheduling and executing the compute operations, and it is powered by nearly the same engine that resides in the Core 2 processors you have today.

The unified reservation station is in reality just a scheduler that matches jobs to execution units; it is “unified” because all operations, integer and floating point alike, flow through it.  Though a four-wide design, the processor can actually execute six operations per cycle: three memory operations (a load, a store address and a store data) alongside three computations.  Each of the six “ports” can perform one of several operations per cycle: the first can do an ALU operation or shift, an FP multiply, a divide or an SSE ALU/shuffle, but only one at a time.  The last port swaps the multiply option for branch handling.
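
To make the idea concrete, here is a minimal C sketch of the kind of instruction-level parallelism a scheduler like this feeds on; the per-port notes in the comments are illustrative assumptions about where such operations could issue, not a precise map of Nehalem's dispatch logic.

    /* Within each loop iteration the three statements below have no
     * data dependencies on one another, so an out-of-order scheduler
     * is free to issue them to different execution ports in the same
     * cycle.  Port assignments in the comments are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        double a = 1.5;
        int    x = 7, y = 9;
        long   sum = 0;

        for (int i = 0; i < 1000; i++) {
            a = a * 1.0001;  /* FP multiply     -> e.g. one port        */
            x = x + y;       /* integer ALU op  -> e.g. a second port   */
            sum += i;        /* independent ALU -> e.g. a third port    */
        }
        printf("%f %d %ld\n", a, x, sum);
        return 0;
    }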

I mentioned before that Intel is using Nehalem to mark the return of HyperThreading to its bag of weapons in the CPU battle; the technique is nearly identical to that of the older NetBurst processors and allows two threads to run on a single CPU core.  But SMT (simultaneous multi-threading), as HyperThreading is known generically, is also key to keeping the four-wide execution engine fed with work to complete.  With the larger caches and much higher memory bandwidth the chip provides, this is a very important addition.

Intel claims that HyperThreading is an extremely power-efficient way to increase performance – it takes up very little die area on Nehalem yet has the potential for great performance gains in certain applications.  This is obviously much more efficient than adding another core to the die, but just as obviously it cannot match the gains a full extra core would deliver.
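
From software's point of view the feature is simple: the operating system just sees two logical processors per physical core.  The snippet below is a minimal sketch assuming a Linux-like system, where sysconf() reports the online logical CPUs; the two-threads-per-core division is an assumption that only holds when HyperThreading is enabled.

    /* Sketch: with SMT enabled the OS sees twice as many logical
     * processors as physical cores.  Assumes a Linux-like system. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        /* On a quad-core Nehalem with HyperThreading enabled this
         * would report 8 logical processors (4 cores x 2 threads). */
        printf("Logical processors visible to the OS: %ld\n", logical);
        printf("Assuming 2 threads/core, physical cores: %ld\n", logical / 2);
        return 0;
    }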

Here you can see Intel’s estimations of how much HyperThreading can help performance in specific applications.  Surprisingly, one of the best performers is the 3DMark Vantage CPU test, which simulates AI and physics on the processor, while POV-Ray 3.7 still sees a huge 30% boost in performance for this relatively small addition in logic.

Nehalem’s Cache Structure

We have mentioned in previous articles that the cache system in Nehalem is getting a big revamp as well compared to the Core 2 parts of today.  This new memory system design (from the caches down to the DDR3 memory controller) was built to feed this incredibly powerful processor with enough low-latency, high-bandwidth data to help it scale with increased processor core counts.

A new term Intel is bringing to the world with this modular design is the “uncore” – basically all of the sections of the processor that are separate from the cores and their self-contained caches.  Features like the integrated memory controller, the QPI links and the shared L3 cache fall into the “uncore” category.  All of these components are completely modular; Intel can add cores, QPI links, integrated graphics (coming later in 2009) and even another IMC if it desired.

The Intel Smart Cache makes a return with the Nehalem core, but this time in a three-level cache hierarchy.  The two first-level caches are a 32 KB instruction cache and a 32 KB data cache, while the L2 cache is a completely new design compared to the Core 2 CPUs out today.  Each core receives 256 KB of unified, 8-way associative cache that is both low latency (about 10 cycles from load to use) and scales well to keep extra load off the L3 cache.
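
A rough way to make those capacities visible from software is to chase pointers through working sets of increasing size and watch the per-access time step up as each level overflows.  The following is a deliberately crude sketch under those assumptions – the sizes bracket Nehalem's published 32 KB / 256 KB / 8 MB levels, and a serious benchmark would randomize the chain order and use a finer-grained timer.

    /* Crude cache-level probe: time a dependent pointer chase over
     * working sets that fit in L1, L2, L3, and then nothing at all. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ITERS 10000000L

    int main(void)
    {
        size_t sizes[] = { 16 * 1024, 128 * 1024,
                           4 * 1024 * 1024, 32 * 1024 * 1024 };

        for (int s = 0; s < 4; s++) {
            size_t n = sizes[s] / sizeof(size_t);
            size_t *buf = malloc(n * sizeof(size_t));
            if (!buf) return 1;

            /* Cyclic chain stepping one 64-byte cache line at a time.
             * (A real test would shuffle the chain so the hardware
             * prefetcher cannot hide the miss latency.) */
            size_t stride = 64 / sizeof(size_t);
            for (size_t i = 0; i < n; i++)
                buf[i] = (i + stride) % n;

            size_t idx = 0;
            clock_t t0 = clock();
            for (long i = 0; i < ITERS; i++)
                idx = buf[idx];            /* dependent load chain */
            clock_t t1 = clock();

            printf("working set %8zu KB: %.2f ns/access (idx=%zu)\n",
                   sizes[s] / 1024,
                   (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / ITERS, idx);
            free(buf);
        }
        return 0;
    }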

The L3 cache layer is completely new to Intel, though AMD’s Barcelona chip introduced a similar design late in 2007.  This L3 is an inclusive cache that scales with the number of cores on the processor – quad-core processors will have as much as 8MB of 16-way associative L3.  Any perceived latency on the L3 will depend on the frequency ratio between the core and uncore sections of the CPU – something we haven’t gotten enough information on yet.

An inclusive cache means that all data residing in the L1 or L2 caches MUST also reside in the L3 cache.  This is done for better performance rather than pure capacity efficiency: if a line misses in the L3, it cannot be in any core’s private caches either, which reduces the “snooping” required for core-to-core memory checks.
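
To see why inclusion helps, here is a toy C model of the policy – every name in it is hypothetical, and it models only the logic, not Intel's actual hardware.  The payoff is in must_snoop_cores(): an L3 miss alone is enough to prove that no other core holds the line.

    /* Toy model of an inclusive last-level cache; purely illustrative. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CORES 4
    #define LINES     16              /* toy address space: 16 cache lines */

    static bool l3[LINES];            /* shared, inclusive L3              */
    static bool l2[NUM_CORES][LINES]; /* per-core private caches, folded   */

    /* Inclusion invariant: a line entering a private cache is also
     * installed in the L3 (and evicting it from L3 would remove it
     * from every private cache as well). */
    static void core_load(int core, int line)
    {
        l2[core][line] = true;
        l3[line] = true;
    }

    /* An L3 miss proves no core's private cache holds the line, so
     * no core-to-core probe is ever needed in that case. */
    static bool must_snoop_cores(int requester, int line)
    {
        if (!l3[line])
            return false;             /* inclusion rules everyone out */

        for (int c = 0; c < NUM_CORES; c++)
            if (c != requester && l2[c][line])
                return true;          /* another core may hold a copy */
        return false;
    }

    int main(void)
    {
        core_load(0, 5);              /* core 0 pulls line 5 into its caches */

        printf("line 5 (L3 hit):  snoop other cores? %d\n",
               must_snoop_cores(1, 5));
        printf("line 9 (L3 miss): snoop other cores? %d\n",
               must_snoop_cores(1, 9));
        return 0;
    }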
