Nehalem Design Goals

Intel shared more on the upcoming Nehalem architecture last week at IDF, and we have detailed all the nuts and bolts for you here in an easy-to-digest manner. Come in and see why Nehalem, with its on-die memory controller, integrated power control logic and new Turbo Mode, will make whatever you have now obsolete.
Re-introduction to Nehalem

Last week was the final Intel Developer Forum before the release of the new Nehalem processor, and with that expo came the final pieces of the hardware puzzle.  The architecture has been explained and discussed in a general sense since early 2007, but last week we finally got the details on some key architectural improvements that help Nehalem stand out from both Intel’s and AMD’s current-generation lineups.

Prior to this piece, I had written not one, not two, but three rough-draft analyses of the Nehalem core. (Here, here and here.)  Some of what we cover in this article will repeat those details, but even I could use a refresher, given that the first briefings date back to early March 2007.

Intel’s Design Goals

Since introducing the “tick-tock” method of processor design several generations ago, Intel has really impressed me with its ability to lay out a roadmap years in advance and hit the dates and performance targets nearly dead on.  The “tock” of this design mentality is a new microarchitecture (like Merom) while the “tick” is an upgraded process technology (like the move from 65nm to 45nm with Penryn).  Nehalem will be the next “tock” on this scale, followed by a 32nm shrink called Westmere.

Inside the Nehalem: Intel's New Core i7 Microarchitecture - Processors 29

Intel has already named the next “tock” as well, Sandy Bridge, but is keeping mostly mum about the features and details of this chip until sometime next year.

The die shot of the new Nehalem processor shows, in this iteration, a four-core design with two separate QPI links and an L3 cache that is large in relation to the rest of the chip.  The primary goal of Nehalem was to take the big performance advantages of the Core 2 CPUs and modularize them.  With the Nehalem design, which will be branded as the Intel Core i7, Intel can easily create a range of processors from one to eight cores depending on the application and market demands.  Eight-core CPUs will be found in servers, while dual-core parts will reach the mobile market several months after the initial desktop introduction.  The number of QPI (QuickPath Interconnect) links can also vary in order to improve CPU-to-CPU communication.

The current Intel flagship CPU, the Core 2 Duo/Quad design, is still quite the performer.  It introduced a 4-wide execution engine and 128-bit wide SSE execution, and the 45nm Penryn revision added the SSE4.1 instructions.  Smart Cache and Smart Memory Access were marketing names given to better caching systems and protocols that improved performance marginally over the previous design.

At a high level, the Nehalem core adds some key features to the processor designs we currently have with Penryn.  SSE instructions get a bump to revision 4.2, branch prediction and pre-fetch algorithms are improved, and simultaneous multi-threading (SMT) makes a return after a brief hiatus since the NetBurst architecture.

Viewed purely as a block diagram, here is what Intel’s Nehalem architecture has to offer.  We will walk through most of these features and specifications on the following pages.

Nehalem Decode Engine

The first section of the Nehalem architecture includes the fetch and decode operations as well as the first layer of cache and is dubbed the “front-end” of the design.  This part of the processor is responsible for creating the micro-ops for the compute engine to crunch on while performing effective branch prediction.  New in the Nehalem design are updated macrofusion techniques and an improved loop stream detector.

Macrofusion is a technique introduced with the Core 2 design that combines specific pairs of instructions into a single micro-op for faster execution and better efficiency.  This was only possible in 32-bit mode before, but with Nehalem the benefit will apply to 64-bit systems as well.  The loop stream detector, while not new, has been improved so that it now also bypasses the instruction decode step.  This allows the detector to hold a loop of as many as 28 micro-ops and replay it without fetching and decoding those instructions again on every iteration.
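
To make those two front-end tricks a little more concrete, here is a rough toy model in Python. It is only a sketch under some loud assumptions: every instruction is treated as exactly one micro-op (real hardware is not that tidy), the lists of fusible compares and jumps are an illustrative subset rather than Intel's actual rules, and the function names decode and decodes_needed are made up for this example. The only number taken from Intel's material is the 28 micro-op limit.

```python
# Toy front-end model. Assumption: one instruction = one micro-op,
# except a compare directly followed by a conditional jump, which fuses.
COMPARES = {"cmp", "test"}               # illustrative subset, not Intel's full list
COND_JUMPS = {"je", "jne", "jl", "jge"}  # illustrative subset, not Intel's full list

LSD_CAPACITY = 28  # micro-op limit quoted for Nehalem's loop stream detector


def decode(instructions):
    """Turn a list of mnemonics into micro-ops, fusing compare + jump pairs."""
    uops = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op in COMPARES and nxt in COND_JUMPS:
            uops.append(f"{op}+{nxt}")  # two instructions, one fused micro-op
            i += 2
        else:
            uops.append(op)
            i += 1
    return uops


def decodes_needed(loop_body, iterations):
    """Count how many times the loop body passes through the decoders."""
    if len(decode(loop_body)) <= LSD_CAPACITY:
        return 1           # decoded once, then replayed from the buffer
    return iterations      # too big to buffer: decoded on every iteration


small_loop = ["mov", "add", "cmp", "jne"]
print(decode(small_loop))
print(decodes_needed(small_loop, 1000))
print(decodes_needed(["add"] * 40, 1000))
```

In this model the four-instruction loop shrinks to three micro-ops thanks to the fused compare and jump, and it fits in the buffer, so the decoders only see it once no matter how many iterations run. The 40-instruction loop blows past the 28 micro-op limit and has to be decoded every time around.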

The branch prediction unit has also seen some improvements; the examples Intel offered up include a second-level (L2) branch predictor for larger code sizes and a renamed return stack buffer.
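
For readers who have not run into a return stack buffer before, the basic idea can be sketched in a few lines of Python. This is a toy model, not Intel's design: the class name ReturnStackBuffer, the default depth of 16 and the checkpoint/restore methods are all assumptions for illustration. The checkpoint is simply a stand-in for whatever mechanism the hardware uses to keep wrong-path calls and returns from corrupting the buffer, which Intel did not detail.

```python
class ReturnStackBuffer:
    """Toy return-address predictor. The depth of 16 is an assumption."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        """Record where the matching return should land."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # buffer full: the oldest entry is lost
        self.stack.append(return_address)

    def predict_return(self):
        """Predict the target of a return, or None if the buffer is empty."""
        return self.stack.pop() if self.stack else None

    def checkpoint(self):
        """Snapshot the buffer before running down a speculative path."""
        return list(self.stack)

    def restore(self, snapshot):
        """Undo wrong-path calls and returns after a misprediction."""
        self.stack = list(snapshot)


rsb = ReturnStackBuffer()
rsb.on_call(0x100)
saved = rsb.checkpoint()   # a branch is predicted here
rsb.on_call(0x200)         # wrong-path call
rsb.predict_return()       # wrong-path return pops 0x200
rsb.predict_return()       # wrong-path return pops 0x100 as well
rsb.restore(saved)         # branch was mispredicted: roll back
print(hex(rsb.predict_return()))
```

Without the rollback, the two wrong-path returns would have emptied the buffer and the real return would have had no prediction at all. That kind of corruption is the problem a more robust return stack is meant to avoid.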
