Smart Memory Access and Smart Cache

Smart Memory Access

Now, before you start to ask: Intel did not include an integrated memory controller on the Core Architecture, for several reasons (they discussed those reasons with me, but I was asked to keep that information confidential).  However, the new memory access features in this design do a good job of hiding the additional latency that an external memory controller introduces compared to an integrated one.

[Slide: memory disambiguation example]

The first feature of the smarter memory access is memory disambiguation, which is responsible for reordering memory instructions for optimal performance.  While memory operations have traditionally been executed strictly in program order, there are times when a load that does not depend on other data in the current thread could complete much sooner if it didn't have to wait for the loads and stores ahead of it.  In the slide above, the Data X load was originally scheduled as the fourth command to be run.

Intel's architecture can determine whether or not this load depends on one of the preceding stores (which would change the data the Data X load receives).  If it does not, the architecture can move the load up to improve performance; if it does, the architecture has to leave it alone to prevent any data access errors.  When this works, it can improve out-of-order execution speed dramatically, and best of all, the results are transparent to the software and don't require any recompiling or code changes.
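
To make that concrete, here is a minimal C sketch (the function and variable names are my own, not from Intel's materials) of the situation disambiguation handles: the compiler cannot prove the store and the load touch different addresses, so a strictly in-order memory pipeline would stall the load behind the store, while the disambiguation logic can predict they don't alias and issue the load early.

```c
/* Hypothetical example: 'a' and 'b' could point into the same array,
 * so the addresses a+i and b+j might collide at run time.            */
int store_then_load(int *a, const int *b, int i, int j)
{
    a[i] = 42;     /* store to an address computed at run time        */
    return b[j];   /* load that usually does not alias the store;
                    * memory disambiguation lets the core execute it
                    * ahead of the store and replay it only if the
                    * two addresses turn out to be the same            */
}
```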

Prefetching is a technique used to predict what data an application is going to need before it asks for it, removing or at least lessening the time required to access the data once the application needs it.  The processor basically pulls data into the cache and hopes that the application will ask for it so it is on hand right away.  All modern processors have had this kind of logic in them for some time.
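
As a rough illustration of the idea (the hardware prefetchers need no hints at all, and this is not Intel-specific code), the GCC/Clang __builtin_prefetch built-in lets software do the same thing explicitly: request data a few iterations before the loop actually reads it.  The lookahead distance of 16 elements is an arbitrary choice for the sketch.

```c
#include <stddef.h>

/* Software sketch of what a hardware prefetcher does automatically:
 * start the memory request early so the data is already in cache by
 * the time the loop body needs it.                                   */
long sum_array(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 1);  /* read, low locality hint */
        sum += data[i];
    }
    return sum;
}
```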

[Slide: prefetcher locations in the Core Architecture]

Intel's Core Architecture has several prefetchers that handle the various types of data the CPU might need.  The 'sun' icons on the diagram above represent the prefetching units; there are three on each core and two shared by the L2 cache.  The shared prefetchers can be dynamically reallocated to whichever core needs them most (the one being utilized more heavily), so Intel is not wasting die area on logic that might sit idle much of the time.

Interestingly, Intel also told us that these prefetch units have virtual 'knobs' on them that can be adjusted during production to fine tune them for specific usage models.  A server chip might use noticeably different prefetch algorithms than the mobile variant, so Intel has the ability to tweak these settings later.  However, the engineer I talked to indicated that for the most part the prefetch algorithms will remain static so that the industry (compiler writers in particular) can standardize on them.
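
For a sense of what one of these algorithms might look like, below is a heavily simplified stride-detector sketch in C.  This is not Intel's implementation; the structure fields and the confidence threshold (the kind of 'knob' described above) are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a stride prefetcher for one access stream. */
struct stride_prefetcher {
    uint64_t last_addr;   /* last address seen from this stream          */
    int64_t  stride;      /* currently predicted distance between loads  */
    int      confidence;  /* how many times in a row the stride matched  */
    int      threshold;   /* a tunable 'knob': when to start prefetching */
};

/* Returns true and sets *prefetch_addr when the predictor is confident
 * enough to request the next address ahead of the program.             */
static bool observe_access(struct stride_prefetcher *p, uint64_t addr,
                           uint64_t *prefetch_addr)
{
    int64_t delta = (int64_t)(addr - p->last_addr);
    if (delta == p->stride) {
        p->confidence++;
    } else {
        p->stride = delta;     /* new pattern: start learning again */
        p->confidence = 0;
    }
    p->last_addr = addr;

    if (p->confidence >= p->threshold) {
        *prefetch_addr = addr + (uint64_t)p->stride;
        return true;
    }
    return false;
}
```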

Advanced Smart Cache

The L2 cache on the new Core Architecture is shared between the two cores, much like it was on the Core Duo architecture before it.  There have been various improvements to it that dramatically affect the overall memory subsystem.

[Slide: Advanced Smart Cache dynamic L2 allocation]

First, the cache is shared between the cores, but it is not statically allocated to either core in any fashion.  The amount of L2 cache controlled by either core can be adjusted dynamically when one core needs more of it than the other.  If only a single thread is being executed by the operating system, the primary core can take over more of the L2 cache and use it to reduce memory latency, avoiding the 'cache thrashing' that occurs when the cache is full and the CPU has to go out to main system memory.
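
A quick way to picture why that matters for a single thread is the sketch below, which walks a small working set that should fit in the L2 and a large one that cannot; the sizes are arbitrary illustrative values, not figures from Intel.  Timing the two calls (for example with clock_gettime) would show the latency penalty the dynamic allocation is trying to avoid.

```c
#include <stdio.h>
#include <stdlib.h>

/* Walk 'elems' integers in a pseudo-random order for 'steps' accesses. */
static long walk(volatile int *buf, size_t elems, size_t steps)
{
    long sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < steps; i++) {
        sum += buf[idx];
        idx = (idx * 1664525u + 1013904223u) % elems;  /* scattered accesses */
    }
    return sum;
}

int main(void)
{
    size_t small = (512 * 1024) / sizeof(int);        /* likely fits in L2      */
    size_t large = (16 * 1024 * 1024) / sizeof(int);  /* spills to system memory */
    int *buf = malloc(large * sizeof(int));
    if (!buf) return 1;
    for (size_t i = 0; i < large; i++) buf[i] = (int)i;

    /* The second call should be noticeably slower than the first. */
    printf("%ld %ld\n", walk(buf, small, 1 << 22), walk(buf, large, 1 << 22));
    free(buf);
    return 0;
}
```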

[Slide: cores sharing data through the shared L2 cache]

Also, with this shared cache, the two cores can very easily share data in multi-threaded applications.  They no longer have to go out onto the front-side bus as on older Intel architectures, or over a data crossbar as on AMD's Athlon architecture.
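
The kind of code that benefits is ordinary multi-threaded sharing, as in the pthread sketch below (names and the buffer size are illustrative): one thread writes a buffer and the other reads it, and on this design the consumer's reads can be serviced from the shared L2 the producer just populated rather than travelling over an external bus.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1024

static int buffer[N];
static atomic_int ready = 0;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        buffer[i] = i * i;                    /* fills cache lines on one core */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                                     /* wait until data is published  */
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += buffer[i];                     /* reads can hit the shared L2   */
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```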
