Sandy Bridge Architecture Overview
Intel’s newest processor architecture is here, codenamed Sandy Bridge. The first mainstream monolithic CPU/GPU part on the market, the SNB processor lineup impressed us in our testing. We review the Core i7-2600K, Core i5-2500K and 2400 as well as the dual-core Core i3-2100. There are a lot of questions answered: is the new processor graphics going to kill cheap discrete cards? Is performance better than Lynnfield? Is the media transcoding technology worthy of the hype?
If you are looking for the Core i7-2820QM mobile review, you can find it here.
The very first mention of the term “Sandy Bridge” on PC Perspective came in a news post that went up on May 6th, 2009. Little was known about the new processor architecture at the time, but 18 months later the secrets are all finally revealed as we bring you the introduction of the 2nd generation Intel Core processor family. In total Intel is dropping no fewer than 29 CPUs covering both desktop and mobile segments, 10 chipsets and 4 wireless connectivity options; a massive rollout of products that we can only really start to dissect in this single review.
Intel is already tracking more than 500 design wins and we expect to see just about all of them this week at CES 2011 in Las Vegas. For now though, let’s look at the processors themselves, the new architectural changes, performance, overclocking and anything else that is going to affect PC builders and enthusiasts.
What makes Sandy Bridge tick (or is it tock)?
At the heart of the design is a microarchitecture built on Intel’s current 32nm process technology, which allows for both high performance and low power parts. The primary processing cores of Sandy Bridge include improvements to increase performance and efficiency, starting with a decoded Uop cache that lets the traditional decode front end power down unless needed. Not only does this reduce power consumption, and thus extend battery life on mobile form factors, but it also decreases front end latency and sustains higher Uop bandwidth. The branch predictor also sees improvement in both overall performance and power efficiency.
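To make the decoded Uop cache idea concrete, here is a minimal sketch of a cache that sits in front of the decoder. It is a toy model, not Intel's design: the `UopCache` class, its LRU replacement policy, the capacity of 64 entries and the 16-instruction loop are all assumptions chosen for illustration. The point is only that a hot loop is decoded once and then served from the cache, so the decoder stays idle on every later iteration.

```python
from collections import OrderedDict


class UopCache:
    """Toy LRU cache mapping an instruction address to its decoded micro-ops.

    Hypothetical model for illustration; the real structure is organized
    differently and its details are not described in this article.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0      # fetches served without waking the decoder
        self.decodes = 0   # fetches that needed the full decode front end

    def fetch(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return self.lines[address]
        # Miss: the decoder has to run, then the result is cached.
        self.decodes += 1
        uops = ("uop", address)  # placeholder for real decoder output
        self.lines[address] = uops
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return uops


def run_loop(cache, loop_body, iterations):
    """Fetch every address in loop_body, repeated for the given iterations."""
    for _ in range(iterations):
        for address in loop_body:
            cache.fetch(address)
    return cache.hits, cache.decodes


hits, decodes = run_loop(UopCache(capacity=64), loop_body=range(16), iterations=100)
print(f"cache hits: {hits}, decoder activations: {decodes}")
```

In this sketch the decoder is only active for the first pass through the loop; every later fetch is a hit, which is the situation in which the front end could be powered down.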
To improve instruction level parallelism, Sandy Bridge uses a new Physical Register File instead of a centralized Retirement Register File. The physical register file keeps a single copy of every piece of data and doesn’t require movement or shuffling after calculation. This allows the processor to increase its buffer sizes by about 33% and is a key enabler of the Intel Advanced Vector Extensions instruction set.
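The "single copy, no shuffling" property can be sketched with a toy rename table. This is a simplified illustration under stated assumptions, not the actual hardware: the `PhysicalRegisterFile` class and its method names are hypothetical, and the old physical register is freed immediately on overwrite, whereas a real core would wait until the overwriting instruction retires.

```python
class PhysicalRegisterFile:
    """Toy physical register file: values are written once and never copied.

    Architectural registers are just names that point at a physical slot.
    """

    def __init__(self, num_physical):
        self.values = [None] * num_physical
        self.free = list(range(num_physical))  # unused physical slots
        self.rename_map = {}                   # architectural name -> slot

    def write(self, arch_reg, value):
        """Store a result in a fresh slot and repoint the architectural name."""
        phys = self.free.pop(0)
        self.values[phys] = value
        old = self.rename_map.get(arch_reg)
        self.rename_map[arch_reg] = phys       # only a pointer changes
        if old is not None:
            # Simplification: real hardware frees this slot at retirement.
            self.free.append(old)
        return phys

    def read(self, arch_reg):
        return self.values[self.rename_map[arch_reg]]


prf = PhysicalRegisterFile(num_physical=4)
prf.write("rax", 1)
prf.write("rbx", 2)
prf.write("rax", 3)  # new slot for rax; the data for rbx never moves
print(prf.read("rax"), prf.read("rbx"), prf.rename_map)
```

Because a wide 256-bit result stays where it was first written, only small slot indices move through the pipeline, which is one way to read the claim that this design helps enable AVX.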
Intel AVX extends the SSE floating point instruction set to 256-bit operand sizes and includes new operations to enhance the vectorization of data. This allows the 2nd Generation Intel Core processor family to potentially double floating point operations per second without an increase in power consumption and in a non-destructive way to previous instruction set implementations. Intel AVX will allow for a simultaneous 256-bit multiply, add and load per clock, a great improvement for applications that take advantage of it.
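The "potentially double" figure follows from simple lane arithmetic, sketched below. The function names are made up for this example, and the comparison assumes the 128-bit SSE baseline can also issue one multiply and one add per clock; these are theoretical peaks per core per cycle, not measured results.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits, element_bits, ops_per_cycle=2):
    """Peak floating point operations per clock for one core.

    ops_per_cycle=2 models one vector multiply plus one vector add
    issued in the same cycle (an assumption for both SSE and AVX here).
    """
    return lanes(vector_bits, element_bits) * ops_per_cycle


sse_single = peak_flops_per_cycle(128, 32)   # 4 lanes x 2 ops
avx_single = peak_flops_per_cycle(256, 32)   # 8 lanes x 2 ops
avx_double = peak_flops_per_cycle(256, 64)   # 4 lanes x 2 ops

print(f"SSE single precision: {sse_single} FLOPs/cycle")
print(f"AVX single precision: {avx_single} FLOPs/cycle")
print(f"AVX double precision: {avx_double} FLOPs/cycle")
print(f"AVX vs SSE speedup at the same clock: {avx_single / sse_single}x")
```

The doubling only appears when code is actually vectorized to the wider registers, which is why the text qualifies it with "potentially".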
A new memory cluster in the Sandy Bridge architecture services three data accesses (two read requests and one store request) per cycle compared to the two per cycle of the previous generation design. This memory cluster is one of the highest performing features required to keep the Intel AVX instructions fed with data for processing.
Processor graphics addition
The introduction of processor graphics to the Sandy Bridge architecture brought a focus on energy efficient performance and media capabilities. No longer a multi-chip package as we saw in the Nehalem-based CPUs, the 2nd generation Intel Core processor family fully integrates the GPU processing capabilities on a single die with the core microarchitecture.
By having a unified graphics-CPU power management system, Sandy Bridge can make the best decisions as to the power budget allocation across the entire chip. The graphics portion of the processor has also been upgraded with CPU-class power management techniques, and with independent graphics and CPU power control, overall power delivery can follow workload demands.
The new Sandy Bridge processor graphics balances a combination of fixed function hardware with compute engines for optimal energy efficiency. At each instance in the 3D graphics pipeline where a fixed function unit has traditionally been assumed there is an explicit fixed function block to handle it. This allows for lower latency operations, the best overall throughput per watt and a simpler driver programming model.
For the programmable portions of the pipeline, a new execution unit was built that increases register file size, improves parallel branch prediction to combat deeply nested conditionals and adds a new transcendental math capability that improves performance in those instances by as much as 20x. The result is an execution unit that is about twice as fast as the previous generation, and this is good news not only for gaming and graphics, but multimedia as well.
The driver overhead has also been minimized in this generation by removing the orthogonal states that were replaced by fixed function blocks. This means that the driver has significantly less active run time than in previous generations, which reduces the CPU load so that the power management system can redirect power to the processor graphics frequency.
Because of the shared die, the processor graphics on Sandy Bridge is built on a leading edge 32nm process technology and uses a shared LLC (last level cache) with configurable partitions. This gives the processor graphics higher available bandwidth and lower latency than previous designs and dramatically reduces the DRAM accesses that can slow down processing.
Native parallel computing sees improvements with the Sandy Bridge design including support for infinite nesting of branches, single instruction predication evaluation and a scalar program view that hides parallelism in hardware. Accelerators have been integrated that improve performance on scatter-gather, barriers and atomic operations improving parallel computing performance yet again.
In terms of media processing, the processor graphics solution integrated on Sandy Bridge is a dramatic improvement over previous generations. It provides a combination of both programmable and fixed architecture choices including execution units that are optimized for media workloads. There is native support of many popular mainstream codecs and the parallel engines provide enough bandwidth for high throughput video rendering. Dedicated hardware accelerators will offer extra computing power for HD workloads as well as high quality enhancement and filters.
In terms of a programmable media pipeline the Sandy Bridge processor graphics includes low power integer operations with native support for byte, word and dword instructions in addition to efficient vector/matrix ISAs in cooperation with the hardware accelerators. Intel also includes language support for explicit parallel programming providing better express-ability for programmers than implicit coding. The architecture also supports mixed-kernel programming and thread-to-thread communication and synchronization to allow for a wider range of applicable algorithms.
The fixed function media accelerators in the processor graphics technology include a number of dedicated video processing units to perform functions like high-quality video scaling, denoise filtering, deinterlacing and detail/edge enhancement filtering. Color processing is also performed in these units and integrates unique features like skin tone enhancement, adaptive contrast and more.
The video unit is a multi-format codec (MFX) that uses the parallel engines as well as a full hardware decode unit for MPEG2, VC1 and AVC formats. This high performance method allows for smoother playback of high bit-rate video while sustaining battery life for mobile form factors. Because of the flexibility of the units included in this engine the MFX can also perform a very fast AVC encode that reuses most of the playback features for a high performance, low power operation. This means that user transcode acceleration will be possible for faster and easier media manipulation.
Power efficiency is very important for graphics technology and media playback, and with Sandy Bridge based processors the amount of power required for HD video playback is cut in half. With the implementation of a hurry-up-and-get-idle power saving model and the high throughput parallel decoder, these power savings can actually allow for additional headroom for processor core Turbo Boost technology.
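The reasoning behind a hurry-up-and-get-idle model is easiest to see as an energy calculation over a fixed time window. The sketch below uses made-up wattages and timings purely for illustration; none of the numbers come from Intel or from our testing.

```python
def energy_joules(active_watts, idle_watts, active_seconds, window_seconds):
    """Energy used in one window: a busy phase followed by an idle phase."""
    idle_seconds = window_seconds - active_seconds
    return active_watts * active_seconds + idle_watts * idle_seconds


# Hypothetical decode of one second of video, measured over a 1 s window.
# Slow and steady: a lower power decoder that is busy the whole time.
steady = energy_joules(active_watts=4.0, idle_watts=0.5,
                       active_seconds=1.0, window_seconds=1.0)

# Hurry up and get idle: higher power, but done in a fifth of the time,
# then the hardware drops into a low power idle state.
hurry = energy_joules(active_watts=10.0, idle_watts=0.5,
                      active_seconds=0.2, window_seconds=1.0)

print(f"steady: {steady:.2f} J, hurry-up-and-get-idle: {hurry:.2f} J")
```

With these assumed numbers the faster decoder draws more power while active but uses less energy overall, and the savings depend heavily on how low the idle power really is.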
System agent and new ring bus architecture
The system agent of the Sandy Bridge architecture includes the functionality of the “uncore” in the Nehalem design but also adds and improves on many aspects of it. This portion of the processor is responsible for the memory controller, PCI Express integration, power management, the last level cache and the completely new ring bus interconnect.
The new ring-based interconnect is used for communication between the processor cores, processor graphics, last level cache and system agent domain. It is composed of four rings: a 32-byte data ring, a request ring, an acknowledge ring and a snoop ring. The ring bus is fully pipelined and runs with relation to the core frequency and voltage meaning that bandwidth can scale as the number of cores does.
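Since the data ring is 32 bytes wide and clocked with the cores, peak bandwidth per ring stop is just width times frequency. The sketch below works that out; the 3.4 GHz clock (the Core i7-2600K base frequency) and the choice of four stops are example inputs, and the results are theoretical peaks rather than measured numbers.

```python
def ring_stop_bandwidth_gb_s(bytes_per_cycle, clock_ghz):
    """Peak bandwidth of one ring stop in GB/s (bytes per cycle x GHz)."""
    return bytes_per_cycle * clock_ghz


def aggregate_bandwidth_gb_s(bytes_per_cycle, clock_ghz, stops):
    """Peak bandwidth if every stop transfers data on every cycle."""
    return ring_stop_bandwidth_gb_s(bytes_per_cycle, clock_ghz) * stops


per_stop = ring_stop_bandwidth_gb_s(bytes_per_cycle=32, clock_ghz=3.4)
four_stops = aggregate_bandwidth_gb_s(bytes_per_cycle=32, clock_ghz=3.4, stops=4)

print(f"per stop: {per_stop:.1f} GB/s, four stops: {four_stops:.1f} GB/s")
```

This is the sense in which bandwidth scales with core count: each added core brings its own ring stop, and a higher core clock raises the figure for every stop at once.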
There are several key benefits to using a ring bus architecture for the Sandy Bridge processors, starting with the fact that it takes up little area on the die; the massive wire routing runs over the last level cache (LLC). The ring is capable of always picking the shortest path between two entry points to minimize latency between communications, and the distributed arbitration with the ring protocol handles all the coherency, ordering and interfaces. Also, because the number of stops on the ring can scale up, server processors with large core counts will be able to utilize the ring bus for communications as well.
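Picking the shortest path on a ring comes down to comparing the distance in each direction. A minimal sketch follows; the function name and the six-stop example are assumptions for illustration, and the real arbitration logic is of course more involved than this.

```python
def ring_hops(src, dst, stops):
    """Fewest hops between two stops on a bidirectional ring."""
    clockwise = (dst - src) % stops
    counter_clockwise = (src - dst) % stops
    return min(clockwise, counter_clockwise)


# Hypothetical ring with six stops, numbered 0 through 5.
stops = 6
for dst in range(stops):
    print(f"stop 0 -> stop {dst}: {ring_hops(0, dst, stops)} hop(s)")
```

On a bidirectional ring the worst case is half the number of stops, so latency grows slowly as more stops are added for larger core counts.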
The cache box for the last level cache is the interface between the core/graphics/media and the ring as well as between the cache controller and the ring. This block is what implements the ring logic, arbitration and cache controller and communicates with the system agent on cache misses, external snoops and other accesses. There is a full cache pipeline in each box as well that maintains coherency and ordering for the addresses that are mapped to it.
The last level cache is shared among the processor cores, graphics and media blocks though the graphics driver software will control which streams are cached and coherent as it is the most bandwidth hungry application. Any agent on the ring bus has the ability to access any and all data in the LLC independent of who actually allocated the line to begin with. Multiple coherency domains in the cache include the IA domain for processing cores, the graphic domain that acts as a virtual cache and the non-coherent domain that is used for display output.
The system agent contains the PCI Express, DMI, memory controller and display engine functionality while also integrating the power control unit that has been upgraded on the Sandy Bridge processors. This microcontroller handles all of the power management and reset functions on the chip and is responsible for the new Next-Generation Turbo Boost technology. The system agent is able to manipulate the various power planes independently for better control: one for itself, one for the cores, last level cache and ring bus, and another for the processor graphics.
Turbo Boost was one of the defining features of the Nehalem-based previous generation Intel Core processors, and Intel has tweaked it to allow for more performance gains with Sandy Bridge. The previous generation's Turbo Boost technology used a classic model of thermal resistance that assumes temperature rises instantly when clock frequencies and voltage increase due to processing loads. The newer, updated model takes thermal capacitance into account and matches a more realistic response to power changes.
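The difference between the two models can be sketched with a first-order thermal RC circuit. This is a textbook approximation, not Intel's actual model, and the resistance, capacitance, power and starting temperature below are hypothetical values picked to make the effect visible.

```python
import math


def resistance_only_temp(start_c, power_w, r_thermal):
    """Classic model: temperature jumps straight to its steady-state value."""
    return start_c + power_w * r_thermal


def temp_with_capacitance(start_c, power_w, r_thermal, c_thermal, seconds):
    """First-order RC model: temperature approaches steady state over time.

    r_thermal is in degrees C per watt, c_thermal in joules per degree C.
    Assumes the die starts at start_c when the load is applied.
    """
    tau = r_thermal * c_thermal  # thermal time constant in seconds
    rise = power_w * r_thermal * (1.0 - math.exp(-seconds / tau))
    return start_c + rise


# Hypothetical values: 0.5 C/W, 20 J/C (tau = 10 s), 120 W burst from 40 C.
START_C, POWER_W, R_TH, C_TH = 40.0, 120.0, 0.5, 20.0

print(f"resistance only: {resistance_only_temp(START_C, POWER_W, R_TH):.1f} C")
for t in (1, 5, 10, 30):
    temp = temp_with_capacitance(START_C, POWER_W, R_TH, C_TH, t)
    print(f"with capacitance after {t:>2} s: {temp:.1f} C")
```

In the capacitance model the die spends several seconds well below the temperature the resistance-only model predicts, and that gap is the headroom a short burst of higher clocks can use.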
Essentially, Intel engineers have decided that they can push the clock frequencies in Turbo Boost even higher for short periods of time while the processor die heats up. Before the die reaches its thermal limit, the chip lowers the clock speeds to reduced Turbo modes or the standard clock rates to stay within the sustained power limit (TDP). This means that in instances where users experience a heavy computing workload, such as starting up a new application, the clock frequencies can run at a higher level than was previously possible for a short period of time, thus increasing overall responsiveness.
During idle periods, the system can accumulate additional energy budget and then deliver higher power and performance for a few seconds once again. The benefits of this power budgeting are not limited to the processing cores; the budget can also shift between the cores and the processor graphics. That means that for short time spans the performance of either portion of the Sandy Bridge architecture will be able to exceed what it could sustain at steady state.
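The energy budget idea can be sketched as a simple bank account: running below the sustained power limit earns credit, and running above it spends credit. The code below is a toy model under loud assumptions. The function name, the 120 W turbo power, the 50 J cap and the zero-watt idle are all hypothetical, and the power control unit's actual algorithm is not described in this article. Only the 95 W figure matches a real part (the Core i7-2600K TDP).

```python
def simulate_turbo(demand, tdp_w, turbo_w, budget_cap_j, dt=1.0):
    """Return the power drawn at each time step under a toy energy budget.

    demand is a list of booleans (True = a heavy workload is running).
    Credit accrues whenever power is below tdp_w and is capped at
    budget_cap_j; turbo is allowed only if the credit covers a full step.
    """
    budget = 0.0
    trace = []
    for busy in demand:
        if not busy:
            power = 0.0                      # idealized idle
        elif budget >= (turbo_w - tdp_w) * dt:
            power = turbo_w                  # enough credit to exceed TDP
        else:
            power = tdp_w                    # fall back to sustained power
        # Earn credit below TDP, spend it above, and clamp to [0, cap].
        budget = min(budget_cap_j, max(0.0, budget + (tdp_w - power) * dt))
        trace.append(power)
    return trace


# Two idle seconds followed by four busy seconds.
trace = simulate_turbo(demand=[False, False, True, True, True, True],
                       tdp_w=95.0, turbo_w=120.0, budget_cap_j=50.0)
print(trace)
```

In this run the idle time fills the budget, the first two busy seconds run above TDP, and the chip then settles back to its sustained level, which mirrors the burst-then-settle behavior described above.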