Microarchitecture and Graphics System Improvements
While the Haswell design is based mainly on the architecture we saw introduced with Sandy Bridge, there are some changes that Intel made to improve performance in the more typical fashion with an eye towards IPC (instructions per clock).
There were no changes in the key pipelines of Haswell, but there were many areas that Intel said are "typical improvement points" for the company. The branch predictor has been improved, as this is usually the best return on time investment from a CPU-design standpoint. Intel also increased the buffers on the OOO (out-of-order) structures to improve the processor's ability to find parallelism and take advantage of it.
Throughput also sees a boost, with 8 total ports on the reservation station now that Intel has added another ALU, another branch unit, and another store address unit. This gives Haswell some improved metrics, like two branches per cycle and two floating point MADDs (multiply-adds) per cycle, both improvements over what we saw in Sandy Bridge and Ivy Bridge processors.
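To put "two floating point MADDs per cycle" into more familiar terms, here is a rough back-of-the-envelope sketch of peak FLOPs per core per cycle. It rests on assumptions that were not part of Intel's presentation as covered here: 256-bit AVX vectors, and a Sandy Bridge / Ivy Bridge baseline of one vector add plus one vector multiply issued per cycle. The function names are mine and purely illustrative.

```python
# Back-of-the-envelope peak FLOPs per core per cycle.
# Assumptions (not from the article): 256-bit AVX registers, and a
# Sandy Bridge / Ivy Bridge baseline of one vector add plus one vector
# multiply issued per cycle.

VECTOR_BITS = 256


def lanes(element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return VECTOR_BITS // element_bits


def peak_flops_per_cycle(element_bits: int, ops_per_cycle: int, flops_per_op: int) -> int:
    """Lanes x vector operations per cycle x FLOPs each operation performs per lane."""
    return lanes(element_bits) * ops_per_cycle * flops_per_op


def summary() -> dict:
    """Map precision name to (baseline, haswell) peak FLOPs per cycle."""
    result = {}
    for name, bits in (("single", 32), ("double", 64)):
        # Baseline: 1 add + 1 multiply per cycle, each worth 1 FLOP per lane.
        baseline = peak_flops_per_cycle(bits, ops_per_cycle=2, flops_per_op=1)
        # Haswell: 2 multiply-adds per cycle, each worth 2 FLOPs per lane.
        haswell = peak_flops_per_cycle(bits, ops_per_cycle=2, flops_per_op=2)
        result[name] = (baseline, haswell)
    return result


if __name__ == "__main__":
    for precision, (old, new) in summary().items():
        print(f"{precision}: {old} -> {new} FLOPs/cycle ({new // old}x)")
```

Under those assumptions the math works out to a clean 2x for both precisions, because each multiply-add counts as two floating point operations per lane. These are theoretical peaks only; real code rarely keeps both units busy every cycle.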
New compute instructions expand on AVX, doubling both single precision and double precision FLOPs per core per cycle. Other new instructions accelerate very specific algorithms, with updates for bit extracts and deposits, bit manipulation, rotates, etc.
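For readers wondering what "extract and deposit" actually means, the sketch below models the idea in plain software: an extract gathers the bits selected by a mask into the low end of the result, and a deposit scatters low bits back out to the masked positions. This is my own illustrative model, assuming the new instructions behave like a parallel bit extract and deposit. The function names are hypothetical, and the hardware would do in one instruction what this loop does one bit at a time.

```python
def bit_extract(value: int, mask: int) -> int:
    """Gather the bits of `value` selected by `mask` into the low bits of the result.

    Both arguments are assumed to be non-negative integers.
    """
    result = 0
    out_pos = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1           # clear that bit and move to the next one
    return result


def bit_deposit(value: int, mask: int) -> int:
    """Scatter the low bits of `value` to the bit positions selected by `mask`.

    Both arguments are assumed to be non-negative integers.
    """
    result = 0
    in_pos = 0
    while mask:
        lowest = mask & -mask      # next destination position
        if (value >> in_pos) & 1:
            result |= lowest
        in_pos += 1
        mask &= mask - 1
    return result
```

As a quick example, bit_extract(0xABCD, 0x0F0F) packs the two selected nibbles together and yields 0xBD, while bit_deposit(0xBD, 0x0F0F) spreads them back out to 0x0B0D. That kind of packing and unpacking shows up in compression, cryptography, and bitboard-style code, which is presumably why Intel calls these out as accelerating very specific algorithms.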
The cache implementation also sees interesting changes with Haswell, including a doubling of the L1 bandwidth to 32 bytes wide and one L2 cache read every cycle. Seeing both L1 and L2 cache bandwidths double in a single generation without changing the organization and size of those structures is impressive, though it needs more explanation as well.
Another big upcoming change is the introduction of Transactional Synchronization Extensions (TSX). TSX is a method to improve concurrency and multi-threaded scaling with as little work for the programmer as possible. By using these new ISA extensions, a developer can apply simple prefixes and suffixes to code blocks to indicate that they are independent and can be run in parallel. The hardware then manages the transactional updates and restarts execution if a block isn't able to complete.
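The "try it, and restart if it didn't work" behavior described above can be hard to picture, so here is a small software analogy. To be clear, this is not TSX and does not use any of the new instructions: it is a sketch of optimistic execution with abort and retry, using a version counter where the hardware would track conflicts itself. The class and function names are hypothetical, and the lock inside try_commit merely stands in for the hardware's atomic commit.

```python
import threading


class VersionedCell:
    """A shared value plus a version counter that changes on every commit."""

    def __init__(self, value: int = 0) -> None:
        self.value = value
        self.version = 0
        # Stands in for the hardware's atomic commit; not part of any TSX API.
        self._commit_lock = threading.Lock()

    def snapshot(self):
        """Return a consistent (version, value) pair to start a transaction from."""
        with self._commit_lock:
            return self.version, self.value

    def try_commit(self, seen_version: int, new_value: int) -> bool:
        """Publish new_value only if nobody committed since seen_version."""
        with self._commit_lock:
            if self.version != seen_version:
                return False  # conflict detected: the transaction aborts
            self.value = new_value
            self.version += 1
            return True


def run_transaction(cell: VersionedCell, update) -> int:
    """Speculatively apply update(old) -> new, restarting on conflict.

    Returns the number of aborts that happened before the commit succeeded.
    """
    aborts = 0
    while True:
        seen_version, old = cell.snapshot()
        new = update(old)  # speculative work, done without holding any lock
        if cell.try_commit(seen_version, new):
            return aborts
        aborts += 1        # someone else got there first, so start over


def demo(threads: int = 4, increments: int = 1000) -> int:
    """Run several threads that each increment a shared cell transactionally."""
    cell = VersionedCell()

    def worker() -> None:
        for _ in range(increments):
            run_transaction(cell, lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.value
```

The point of the analogy is that the speculative work happens without blocking anyone, and a thread only pays a penalty (a restart) when a real conflict occurs. The appeal of doing this in hardware is that the detection and rollback bookkeeping would not have to be written by the programmer at all.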
While this might be a pretty specific topic to discuss with our audience, the implications are impressive. Increasing the parallelization of software is one of the key issues holding back innovation on many levels. We have seen the GPU vendors fight this (think CUDA) for years, and Intel's continued push into the MIC (Many Integrated Core) markets will require it as well.
Graphics System Improvements
Perhaps more important than the x86 core changes are the improvements Intel has made with regard to the integrated processor graphics. While Ivy Bridge was rumored to be the death knell for discrete GPUs in the mobile market, both NVIDIA and AMD were able to find a place to market and sell their parts. Haswell looks to be much less forgiving.
The truth is that the graphics and media overview for Haswell is very similar to that of Ivy Bridge, including the same six-domain partitioned architecture we saw at IDF last year. Domain 1 includes the typical setup and front-end action, domain 2 handles rasterization, and domain 3 has the compute units (shaders) that Intel calls Execution Units (EUs). The fourth domain has the codec engine, domain 5 is for video enhancement, and domain 6 is for displays.
This iteration will include support for DirectX 11, OpenCL 1.2, and OpenGL 4.0.
This segmentation of the processor graphics allows for the same kind of modularity that the entire Haswell design is dependent on. While the GT1 and GT2 options will still exist (as they do today with Ivy Bridge), the new hotness is the GT3 option, which essentially doubles the computing power of the GPU by adding another of what Intel calls a "slice".
As I mentioned before, Haswell has decoupled the ring interconnect from the CPU cores, so the GPU is able to pull more bandwidth over that bus to increase memory throughput without increasing the voltage to the CPU cores. Doing so lowers overall power consumption.
Obviously the setup stages of the processor graphics needed to be improved in order to handle the increased performance of the GT3 iteration, so Intel has doubled the performance of most fixed function units. The setup is able to push about 500 GB/s of internal bandwidth, which should be enough to keep the execution units (EUs) of the GT3 fed.
Finally, the texture sampler on the new processor graphics will see as much as a 4x improvement for some modes.
During Dadi Perlmutter's keynote today, we did see a comparison between Ivy Bridge and Haswell running the DX11 Unigine Heaven benchmark – though no specific settings were given.
Though you can't really see it in a still photo, the Haswell result was easily a 2-3x improvement in frame rate, judging solely by how the animation looked. While we will likely have to wait until Q1 2013 or later to get the full details, Haswell's graphics performance looks impressive.