Something Old, Something New
Hey, I Remember You!
CPU design has advanced by leaps and bounds since Intel's original 8086. From those simple beginnings we now have products with large caches, complex out-of-order pipelines, SIMD/MIMD structures that accelerate specific workloads, and multiple cores on a single die. Clock speeds have also gone from the 5 and 10 MHz range up to the current 3 GHz+. When Intel announced that it was not creating a new architecture for Larrabee from the ground up, it caught many people by surprise. Most surprising of all is that Intel is basing the core design on the old Pentium architecture. When I say Pentium, I mean the original Pentium, which started at 60 MHz and scaled up to 200 MHz.
The original Pentium is a dual issue, superscalar, in-order architecture which made extensive use of large (for the time) L1 caches. Overall it was a somewhat different and forward thinking approach compared to the previous 286/386/486 architectures. While it was not all that much faster than higher clocked 486s at the time, it turned out to scale very well and soon left the older architectures in the dust. Intel went on to use the Pentium name for later products that were not based on the original Pentium architecture, but the original design proved to be a more than adequate performer for the marketplace of its time.
The scalar unit (the Pentium core) and vector unit both share the same L1 and L2 caches, along with a single connection to the ring bus.
When Intel was considering what to use to process graphics workloads, as well as potential stream and high performance computing workloads, the most obvious answer was to use multiple cores on one die. Instead of creating new units as NVIDIA and AMD did with their stream processors, Intel chose to go back to its own portfolio and reuse existing IP. This would sidestep most of the potential problems of developing a new architecture from the ground up, and would also avoid conflicts with IP held by NVIDIA and AMD. So the first decision that Intel made was to use the old Pentium architecture to create a chip with many, many cores, all running as efficiently and simply as possible. Because the Pentium is an in-order architecture, its core is quite a bit smaller than an out-of-order design (which delivers better single core performance, but requires many more transistors and a lot more die space to implement well).
Shoving 16+ Pentium cores onto a single die does not, however, make for a good graphics part. X86 integer and floating point units are not well suited to the kind of work that 3D graphics demands. Intel realizes this, and has made some major modifications which form the basis of the Larrabee architecture.
The core architecture that Larrabee is based on is of course the dual issue, in-order Pentium architecture. Intel is planning on putting many of these cores on a single die to provide the performance and power needed to push pixels onto our screens, as well as provide a common processing architecture for high performance and stream computing. Intel has also added a vector unit to the core which will handle the majority of the floating point grunt work which is required by current and future 3D graphics workloads. Just as Intel did with the original Pentium and MMX, it is adding on extensions to support the vector unit.
The vector unit is capable of handling sixteen 32 bit operations per clock, which gives it a pretty hefty theoretical floating point throughput. These vector units can execute 32 bit integer, 32 bit floating point, and 64 bit floating point instructions. This gives them quite a bit of flexibility in dealing with workloads that can be either integer or floating point based. This is in contrast to the stream processors of both AMD and NVIDIA, which are floating point based (integer formats are converted to floating point in hardware).
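To put that "hefty theoretical output" into numbers, here is a rough back-of-the-envelope sketch. Only the sixteen lanes per vector unit comes from the description above. The core count of 16 is simply the low end of "16+", and the 1 GHz clock is a purely hypothetical figure chosen to make the arithmetic easy, so treat the result as an illustration of the formula rather than a spec.

```python
def peak_vector_ops_per_second(cores, lanes, clock_hz, ops_per_lane_per_clock=1):
    """Theoretical peak: every lane of every core completes an operation each clock."""
    return cores * lanes * ops_per_lane_per_clock * clock_hz


LANES = 16                 # from the article: sixteen 32 bit operations per clock
ASSUMED_CORES = 16         # assumption: low end of the "16+" cores mentioned
ASSUMED_CLOCK_HZ = 1.0e9   # assumption: hypothetical 1 GHz, not a stated clock speed

peak = peak_vector_ops_per_second(ASSUMED_CORES, LANES, ASSUMED_CLOCK_HZ)
print(f"{peak / 1e9:.0f} billion 32 bit operations per second")
```

With these assumed inputs the script prints 256 billion operations per second. Real throughput would be lower, since a peak figure like this assumes every lane is busy on every clock.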
Texturing is a very intensive operation which is most definitely not suited to pure X86 computing. The texture unit on Larrabee is one of the few fixed function units located on the chip. As such it does not feature the programmability that current AMD and NVIDIA parts have, but it does seem to cover the basics of texturing. Filtering and decompression are the primary functions of this unit. Without it, the CPU and vector portions would take 12x to 40x as long to perform these operations as a fixed function unit dedicated to these workloads.
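To get a feel for why filtering is so expensive in software, consider a minimal sketch of bilinear filtering, the most basic texture filter. This is a generic textbook version and not Intel's implementation. The function name, the single channel texture stored as a list of rows, and the simplified coordinate mapping are all assumptions made for illustration.

```python
import math


def bilinear_sample(texture, u, v):
    """Sample a single channel texture (list of rows) at normalized (u, v) in [0, 1]."""
    height = len(texture)
    width = len(texture[0])

    # Map normalized coordinates into texel space (simplified convention).
    x = u * (width - 1)
    y = v * (height - 1)

    # Find the four surrounding texels, clamping at the texture edge.
    x0 = int(math.floor(x))
    y0 = int(math.floor(y))
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)

    # Fractional position between the texels.
    fx = x - x0
    fy = y - y0

    # Blend horizontally on both rows, then blend the two rows vertically.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 1.0],
           [2.0, 3.0]]
print(bilinear_sample(texture, 0.5, 0.5))  # 1.5, the average of all four texels
```

Even this simplest case needs four texel reads plus a handful of multiplies and adds for every sample and every color channel, and that is before any decompression of the texture data. Work this regular and this frequent is exactly what dedicated hardware handles well.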
The ring bus connects all the functional portions of the chip, allowing for fast and relatively low latency communications. Because of the relatively large caches of the individual cores, much of the remaining latency is covered up.
Every other rendering function is performed by the X86 cores and vector units. Basically, other than texturing, every operation in the rendering pipeline is programmable. This could be a real blessing or a real curse for Intel. We have seen how AMD initially fared with its programmable color resolve for MSAA, and Intel is taking this several steps further. In theory, having a fully programmable raster unit and frame buffer could unleash some unique and interesting effects in 3D rendering, but it could also impose a speed penalty on many common usage models and applications.
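As a rough illustration of what "programmable" means for something like the MSAA color resolve, here is a small sketch in which the resolve step is just a function that can be swapped out. This is not how Larrabee or any AMD part actually implements it. The buffer layout (a list of rows, each pixel holding a list of sample values) and the function names are hypothetical.

```python
def box_resolve(samples):
    """Conventional resolve: plain average of a pixel's samples."""
    return sum(samples) / len(samples)


def resolve_framebuffer(sample_buffer, resolve=box_resolve):
    """Collapse a multisampled buffer to one value per pixel using any resolve function."""
    return [[resolve(pixel_samples) for pixel_samples in row] for row in sample_buffer]


# One row of two pixels with four samples each (hypothetical 4x MSAA data).
buffer_4x = [[[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]]

print(resolve_framebuffer(buffer_4x))               # [[0.5, 0.0]]
print(resolve_framebuffer(buffer_4x, resolve=max))  # [[1.0, 0.0]]
```

The flexibility is obvious, since any function can be dropped in for the resolve. So is the potential cost: the common box filter case now runs as general purpose code instead of on hardware built to do only that.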