80 Cores at 4 GHz
Intel has released new information on its Terascale processing project, including the ability to run the 80-core processor at speeds up to 4 GHz!
During the Fall Intel Developer Forum in San Francisco this past September, Intel started to unveil information on its terascale processing projects. Terascale computing is, roughly, the processing of terabytes of data on a single machine, which requires teraflops of compute power. Intel initially told us that the research being done by the terascale team was not intended for a particular product and might not even result in a sellable one, but as we have been getting more and more information, the likelihood of seeing this technology soon is increasing.
If you haven’t already, I would HIGHLY recommend you look over my original terascale computing article, as it covers the basics of how such an architecture functions and the benefits and drawbacks it offers for computing. In fact, the product that is being showcased with Intel’s announcement today is basically the same thing we saw at IDF last year; only now we are getting data on frequencies and computing horsepower that were left out before.
I will be including the terascale background data (the general information that leads up to today’s announcement) after the new information shown here, so look to the remainder of the article should you need more information.
An 80-tile 1.28 TFLOPS CPU
Yes, we are indeed looking at what is essentially an 80-core processor; one of the world’s first and probably most exciting. The basic architecture of the 80-tile design is based on the ideas of a NoC architecture, or Network-on-Chip, that contains hundreds of processing elements with integrated on-die communications. The tiles are arranged in a 10×8 2D mesh and can operate at speeds up to 4 GHz.
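To make the mesh layout a little more concrete, here is a minimal Python sketch of a 10×8 grid of tiles. It is only an illustration, not Intel's routing logic: the orientation (10 columns by 8 rows), the function names, and the assumption that each router links to its north, south, east and west neighbors with one further port reserved for the local processing engine are all mine.

```python
# Illustrative model of a 10x8 2D mesh of tiles (assumed orientation:
# 10 columns by 8 rows). Names and port mapping are hypothetical.
COLS, ROWS = 10, 8

def mesh_neighbors(x, y):
    """Return coordinates of the tiles directly linked to tile (x, y)."""
    candidates = [(x, y - 1), (x, y + 1), (x - 1, y), (x + 1, y)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < COLS and 0 <= cy < ROWS]

def hop_count(src, dst):
    """Minimum number of tile-to-tile hops (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

total_tiles = COLS * ROWS                      # 80 tiles in total
corner_links = len(mesh_neighbors(0, 0))       # corner tiles have 2 neighbors
interior_links = len(mesh_neighbors(4, 4))     # interior tiles have 4 neighbors
worst_case_hops = hop_count((0, 0), (COLS - 1, ROWS - 1))  # corner to corner
```

Under these assumptions, a message travelling between opposite corners of the mesh crosses 16 links, which is why the on-die network matters so much in a design like this.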
The above image is a breakdown of a single tile of the 80-core chip and it shows all the various networking and data components. Each of the 80 tiles consists of a processing engine connected to a 5-port router for passing data amongst the tiles with a bandwidth of up to 256GB/s. Each tile’s processing engine (PE) has two single-precision floating point units. For data storage, the PE includes a 3KB instruction memory and a 2KB data memory.
Each of the FPMACs (floating point multiply-accumulate units) has a 9-stage pipeline that can sustain one multiply-add result (2 FLOPs) every cycle. With dual FPMACs in each PE, a tile can provide 16GFLOPS of aggregate performance at the peak 4 GHz clock speed.
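The peak numbers follow directly from the figures Intel provided, and the arithmetic is easy to check. The short Python sketch below simply multiplies them out; the function and constant names are my own.

```python
# Peak throughput arithmetic using the figures quoted in the article.
FLOPS_PER_FPMAC_PER_CYCLE = 2   # one multiply-add counts as 2 FLOPs
FPMACS_PER_TILE = 2
TILES = 80

def tile_gflops(clock_ghz):
    """Peak GFLOPS of a single tile at the given clock speed in GHz."""
    return FLOPS_PER_FPMAC_PER_CYCLE * FPMACS_PER_TILE * clock_ghz

def chip_tflops(clock_ghz):
    """Peak TFLOPS of the whole chip with all tiles active."""
    return tile_gflops(clock_ghz) * TILES / 1000.0

print(tile_gflops(4.0))   # 16.0 GFLOPS per tile
print(chip_tflops(4.0))   # 1.28 TFLOPS for the chip
```

The same formula puts the chip just over the 1 TFLOPS mark at roughly 3.13 GHz.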
This unit is the 5-port wormhole-switched router that is able to provide 80GB/s of total bandwidth across the chip.
The chip’s clocking scheme uses mesochronous timing so that the tiles can communicate with each other independent of their clock timings. The PLL (phase-locked loop) responsible for the clocks feeds both the horizontal and vertical axes (called spines) and distributes the clock in the timing pattern shown on the right hand side.
One of the most interesting parts of this chip’s design is the amount of power control that has gone into it. Fine-grained clock gating, sleep transistors and enhanced circuits all combine to reduce the power the chip uses, entirely in hardware. In fact, each of the 80 tiles has 21 smaller sleep sections that can be activated separately, and the tiles use a 6-cycle pipeline wakeup sequence.
This sleep cycle method serves two purposes: 1) it mitigates the current spikes that might arise from 80 cores waking up simultaneously and 2) it allows the FPMAC execution (data processing) to start only a single cycle into that wakeup sequence. Essentially, each tile can begin processing data before the rest of it wakes up. In all, about 90% of the FPMAC transistors and 74% of each PE’s total transistors are sleep-enabled.
Even more impressive, this chip is able to achieve incredibly high clock speeds on modest power usage. Running at 1.0 V and 110 degrees C, the maximum tile frequency is 3.13 GHz, while at 1.2 V the tiles can run at 4.0 GHz. That brings the peak processing performance with all 80 tiles working on block matrix operations to 1.0TFLOPS at 1.0 V and 1.28TFLOPS at 1.2 V. Power consumption at these levels is estimated at 98W and 181W respectively. The graph above also shows the peak performance per watt of 27GFLOPS/W at 0.6 V, with only 11W of dissipation.
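To put those power figures in perspective, here is a quick Python sketch that works out the efficiency at each quoted operating point. The variable and function names are mine, and the last line assumes that the 27GFLOPS/W and 11W figures describe the same 0.6 V operating point, which the data I have does not state outright.

```python
# Efficiency at the quoted operating points.
operating_points = [
    # (supply voltage in V, peak TFLOPS, estimated power in W)
    (1.0, 1.0, 98.0),
    (1.2, 1.28, 181.0),
]

def gflops_per_watt(tflops, watts):
    """Convert a TFLOPS figure and a power figure into GFLOPS per watt."""
    return tflops * 1000.0 / watts

for volts, tflops, watts in operating_points:
    efficiency = gflops_per_watt(tflops, watts)
    print(f"{volts:.1f} V: {efficiency:.1f} GFLOPS/W")

# Assumption: 27 GFLOPS/W and 11 W belong to the same 0.6 V point,
# which would imply roughly 27 * 11 = 297 GFLOPS of throughput there.
implied_gflops_at_low_voltage = 27 * 11
```

That works out to roughly 10.2 GFLOPS/W at 1.0 V and 7.1 GFLOPS/W at 1.2 V, so the last 28% of performance costs a disproportionate amount of power.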
Finally, we have a layout of the chip itself that measures only 275mm^2 in area; that is 3mm^2 for each tile with some additional I/O area added in. Built on a 65nm process technology and using standard copper interconnects, this chip is designed with a unique 1248-pin LGA package design and uses 100 million transistors.
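As a rough sanity check on the die area, the following Python sketch splits the total between the tiles and everything else. It uses only the numbers above; treating the entire remainder as I/O and other non-tile logic is my assumption.

```python
# Die area budget from the quoted figures.
DIE_AREA_MM2 = 275.0
TILE_AREA_MM2 = 3.0
TILES = 80

tile_area_total = TILE_AREA_MM2 * TILES           # area taken by all tiles
remaining_area = DIE_AREA_MM2 - tile_area_total   # assumed I/O and other logic
tile_fraction = tile_area_total / DIE_AREA_MM2    # share of the die used by tiles

print(f"tiles: {tile_area_total:.0f} mm^2, other: {remaining_area:.0f} mm^2")
print(f"tiles occupy {tile_fraction:.0%} of the die")
```

So the tiles account for about 240mm^2, leaving about 35mm^2 for the rest of the chip.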
For those of you interested, I have uploaded the publicly available files on this announcement in PDF form, here.
This information that reached my inbox tonight is revolutionary beyond what I expected to see after being introduced to the technology late last year.
Here is a direct quote from the Intel PR:
“Intel has no plans to bring this exact chip designed with floating point cores to market. However, the company’s terascale research is instrumental in investigating new innovations in individual or specialized processor or core functions, the types of chip-to-chip and chip-to-computer interconnects required to best move data and, most importantly, how software will need to be designed to best leverage multiple processor cores. This Teraflops research chip offered specific insights in new silicon design methodologies, high-bandwidth interconnects and energy management approaches.”
Again, Intel is adamant about this product NOT being designed with any specific purpose in mind, but I think they would be crazy not to further develop this technology for areas that could use the kind of processing power it provides. We’ve already heard talk about Intel going into the discrete GPU business, and our original look at the terascale computing projects covered how this chip could handle real-time ray tracing. Such applications, as well as all kinds of super-computing algorithms, could benefit from TFLOPS-class performance on a single chip.
Here is another quote to get excited about:
“Further Tera-scale research will focus on the addition of 3-D stacked memory to the chip as well as developing more sophisticated research prototypes with many general-purpose Intel® Architecture-based cores. Today, the Intel® Tera-scale Computing Research Program has more than 100 projects underway that explore other architectural, software and system design challenges.”
Adding 3D memory to the terascale processor is a requirement for keeping the chip’s huge amount of processing power fed with data to actually work on. It will also be interesting to see how Intel might apply a more generic x86-like architecture to such a tiled design to bring this kind of power to even more of the users who demand it.
In all, this new announcement only adds to the allure of such 80-core processors, even given the very specific uses they are suited to in today’s world. As data and storage continue to grow, though, the ability to process terabytes of information with teraflops of CPU power is going to move from mere theory to reality.
Again, be sure to continue reading here to see my initial information on the terascale computing projects from Intel. There is much more detail on aspects like inter-tile communications methods, stacked memory and memory interfaces, terascale capable workloads and program development with terascale CPUs in mind.
Continue Reading – Terascale Architecture