Defining the Architecture (cont’d)
Currently, these processors are only in the prototyping stage, but Intel has obviously done a lot of work on whether it can actually build these devices. The company has been simulating and developing design methods that lend themselves to terascale-friendly development environments. Two of these methods address the on-die interconnect fabric (how all the cores talk to one another) and the memory bandwidth a 100+ core processor needs to run efficiently.
Here is the first step of the design: creating a tiled processor architecture, much like the images and diagrams we have shown you thus far. This chip has an 8×10 arrangement of cores, for a total of 80, each with a compute element and a ‘router.’ This tiled design allows the cores to run in a mesochronous fashion; that means they run at the same clock frequency, but without having to know what phase the other cores are in. In other words, the clock edges of different cores can be slightly offset from one another without causing a major error in the system.
The compute element is the piece that does all the ‘work’ each core is assigned, while the router is left with the management position. It tells the core what to work on and is responsible for communication with the rest of the cores to share data or distribute processor load. Note that current designs show each core talking only to its immediate neighbors: above, below, left and right. Creating an interconnect between 80 cores is not a simple task, especially one that allows fast and easy communication.
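To make the neighbor-only wiring concrete, here is a minimal Python sketch of the tile grid. It assumes the 8×10 arrangement means 8 rows by 10 columns (the article does not say which way around it is), and the function names are our own for illustration, not anything from Intel.

```python
# Sketch of a tiled layout where each tile links only to its
# immediate neighbors. ROWS/COLS orientation is an assumption.
ROWS, COLS = 8, 10  # 8x10 arrangement, 80 tiles in total


def neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the tiles directly above, below, left and right of (row, col)."""
    candidates = [
        (row - 1, col),  # above
        (row + 1, col),  # below
        (row, col - 1),  # left
        (row, col + 1),  # right
    ]
    # Tiles on the edge of the die simply have fewer neighbors.
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def link_count(rows=ROWS, cols=COLS):
    """Count the distinct neighbor-to-neighbor links in the grid."""
    horizontal = rows * (cols - 1)
    vertical = cols * (rows - 1)
    return horizontal + vertical


if __name__ == "__main__":
    print(len(neighbors(0, 0)))  # corner tile
    print(len(neighbors(3, 4)))  # interior tile
    print(link_count())
```

Under these assumptions a corner tile has two neighbors, an interior tile has four, and the whole grid needs 142 links, far fewer than wiring every core directly to every other core.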
The interconnect fabric between all of these cores is also an interesting subject. How DO you build a communications scheme that is scalable and fast for anywhere from 20 up to 100 cores? Intel devised a 2D scheme (each core talks only to its neighbors) that is also aware of the power states of the cores. You can see that each core has communication links to its neighbors, to a synchronizer (to allow for mesochronous clocks) and to some attached static RAM (SRAM).
How do you spread information to all of these cores if there are no primary cores (like in the Cell architecture) and if all the cores only connect to their neighbors?
One method Intel is investigating is the use of a ring topology, seen in the image above. It provides the fastest point-to-point connection method with the fewest traces and connections, and it is a well-established network topology in computer science.
Another method for interconnects is a mesh. Data traveling between far-away cores has to make multiple hops, but this method allows the CPU to route around failed or very busy cores.
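As a rough illustration of how a mesh can steer traffic around a failed or busy core, here is a small Python sketch that counts the fewest hops between two tiles using a breadth-first search. This is our own simplified model, not Intel's routing logic: the function name mesh_hops, the blocked set and the 8-row by 10-column orientation are all assumptions made for the example.

```python
from collections import deque


def mesh_hops(src, dst, rows=8, cols=10, blocked=frozenset()):
    """Fewest neighbor-to-neighbor hops from src to dst on a 2D mesh.

    Tiles listed in `blocked` (failed or very busy cores) are skipped.
    Returns None when no route exists.
    """
    if src in blocked or dst in blocked:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        (r, c), hops = queue.popleft()
        if (r, c) == dst:
            return hops
        # Each tile can only forward to the tile above, below, left or right.
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


if __name__ == "__main__":
    print(mesh_hops((0, 0), (7, 9)))                    # corner to corner
    print(mesh_hops((0, 0), (0, 2)))                    # direct route
    print(mesh_hops((0, 0), (0, 2), blocked={(0, 1)}))  # detour around a bad tile
```

In this model the corner-to-corner trip takes 16 hops, and blocking the tile at (0, 1) stretches the short trip from (0, 0) to (0, 2) from 2 hops to 4 instead of cutting it off entirely.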
What about that memory bandwidth? We told you on the first page that Intel would need 1.2 Terabytes/s of memory bandwidth to keep this many cores busy and running efficiently. Intel’s answer is to attach 256 Mbit of SRAM directly to EACH core.
By doing so, Intel is able to create a 40 GB/s per tile memory subsystem, totaling more than 3 Terabytes/s of memory bandwidth across all 80 cores!
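The arithmetic behind that total is easy to check. The short Python sketch below uses only the figures quoted in this article (80 cores, 40 GB/s per tile, a 1.2 Terabytes/s requirement) and assumes decimal units (1 TB = 1000 GB) and that the per-tile figures simply add up, which is a best case.

```python
# Back-of-the-envelope check of the aggregate memory bandwidth.
# Assumes decimal units and that per-tile bandwidth sums linearly.
CORES = 80             # tiles on the prototype
PER_TILE_GB_S = 40     # GB/s of SRAM bandwidth per tile
REQUIRED_TB_S = 1.2    # TB/s target quoted earlier in the article


def aggregate_bandwidth_tb_s(cores=CORES, per_tile_gb_s=PER_TILE_GB_S):
    """Total bandwidth across all tiles, in TB/s."""
    return cores * per_tile_gb_s / 1000


total = aggregate_bandwidth_tb_s()
headroom = total / REQUIRED_TB_S

if __name__ == "__main__":
    print(f"Aggregate: {total:.1f} TB/s")
    print(f"Headroom over the {REQUIRED_TB_S} TB/s target: {headroom:.2f}x")
```

That works out to 3.2 TB/s in aggregate, a bit more than two and a half times the 1.2 TB/s target.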
Intel has said the system could potentially evolve into a ‘one router, many cores’ technology, just as we have seen in the current Cell processor design.