Massive data sets, massive processing, massive bandwidth
We have been talking about tera-scale technologies since 2006 when it comes to Intel research programs. The name is perhaps more grandiose than the actual idea: as data sets increase in size, computing technologies capable of handling that much data will need to be created. It is no secret that the CPU as it exists today simply can’t handle the massive amounts of parallel information that will soon become normal operating procedure. NVIDIA and AMD will tell you that their GPUs are lined up well to address this problem, but Intel is thinking even beyond that. At this week’s International Solid-State Circuits Conference (http://www.isscc.org/isscc/index.htm) in San Francisco, Intel will be presenting some research papers on how to address these concerns, and we wanted to give you an early preview of some of them.
While we know that tera-scale processors will likely have many processing cores on them, a pitfall of this design is how the cores communicate with each other internally on the die. Intel is hoping that the research they are doing in data sharing among many cores will lead to a definitive solution. The design once dubbed the “Single-Chip Cloud Computer” (we actually wrote about it in December 2009) utilizes a mesh network and message buffering system to pass data from core to core without direct connections between ALL of the cores. This method allows data to transfer between cores as much as 15x faster than when using main memory.
The problem, of course, is that this method introduces a bit of latency and requires each “router” to have a non-trivial amount of storage for data handling. The current iteration of the packet-switching technology is up and running with 24 routers at 2.0 GHz and 48 IA cores at 1.0 GHz for a total of 2.0 Tb/s of bisection bandwidth.
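To make the mesh idea a little more concrete, here is a minimal Python sketch of dimension-ordered (XY) routing, one common way to move a packet across a 2D mesh from router to router. Intel did not describe its routing algorithm to us, so the XY scheme, the `xy_route` function and the 6 x 4 arrangement of the 24 routers are all assumptions made purely for illustration.

```python
def xy_route(src, dst):
    """Return the router coordinates a packet visits going from src to dst.

    Dimension-ordered routing: travel along X first, then along Y.
    (Illustrative only; not Intel's documented algorithm.)
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


# Hypothetical layout: 24 routers arranged as a 6 x 4 grid.
MESH_WIDTH, MESH_HEIGHT = 6, 4

route = xy_route((0, 0), (MESH_WIDTH - 1, MESH_HEIGHT - 1))
hops = len(route) - 1
print(hops)  # 8 hops from one corner of the grid to the opposite corner
```

Each hop passes through a router that has to buffer the packet, which is where the latency and the per-router storage mentioned above come from.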
The other network-on-a-chip method in the image above uses circuit switching rather than packet switching; the benefits include the removal of packet delays and improved power efficiency. Making this all work in hardware is of course more complex, and so the approach is just starting prototyping deep inside Intel’s labs.
Another potential problem for tera-scale processing is not just internal communication but external communication, whether between chips in a multi-chip system or between other components. Imagine trying to have multiple tera-scale processors on the same motherboard attempting to pass information at rates similar to the internal architecture. A method that Intel is testing uses a direct chip-to-chip connection that does NOT go through the CPU socket and/or motherboard. Why? Intel says that by using this direct connection the high-speed communication can be accomplished with an order of magnitude better power efficiency. Intel seems to think they can reach a terabyte/sec of bandwidth with just 11 W of power as opposed to the 150 W previously thought necessary.
Intel has been able to test and verify as much as 470 Gb/s of chip-to-chip communication using just 0.7 W of power, an incredibly impressive feat. Not only does this method improve efficiency, but it also allows the link’s power consumption to drop to just 7% of its normal value while in sleep mode, and it “wakes” as much as 1000x faster than today’s options.
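A quick back-of-the-envelope check shows how these figures hang together. The short Python sketch below converts each quoted power and bandwidth pair into milliwatts per Gb/s (which is the same thing as picojoules per bit). It assumes "terabyte/sec" means a decimal terabyte, or 8,000 Gb/s; the function name is ours, not Intel's.

```python
def milliwatts_per_gbps(power_watts, bandwidth_gbps):
    """Link energy cost in mW per Gb/s, numerically equal to pJ per bit."""
    return power_watts * 1000.0 / bandwidth_gbps


# Assumption: one terabyte per second = 8,000 Gb/s (decimal units).
TERABYTE_PER_SEC_GBPS = 8000

measured = milliwatts_per_gbps(0.7, 470)                     # tested link
projected = milliwatts_per_gbps(11, TERABYTE_PER_SEC_GBPS)   # Intel's target
previous = milliwatts_per_gbps(150, TERABYTE_PER_SEC_GBPS)   # older estimate

print(f"measured:  {measured:.2f} mW per Gb/s")
print(f"projected: {projected:.2f} mW per Gb/s")
print(f"previous:  {previous:.2f} mW per Gb/s")
print(f"improvement: {150 / 11:.1f}x")
```

Under that assumption the tested link works out to roughly 1.5 mW per Gb/s, the 11 W target to about 1.4, and the older 150 W estimate to nearly 19. The ratio of about 13.6x matches the "order of magnitude" claim.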
While interesting, this efficient I/O communication would likely come at the expense of interoperability and flexibility of system designs. A system built with this kind of direct chip-to-chip connection (instead of socket-based connections) might limit options down the road, much like integrating memory controllers onto CPUs has done for motherboards.
Intel has an interesting scenario built around optimizing task processing for many-core processors based on frequency and leakage variations on a per-core basis. Imagine an 80-core processor: not all cores are going to be able to run at the same frequency reliably. In traditional chip design thinking, Intel would lower the performance of ALL cores to match the worst-case scenario in order to keep the product reliable and up to Intel’s standards. What if you let each core run at its own theoretical maximum and instead managed the threads and tasks independently?
Intel is looking at “thread hopping” technologies that would put priority on certain threads and tasks and place them on the cores that better suit the overall system. The cores that can run the fastest would be loaded with the highest priority tasks, and as those complete, the remaining threads would move from slower cores to the faster ones (shown in red in our image above). If the system would like to run in a more power efficient mode, the CPU could map tasks only to those cores that exhibit the least amount of leakage; there are a lot of directions this idea could take. Intel claims that a CPU could save anywhere from 6-35% of its energy consumption by mapping work to the best set of cores for each task.
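The mapping idea can be sketched in a few lines of Python. This is only a toy model under our own assumptions: the `Core` and `Task` types, the `map_tasks` function, the two mode names and all of the numbers are hypothetical, and Intel has not told us how its scheduler actually ranks cores.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    max_ghz: float      # highest reliable frequency for this core
    leakage_mw: float   # static leakage power for this core


@dataclass
class Task:
    name: str
    priority: int       # larger number = more important


def map_tasks(tasks, cores, mode="performance"):
    """Pair tasks with cores and return {task name: core_id}.

    performance: the most important tasks go to the fastest cores.
    efficiency:  tasks go to the cores with the least leakage.
    """
    if mode == "performance":
        ranked_cores = sorted(cores, key=lambda c: c.max_ghz, reverse=True)
    elif mode == "efficiency":
        ranked_cores = sorted(cores, key=lambda c: c.leakage_mw)
    else:
        raise ValueError(f"unknown mode: {mode}")
    ranked_tasks = sorted(tasks, key=lambda t: t.priority, reverse=True)
    # zip stops at the shorter list, so surplus tasks simply wait for a core.
    return {t.name: c.core_id for t, c in zip(ranked_tasks, ranked_cores)}


# Made-up per-core measurements for a three-core example.
cores = [Core(0, 3.0, 900.0), Core(1, 3.4, 1200.0), Core(2, 2.8, 700.0)]
tasks = [Task("render", 5), Task("logging", 1)]

print(map_tasks(tasks, cores, "performance"))  # {'render': 1, 'logging': 0}
print(map_tasks(tasks, cores, "efficiency"))   # {'render': 2, 'logging': 0}
```

Calling `map_tasks` again once "render" has finished would move "logging" from core 0 up to the faster core 1, which is the "hop" in thread hopping.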
Finally, the last option Intel discussed with us today was the idea of having processors adapt to extreme conditions. In what could be described as a more aggressive form of Turbo Mode, imagine a processor that is not tuned for the “worst case” as processors are today but instead assumes the best or nominal operating conditions. Instead of trying to prevent errors from occurring, the chip would be built to look for these potential errors and problems, detect and handle them, and then change the operating parameters as needed.
If, for example, a CPU notices a voltage error, it would drop the frequency and then “replay” the operation to get the necessary result without the error. Obviously this needs some very complete monitoring utilities on-chip, but Intel started this trend with the on-die power monitoring in the Nehalem architecture last year. It would also require a much more robust system of data monitoring in order to enable the “replay” option. Intel does think this method could offer either a 40% performance improvement or a 21% energy use reduction.
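In software terms, the detect-and-replay loop looks something like the Python sketch below. The real mechanism lives in hardware, so treat this as an analogy only: `run_with_replay`, its parameters, the 100 MHz step and the toy fault threshold are all our own inventions for illustration.

```python
def run_with_replay(operation, error_detected, start_mhz,
                    step_mhz=100, floor_mhz=1000):
    """Run `operation` at an optimistic clock, replaying at lower clocks on error.

    operation(freq_mhz)      -> result of the computation at that frequency
    error_detected(freq_mhz) -> True if the (simulated) monitors flagged a fault
    Returns (result, final frequency in MHz, number of replays).
    """
    freq = start_mhz
    replays = 0
    while True:
        result = operation(freq)
        if not error_detected(freq):
            return result, freq, replays
        if freq - step_mhz < floor_mhz:
            raise RuntimeError("no error-free operating point above the floor")
        freq -= step_mhz   # back off the clock...
        replays += 1       # ...and replay the same operation


# Toy stand-ins: this pretend chip misbehaves above 3200 MHz.
result, freq, replays = run_with_replay(
    operation=lambda f: 2 + 2,
    error_detected=lambda f: f > 3200,
    start_mhz=3400,
)
print(result, freq, replays)  # 4 3200 2
```

The chip starts optimistic and only pays for a replay when something actually goes wrong, instead of running slower all the time just in case.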
All of these technologies are likely years away from any real-world integration, but seeing a preview of what processors might be like in the future is always intriguing.