With the clock speed arms race now behind us, the world has turned to increasing the number of processor cores to boost performance. As more applications become multi-threaded, additional cores matter more than ever. In the consumer space, quad- and hexa-core chips are popular in the enthusiast segment, while on the server side, eight-core chips deliver extreme levels of performance.

Most multi-core processors give each CPU core access to its own cache (Intel's current-generation chips actually have three levels of cache, with the third level shared between all cores; that specific caching system, however, is beyond the scope of this article). This cache is extremely fast and keeps the processing cores fed with data, which the processor then runs through its assembly-line-esque instruction pipeline(s). The cache is populated through a method called "prefetching," which pulls data for running applications from RAM, using predictive algorithms to determine what the processor is likely to need next. Unfortunately, while these algorithms are usually correct, they sometimes guess wrong; the needed data is then not in the cache, and the processor must look for it elsewhere. These instances, called stalls, can severely degrade core performance, as the processor must reach out past the cache and into system memory (RAM), or worse, the even slower hard drive.

When the processor must reach beyond its on-die cache, it has to use the system bus to query the RAM for data. This processor-to-RAM bus, while faster than reading from a disk drive, is much slower than the cache. Furthermore, the bandwidth available between the CPU and the RAM is limited and shared: as the number of cores increases, the share of bandwidth each core can draw on is greatly reduced.
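To make the cost of a stall concrete, consider the short C program below. It is a sketch of our own, not code from the researchers, and it chases pointers through a buffer far larger than the on-die cache. The sequential pattern is exactly what a hardware prefetcher is built to predict; the randomized pattern defeats it, so nearly every load must wait on the CPU-to-RAM bus. The buffer size and timing method are illustrative choices.

```c
/* A minimal stall demonstration (our own sketch, not code from the
 * paper). The buffer is much larger than any on-die cache, so every
 * access the prefetcher cannot predict becomes a trip over the
 * CPU-to-RAM bus. Build with: cc -O2 stall_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N (1u << 22)   /* 4M elements (~32 MB), far bigger than the cache */

static uint64_t rng_state = 88172645463325252ULL;

/* Tiny xorshift generator, used only to build a permutation portably. */
static uint64_t xorshift(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Walk the cycle encoded in next[]; each load depends on the previous
 * one, so the core cannot overlap misses on its own. */
static double chase(const size_t *next) {
    volatile size_t sink;
    size_t j = 0;
    double t0 = seconds();
    for (size_t i = 0; i < N; i++)
        j = next[j];
    sink = j;            /* keep the loop from being optimized away */
    (void)sink;
    return seconds() - t0;
}

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Pattern 1: element i points to i+1. A stride this regular is
     * exactly what hardware prefetchers are built to recognize. */
    for (size_t i = 0; i < N; i++) next[i] = (i + 1) % N;
    printf("sequential chase: %.3f s\n", chase(next));

    /* Pattern 2: one random cycle (Sattolo's algorithm), so the next
     * address is unknowable until the current load completes; the
     * prefetcher is helpless and the core stalls on most steps. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t k = (size_t)(xorshift() % i);
        size_t tmp = next[i]; next[i] = next[k]; next[k] = tmp;
    }
    printf("random chase:     %.3f s\n", chase(next));

    free(next);
    return 0;
}
```

On a typical desktop system the random chase runs several times slower than the sequential one, and that gap is precisely the penalty the prefetcher normally hides.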


The layout of a current Sandy Bridge Intel processor. Note the cache and memory I/O.

A team of researchers at North Carolina State University has been studying these issues, which are inherent in multi-core processors. The team, part of the university's Department of Electrical and Computer Engineering, includes Fang Liu and Yan Solihin and was funded in part by the National Science Foundation. In a paper concluding their research, to be presented on June 9th, 2011 at the International Conference on Measurement and Modeling of Computer Systems, they detail two methods for improving upon current bandwidth allocation and cache prefetching implementations.

Dr. Yan Solihin, associate professor and co-author of the paper, stated that certain processor cores require more bandwidth than others; therefore, by dynamically monitoring the type and amount of data being requested by each core, the available bandwidth can be prioritized on a per-core basis. Solihin further stated that "by better distributing the bandwidth to the appropriate cores, the criteria are able to maximize system performance."
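Since the paper is not yet public, the exact criteria are unknown; the sketch below is a purely hypothetical illustration of the control loop such a scheme implies: periodically sample each core's memory demand, then grant bus bandwidth in proportion to it. The read_offcore_requests() and set_bandwidth_share() functions are invented placeholders, stubbed with canned numbers so the example runs, standing in for hardware counter reads and a memory-bus arbiter knob.

```c
/* Hypothetical per-core bandwidth rebalancing loop. The researchers'
 * actual criteria are unpublished; this only illustrates the shape of
 * the idea: measure demand per core, then split the bus accordingly. */
#include <stdio.h>

#define NUM_CORES 8

/* Stub: stands in for reading a hardware counter of off-chip memory
 * requests issued by a core since the last sample; real values would
 * come from performance-monitoring hardware. */
static unsigned long read_offcore_requests(int core) {
    static const unsigned long canned[NUM_CORES] =
        { 120, 900, 40, 40, 2200, 15, 300, 385 };
    return canned[core];
}

/* Stub: stands in for programming a memory-bus arbiter to grant a
 * core some fraction of the total CPU-to-RAM bandwidth. */
static void set_bandwidth_share(int core, double fraction) {
    printf("core %d -> %4.1f%% of bus bandwidth\n", core, fraction * 100.0);
}

int main(void) {
    unsigned long demand[NUM_CORES];
    unsigned long total = 0;

    /* Sample each core's recent memory demand. */
    for (int c = 0; c < NUM_CORES; c++) {
        demand[c] = read_offcore_requests(c);
        total += demand[c];
    }
    if (total == 0) return 0;   /* no memory traffic; leave shares alone */

    /* Hand out bandwidth in proportion to demand, so bandwidth-hungry
     * cores are not starved by an even split. */
    for (int c = 0; c < NUM_CORES; c++)
        set_bandwidth_share(c, (double)demand[c] / (double)total);

    return 0;
}
```

A real implementation would presumably also weigh the type of requests each core makes, as Solihin notes, not just their volume.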

Further, the team analyzed data from the processors' hardware counters and constructed a set of criteria that improve efficiency by dynamically turning prefetching on and off on a per-core basis, which frees up bandwidth for the cores that need it. By implementing both methods, the research team was able to improve multi-core performance by as much as 40 percent versus chips that do not prefetch data, and by 10 percent versus multi-core processors whose cores do prefetch data.
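The researchers' actual on/off criteria likewise remain unpublished until June 9th, but one plausible signal, used here purely for illustration, is prefetch accuracy: the fraction of prefetched cache lines a core actually goes on to read. The threshold, the counter fields, and the set_prefetch_enabled() helper in the sketch below are all invented for this example.

```c
/* Hypothetical per-core prefetch toggling based on counter data. The
 * accuracy metric and threshold are invented for illustration; the
 * paper's actual criteria may differ. */
#include <stdio.h>
#include <stdbool.h>

#define NUM_CORES 8
#define MIN_ACCURACY 0.40   /* below this, prefetches mostly waste bandwidth */

struct core_sample {
    unsigned long prefetches_issued;   /* lines the prefetcher pulled in */
    unsigned long prefetches_used;     /* of those, lines actually read  */
};

/* Stub counters so the sketch runs; real values would come from
 * performance-monitoring hardware. */
static struct core_sample sample_counters(int core) {
    static const struct core_sample canned[NUM_CORES] = {
        {1000, 950}, {1000, 120}, {500, 480}, {800, 200},
        {1200, 1100}, {300, 60}, {900, 850}, {700, 690},
    };
    return canned[core];
}

/* Stub: stands in for whatever hardware mechanism would enable or
 * disable a single core's prefetcher. */
static void set_prefetch_enabled(int core, bool on) {
    printf("core %d: prefetch %s\n", core, on ? "on" : "off");
}

int main(void) {
    for (int c = 0; c < NUM_CORES; c++) {
        struct core_sample s = sample_counters(c);
        double accuracy = s.prefetches_issued
            ? (double)s.prefetches_used / (double)s.prefetches_issued
            : 0.0;
        /* An inaccurate prefetcher burns shared CPU-to-RAM bandwidth
         * that a neighboring core could use, so switch it off. */
        set_prefetch_enabled(c, accuracy >= MIN_ACCURACY);
    }
    return 0;
}
```

The intuition matches the team's finding: a core whose prefetcher guesses poorly is consuming shared bandwidth that a neighboring core could put to better use.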

The researchers plan to detail their findings in a paper titled "Studying the Impact of Hardware Prefetching and Bandwidth Partitioning In Chip-Multiprocessors," which will be publicly available on June 9th. The exact algorithms and criteria they devised to reduce the number of processor stalls and increase bandwidth efficiency will be extremely interesting to analyze. It will also be interesting to see whether Intel or AMD implement any of these improvements in their future chips.