When AMD’s Bulldozer processors arrived, they were unable to best Intel’s fastest at most tasks. A number of users held out hope for Bulldozer, however, once it was discovered that Microsoft’s Windows 7 operating system was not optimized to take advantage of the multi-threaded execution scheduling engine. While MS has implemented this optimization in the Windows 8 kernel, the current stable release went without a fix until recently. The fix in question is available for Windows 7 and Windows Server 2008 R2 and can be downloaded here. Note that Service Pack 1 is a prerequisite for this hotfix.
Earlier, conservative indications suggested such a fix would add a 5% to 10% performance boost in multi-threaded applications. That number is based on estimates from people around the web comparing benchmarks between Windows 7 and the Windows 8 Developer Preview. If you are running a Bulldozer processor in your machine, be sure to apply this update and let us know how performance improves.
Image courtesy Ezio Melotti via Flickr.
Just wondering, will there be a follow-up article to determine whether this patch indeed gives the 5-10% increase?
I wouldn’t expect the patch to make any significant difference to single-threaded code, and I wouldn’t expect the patch to make any significant difference to code that loads up all 8 ‘cores’ equally. I would expect the patch to make the biggest difference to code that uses 4 out of 8 ‘cores.’
AFAIK, what the patch does is prefer to schedule busy threads in different modules. You’ll recall Bulldozer has 4 ‘modules’ with two ‘cores’ each. Turns out AMD would prefer that the scheduler not treat all BD ‘cores’ as equal. Turns out BD is happier and faster if only one ‘core’ per ‘module’ is heavily used. If this situation makes you want to equate the AMD term ‘module’ with the Intel term ‘core’ and equate the AMD term ‘core’ with the Intel term ‘thread,’ you are not alone. There are slightly more execution resources available to an AMD ‘core’ than to an Intel ‘thread,’ but AMD ‘cores’ get less L1 and L2 cache than Intel ‘cores,’ and the AMD L2 cache is significantly slower than Sandy Bridge’s (the AMD L1 cache is also 1 clock slower).
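To make the placement idea concrete, here is a rough Python sketch of "spread busy threads across modules first." It is only an illustration, not the actual Windows scheduler logic. The core numbering (cores 0 and 1 in module 0, cores 2 and 3 in module 1, and so on) and the function names `place_threads` and `modules_used` are assumptions made up for this example.

```python
# Illustrative sketch only: NOT the real Windows scheduler.
# Assumption: 4 modules x 2 cores, and core c belongs to module c // 2.
NUM_MODULES = 4
CORES_PER_MODULE = 2


def place_threads(num_busy, num_modules=NUM_MODULES,
                  cores_per_module=CORES_PER_MODULE):
    """Pick one core per busy thread, filling empty modules first."""
    total = num_modules * cores_per_module
    if not 0 <= num_busy <= total:
        raise ValueError("thread count must fit the available cores")
    # Visit the first core of every module before any second core.
    order = [module * cores_per_module + slot
             for slot in range(cores_per_module)
             for module in range(num_modules)]
    return order[:num_busy]


def modules_used(cores, cores_per_module=CORES_PER_MODULE):
    """Count how many distinct modules a set of cores touches."""
    return len({core // cores_per_module for core in cores})


if __name__ == "__main__":
    naive = list(range(4))      # treats all cores as equal: 0, 1, 2, 3
    spread = place_threads(4)   # module-aware: one core per module
    print("naive :", naive, "modules used:", modules_used(naive))
    print("spread:", spread, "modules used:", modules_used(spread))
```

With 4 busy threads, the naive placement packs them into 2 modules, while the module-aware placement gives each thread a module to itself. With 1 thread or with all 8 cores loaded the two approaches end up equivalent, which is why the 4-of-8 case is where a difference would be expected.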
IMHO, Bulldozer’s performance problems have mostly to do with the cache. Scheduling one thread per ‘module’ avoids a little bit of cache contention, but it does nothing about the slower L2 cache.
Actually, this note should be updated with the fact that the hotfix was pulled, since it was evidently posted by accident and is only one part of a two-part fix that isn’t yet complete.