You may remember the infamous Pentium FDIV bug, which could return incorrect results from certain floating-point divisions and caused much consternation among scientists back in 1994. Now there is a new bug to remember, found on Skylake processors, which can cause the processor to freeze during complex calculations such as those you would run in Prime95, for example if you contribute to the Great Internet Mersenne Prime Search project. The issue has been replicated on both Windows and Linux systems and on different motherboards, indicating that it does indeed come from the CPU. While a freeze is certainly better than an incorrect result, it is still inconvenient, and we hope that Intel's BIOS update will arrive soon. You can follow the detection and investigation of the bug, and what is being done about it, over at The Register.
"The good news is that the bug's triggered by complex workloads. It was turned up by prime number experts the Great Internet Mersenne Prime Search (GIMPS), who use Intel machines to identify and test new large prime numbers."
Here is some more Tech News from around the web:
- New AMDGPU Details & Looking Forward To Major Radeon Linux Improvements In 2016 @ Phoronix
- HTC Vive VR headset will be available to pre-order from 29 February @ The Inquirer
- Linux 4.4 kernel emerges with better support for Intel Skylake and Raspberry Pi @ The Inquirer
- Nvidia GPUs Can Leak Data From Google Chrome's Incognito Mode @ Slashdot
- Flat Camera Uses No Lens @ Hack a Day
- AVM FRITZ!Box 3490 AC1750 Gigabit Modem Router Review @ NikKTech
I really want to know what the impact of the fix really is.
The fix for FDIV was to turn off the floating-point unit, and that had a huge impact on performance.
The fix for the TSX problem was no more TSX, but since almost no code used it, that was not a big issue.
What is the fix this time?
Still waiting on the official word.
Two things:
1) the code being run is older AVX (introduced in Sandy Bridge) and not the new FMA3 code. So, if the fix slows down AVX, it shouldn’t affect FMA3–or at least one hopes.
2) we still don’t know quite what is happening.
The code that fails does so because it has explicit checks in it for possible hardware flaws. This is important because the code can run for *months* to perform one calculation. Any little slip up in the middle can waste all of that effort, so there are periodic checks to make sure nothing went wrong part way through–and if they did, the code can roll back to a last known good state and start back up.
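For anyone curious what that pattern looks like, here is a minimal sketch of the checkpoint-and-rollback idea described above. It is not Prime95's code: `run_with_checkpoints`, `make_flaky_step`, `check`, the check interval and the retry limit are all hypothetical names and values chosen for illustration, and the "hardware flaw" is simulated by corrupting one step.

```python
def run_with_checkpoints(step, check, state, total_steps, interval=100, max_retries=3):
    """Apply `step` total_steps times, validating the state every `interval` steps.

    On a failed check, roll back to the last known good state and retry.
    `state` is assumed to be immutable (an int here), so saving it is a plain assignment.
    """
    good_state, good_index = state, 0  # last known good checkpoint
    retries = 0
    i = 0
    while i < total_steps:
        state = step(state, i)
        i += 1
        if i % interval == 0 or i == total_steps:
            if check(state, i):
                good_state, good_index = state, i  # save a new checkpoint
                retries = 0
            else:
                retries += 1
                if retries > max_retries:
                    raise RuntimeError(f"check kept failing near step {i}")
                state, i = good_state, good_index  # roll back and redo the work
    return state


def make_flaky_step(fail_on_call):
    """Build a step function that counts up by one but glitches on one call."""
    calls = {"n": 0}

    def step(state, i):
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            return state + 1000  # simulated one-off hardware error
        return state + 1

    return step


def check(state, i):
    """Toy consistency check: after i good steps the counter must equal i."""
    return state == i


if __name__ == "__main__":
    result = run_with_checkpoints(make_flaky_step(250), check, 0, 1000)
    print("finished with state", result)
```

The glitch on call 250 is only noticed at the next check, at which point the work since the previous checkpoint is thrown away and redone. A program without such checks would simply carry the bad value through to the end.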
There may be code out there in the wild that is affected by this bug and we just don’t know about it because it doesn’t have similar checks in it. I hope Intel is forthcoming about what the problem was after they issue the microcode fix–actually, the fix is supposed to be in OEM hands already.
It says that the machine freezes, not that it fails a check. Are you saying that it is freezing because it is running checks? I would be interested to know what the issue is. The 14 nm process still seems to have issues. The higher clocked parts still seem to be in short supply. I don’t know if a hot spot or other heat issue would fit the problem.
The article is incorrect and is written by someone unfamiliar with the problem. The problem was first reported at the Mersenne forum (www.mersenneforum.org). I’ve been a member there since 2008 when we moved off the mailing list and to the forum.
To clarify, there is no *crashing* nor *freezing*. The program throws an error when it detects a problem and stops working on the selftest. This is what it’s supposed to do. I guess that doesn’t make as exciting a headline. 🙂
Overclockers were the first to detect the issue, but they tried the normal route to solving overclocking issues–increasing voltages, backing off clocks, etc. They even underclocked the chips and the problem persisted. It’s a logic problem on the chip, not likely something to do with the lithography.
They also ruled out thermal issues–they were running on water at 50C under full load and still had failures.
Thanks for the extra info. Honestly, I don't play with Mersenne, so I am not as familiar with the issue as I could be.
That said, from the sounds of it we are still not sure of the exact cause.
You’re welcome. I did a longer post over at Tech Report if you want more info on it. You’re right in that we don’t yet know exactly what goes wrong–just sort of how to make it do so.
I’m UID 20 over at the MersenneForums, so I do have a pretty good understanding of what’s going on at least with Prime95.
I’ve never put a link in to a post here before, so bear with me…
http://techreport.com/news/29585/prime95-can-cause-intel-skylake-cpus-to-freeze?post=959799
No worries, we like informative links.
Thanks for the info. It doesn’t sound like they have narrowed it down to a specific instruction or sequence of instructions yet. I wouldn’t expect a single instruction error to have made it through testing. Perhaps it comes out in a unique instruction sequence. I don’t know the Mersenne prime algorithm, but I would guess it involves arbitrary precision arithmetic. I can look it up later. It could be a small number of people that are affected, as this is probably even more specialized than the FDIV bug. Hopefully encryption code isn’t affected; I would think that would have been noticed already if that were the case. It would be better if it did crash instead of producing incorrect results. With very long running code that supports checkpoints, they sometimes run the same code twice, on different processors, to ensure the accuracy of the results. Depending on the algorithm, there are other checks that can be done. Although, did it fail the check, or did the checking code itself throw an exception or something?
I wrote more over at TR, check the link above for that.
To answer your question and your inferences, yes, Mersenne testing uses arbitrary precision arithmetic implemented as a special form of FFT. The FFT from Prime95 is used in other projects, so if that is where the mistake comes in then there are a good number of other programs that might have similar bugs.
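For readers who want to see the arithmetic involved: the primality test GIMPS is best known for is the Lucas-Lehmer test, which squares a number modulo 2^p - 1 over and over. The sketch below is only an illustration of that test using Python's built-in big integers. It does not use the FFT-based multiplication that Prime95 relies on for speed, and the helper names (`lucas_lehmer`, `is_small_prime`) are made up for this example.

```python
def lucas_lehmer(p):
    """Return True if the Mersenne number 2**p - 1 is prime.

    Assumes p itself is prime; for composite p, 2**p - 1 is never prime.
    """
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs an odd p
    m = (1 << p) - 1  # the Mersenne number being tested
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # one big squaring per iteration
    return s == 0


def is_small_prime(n):
    """Trial division, good enough for picking small candidate exponents."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


if __name__ == "__main__":
    exponents = [p for p in range(2, 130) if is_small_prime(p) and lucas_lehmer(p)]
    print("Mersenne prime exponents below 130:", exponents)
```

Every iteration depends on the one before it, so a single wrong squaring anywhere in the chain spoils the final answer. For exponents in the tens of millions, each squaring involves a number millions of digits long, which is where the FFT comes in.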
It is extremely unlikely that this will affect any form of encryption that I am familiar with. Even 4K RSA math is tiny compared to the numbers in Mersenne calculations.
The Great Internet Mersenne Prime Search (GIMPS) as a whole does use double checks run on different machines. When a potential Mersenne Prime is found a double check is run on a different machine and on a third machine of a completely different processor architecture using different code. That’s the computational version of “take off and nuke it from orbit” for any potential math or processor bugs. I’m familiar with this process because I did that check for the 38th discovered Mersenne Prime.
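As a rough illustration of the "different code" half of that double check, the sketch below computes the same Lucas-Lehmer residue two ways and refuses to give an answer unless they agree. A single Python process obviously cannot imitate the "different machine" half, and the function names and the choice of the two reduction methods are assumptions made for this example, not how GIMPS organizes its verification runs.

```python
def residue_plain(p):
    """Final Lucas-Lehmer residue for an odd prime p, reduced with the % operator."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s


def residue_folded(p):
    """Same residue, but reduced by folding bits, using 2**p == 1 (mod 2**p - 1)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = s * s
        while s > m:
            s = (s & m) + (s >> p)  # add the high bits onto the low p bits
        if s == m:
            s = 0
        s -= 2
        if s < 0:
            s += m  # keep the residue in the range 0 .. m-1
    return s


def double_check(p):
    """Return True if 2**p - 1 is prime, but only when both code paths agree."""
    first = residue_plain(p)
    second = residue_folded(p)
    if first != second:
        raise RuntimeError(f"residues disagree for p={p}; neither result can be trusted")
    return first == 0


if __name__ == "__main__":
    for p in (13, 17, 23, 31):
        print(p, double_check(p))
```

The useful property is that a mismatch is treated as "unknown" rather than as a verdict either way, which is the reason a bug in one implementation or one processor does not silently turn into a wrong result.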
As a single test, Prime95 checkpoints and tests intermediate results just like you suggest a long running program should.
In the case of this error, Prime95 detected the error and displayed an error message. If this had been a real test, it would have rolled back and retried a few times before failing the test. Since it was just a stress test, it displayed the error and stopped the selftest immediately, because the point of the selftest is to detect errors with a machine, not to actually test for primality.
I don’t know if encryption code will even hit the same units in a modern processor, given the encryption-specific instructions. It is interesting that hyper-threading seems to be necessary to trigger the bug. For consumer applications, I haven’t really seen that much of a need for hyper-threading, so my recommendation has mostly been to just go with an i5.
Let me guess… the end of overclocking non-K Skylake CPUs?
It is probably something that is unlikely to occur in most consumer code.
I may have read my own thoughts into the OP’s question, but I think they meant “Will Intel take this opportunity for a mandatory microcode patch to stick in something that combats the non-K model overclocking?”
That seems like a reasonable fear.
Intel Buglake
It is AMD’s fault. Intel is the most reliable company in the world, and Intel fanboys always trust Intel and NVidia stuff.
I do not care because I do not have money to buy the new stuff.
So no big deal as far as I am concerned.
That is one beautiful bug…
I was trying to find one in a vacuum tube 🙁
How about vacuum tubes that ARE bugs?
https://c2.staticflickr.com/4/3805/9278710144_9c6979daaf_b.jpg
nice
“complex workloads”
I am sorry, but there are no complex workloads.
You have either integer, float or bitwise workloads.
Complex = 2 floats.
Who’s to say it isn’t their FPUs that have the problem?
I’d like to see some server Skylake mobos do firmware updates!!
Just don’t run complex workloads on servers, LOL!