Accelerated Testing
Controller:
With the development and design work covered, let's get into the good stuff. The first point to bring up is that feedback is inherent in this process: the early validation steps may very well need to 'reach back' into earlier stages of the process in order to find solutions. Here's a good example:
Above is a bit error caused by a 'cosmic ray'. No, I'm not kidding. The earth is constantly bombarded by cosmic rays, many of them neutrons. Most are filtered out by the atmosphere, but some still reach us down here at ground level, and we produce plenty of radiation right here on earth as well. It doesn't just come from nuclear reactors; even the trace of radioactive potassium-40 in a banana decays inside your body (via beta emission rather than neutrons, to be fair). While neutrons carry no electrical charge themselves, they 'excite' atoms they happen to strike along their path. Excited atoms don't like staying that way for long, so they quickly decay back down to a stable state. Part of that decay comes in the form of an electron, which affects the charge of the surrounding atoms. If this event happens in a flash cell, the ECC mechanisms correct the error, and the same correction applies if a neutron happens to flip a bit in the SRAM. Despite this, there are still places within an SSD where flipping a random bit of data can cause issues. Most of these are within the controller itself, where a flipped bit can cause data to be misrouted, or not routed at all while the drive still reports to the host that it has been written (i.e. lost). In the worst cases, the controller might not be able to continue executing its firmware, resulting in a soft reboot or even a bricked device.
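To make the ECC side of this concrete, here is a minimal sketch of single-bit error correction using a Hamming(7,4) code. Real SSDs use far stronger codes (BCH or LDPC over much larger blocks), so this is purely illustrative of the principle: the parity bits pinpoint the position of a single flipped bit so it can be flipped back.

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def hamming74_correct(c):
    """Return (corrected codeword, 1-based error position or 0)."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3       # position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1              # flip it back
    return c, syndrome

word = hamming74_encode([1, 0, 1, 1])
hit = word[:]
hit[4] ^= 1                               # a 'neutron strike' at position 5
fixed, pos = hamming74_correct(hit)
assert fixed == word and pos == 5         # corrected, strike located
```

The syndrome arithmetic works because each parity bit covers a distinct subset of positions; a single flip produces a unique syndrome equal to its position.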
These cosmic ray events don't happen very often (we're talking billionths of a percent chance spread across thousands of devices), but they remain a possibility and do play into the design of the controller as a whole. Controllers tend to stick with the 'larger' lithography process nodes, so that the charge from a cosmic ray event has less of an effect on the overall voltage present at a given location. Extra checks are added to the firmware as a means of catching incorrect operations caused by flipped bits.
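The article doesn't detail what those firmware checks look like, but one common defensive technique in embedded firmware is to store a critical value alongside its bitwise complement and verify the pair on every read. This sketch is an assumption about the general approach, not Intel's actual implementation:

```python
class GuardedWord:
    """Hypothetical guarded variable: value stored with its inverted shadow copy."""

    MASK = 0xFFFFFFFF

    def __init__(self, value):
        self.value = value
        self.shadow = value ^ self.MASK    # bitwise complement

    def read(self):
        # Any single-bit flip in either copy breaks the complement relation.
        if (self.value ^ self.MASK) != self.shadow:
            raise RuntimeError("bit flip detected; fall back to safe state")
        return self.value

g = GuardedWord(0x1234)
g.value ^= 1 << 7          # simulate a neutron flipping one bit
detected = False
try:
    g.read()
except RuntimeError:
    detected = True        # the check caught the corruption before use
assert detected
```

The cost is a little extra SRAM and a few cycles per access, which is why such guards are typically reserved for state that could misroute data if corrupted.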
A failed SSD being analyzed on a test bench at Intel.
Now, with all of these corrections in place, and with the chances of a neutron flipping a bit so low, we can't exactly put hundreds of thousands of unreleased SSDs out in an open field in the hopes of seeing failures happen. The process needs to be accelerated. How, you might ask?
Just use an accelerator! Yes, Intel actually sends its prototype SSD controllers (among other things) out to Los Alamos to be bombarded with a neutron flux many orders of magnitude higher than what they will see in normal use. These are literally tests to the point of failure. Engineers then go back in and see what failed, how, and why. The results are again fed back into the design loop, and the process is repeated if necessary after firmware (or even hardware) corrections have been made.
Flash:
Accelerating the testing of flash memory in a modern SSD is a tricky proposition. Thanks to advanced wear leveling techniques, writing via the normal method, at full speed, can take months or even years before flash blocks wear to the point of noticeable failure. Tricks applied from the outside really don't work: 'short stroking' the SSD by writing to a smaller range of (external) LBA sectors does nothing, as the wear leveling algorithm will still spread those writes across the entire flash area (this is also why an SSD's random write performance improves with greater over-provisioning – there is more 'empty' flash to work with). Given the above, accelerating the wearout testing of flash requires a bit of a firmware tweak:
Now remember, we're trying to test the entire production unit here for any possible failures – not just flash failures. To accomplish this, Intel makes the smallest possible modification to the firmware, instructing it to address only a portion of each flash die within the SSD. All data channels are still used and all flash dies are still accessed, but the addressable area of each die is reduced to a fraction of its full surface. The diagram above depicts using the area at the edges of the dies, because this is where failures are more likely to occur (due to handling and packaging). This effectively gives the SSD a much smaller capacity, which means that writing at the same speed translates to accelerated wear on those focused areas. This is the same 'short stroking' mentioned above, but since it occurs at the die level, wear leveling is restricted to the same smaller area, and those smaller sections of flash can then be tested to failure within a reasonable amount of time (6 weeks in the example above).
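The arithmetic behind this acceleration can be sketched with a toy simulation. The numbers here are assumptions for illustration (a 3,000 P/E-cycle budget, 1,024 blocks per die, simple round-robin wear leveling), not Intel's actual parameters:

```python
ERASES_TO_WEAROUT = 3000   # assumed P/E cycle budget per block

def passes_until_wearout(usable_blocks, writes_per_pass):
    """Round-robin wear leveling confined to `usable_blocks` of the die.

    Returns how many write passes occur before any block hits its
    erase budget. Fewer usable blocks -> each one absorbs more wear.
    """
    erases = [0] * usable_blocks
    passes = 0
    nxt = 0
    while max(erases) < ERASES_TO_WEAROUT:
        for _ in range(writes_per_pass):
            erases[nxt] += 1
            nxt = (nxt + 1) % usable_blocks
        passes += 1
    return passes

full = passes_until_wearout(1024, 1024)        # whole die addressable
restricted = passes_until_wearout(128, 1024)   # firmware limits to 1/8 of die
assert full == 8 * restricted                  # 1/8 the blocks, 8x faster wearout
```

Restricting the addressable area to an eighth of the die makes the blocks wear out eight times sooner at the same write rate, which is exactly how a months-long wearout test compresses into a few weeks.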
great article, super interesting read. However, one thing is irking me. The line is “The proof is in the EATING of the pudding” – I know this is a stupid thing to bitch about, but that one just gets on my tits, y’know?
Yes, but in the case of banana pudding, it makes you temporarily radioactive 🙂
Wow! Impressive article. I love behind the scenes articles such as these. So much is involved in quality products.
Allyn Malventano,
Great article, but I am wondering…. How much would that wafer be roughly worth if that stiff arm worked?
All depends on the yield for that particular wafer. A bad run and it wouldn't be worth much at all. Also, to know the true result of my 'stiff arm' success, check out the podcast some time 🙂
Hi
The “X25-M performance degradation bug discovered and reported by PC Perspective” was not discovered by PC Perspective 😉
February 13, 2009: http://www.pcworld.fr/stockage/tests,ssd-intel-x25-m-80-go-une-bombe-problemes-rencontres-avec-le-ssd-intel,87201,87311.htm
September 8, 2008: http://www.hardware.fr/articles/731-6/supertalent-intel-performances-variables.html
In fact it didn't take 3 months to fix this problem but ... 8 months! 😉
Interesting. We hadn't seen that piece, and neither had Intel apparently, as in the communications about their first firmware update, they credited us with its discovery.
Excellent article Allyn!