TRIM Performance and Write Hitching
TRIM Performance:
For those unfamiliar with TRIM, it is the method by which an OS tells an SSD that specific areas no longer contain valid data. As an example, if you delete some files from your SSD, the OS removes those entries from the Master File Table and also issues TRIM commands covering the location where those files were stored. SSD performance increases with the number of free flash blocks available, so TRIM helps keep SSDs performing fast over time.
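To make that concrete, here is a minimal sketch (not part of our test suite, just an illustration) of the host side of the equation on Linux: the FITRIM ioctl, which is the same mechanism the fstrim utility uses, asks the kernel to send TRIM for all free space on a mounted filesystem. The ioctl number and struct layout below are the standard Linux values; the mount point is an example.

```python
import fcntl
import os
import struct

# _IOWR('X', 121, struct fstrim_range) -- the standard Linux FITRIM ioctl number
FITRIM = 0xC0185879

def trim_filesystem(mount_point: str) -> int:
    """Ask the kernel to TRIM all free space on the given mounted filesystem.

    Returns the number of bytes the kernel reports as trimmed.
    Requires root and a TRIM-capable device/filesystem.
    """
    # struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
    rng = struct.pack("QQQ", 0, 2**64 - 1, 0)
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        result = fcntl.ioctl(fd, FITRIM, rng)
        _, trimmed_bytes, _ = struct.unpack("QQQ", result)
        return trimmed_bytes
    finally:
        os.close(fd)

if __name__ == "__main__":
    print(f"Trimmed {trim_filesystem('/') / 2**30:.1f} GiB of free space")
```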
A working theory I had at the introduction of TRIM was that an SSD would take some time to do the necessary upkeep (block erasures, metadata table pruning, garbage collection, etc) after a TRIM command was issued. In my previous attempts to develop such a test, I found that SSD controllers and firmwares all handled TRIM operations in a way that did not interfere with subsequent reads or writes. The additional necessary operations appeared to be logged and handled as a background task, and could only be seen by watching SSD power consumption. I saw that some SSDs remained at an active power state for a few seconds or minutes after performing large TRIM operations (e.g. a quick format of an SSD previously filled with user data). Since the drives I tested at that time did not show any appreciable performance impact from those TRIM operations, I was left with essentially nothing to report, and the idea of a TRIM Performance Test died on the vine…
…but then I tested the Vector 180.
I saw some abnormal results in our PCPer File Copy Test, where performance *decreased* as drive capacity *increased*. This is the opposite of the normally expected trend, as larger SSDs typically fare better in this test. Digging further into the results, I noted that the file copy progress was coming to a standstill multiple times throughout the test. Here's an example:
The above sequence takes place between a set of large file copies, with a set of (~10GB total) files having been deleted just prior. As you can see, the 480GB Vector 180 stalls all operations, once for ~4 seconds, followed by another stall of just under 10 seconds. This effect appears proportional to capacity. The 240GB stalled for shorter periods of time (roughly half), while the 960GB easily went past the 10 second mark during this particular test. Those stalls add up, and pushed the total time to complete our file copy test up to abnormally high levels.
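This sort of stall is simple enough to reproduce outside of our File Copy Test. The sketch below is my own rough outline of the idea (not the actual PCPer test sequence): create and delete a large file so the filesystem issues TRIM for the freed space, then time small synchronous writes immediately afterward and watch for multi-second outliers. File names and sizes are placeholders.

```python
import os
import time

CHUNK = 64 * 1024 * 1024          # 64 MiB write chunk

def make_file(path: str, gib: int) -> None:
    """Write a large file of incompressible data and flush it to the drive."""
    buf = os.urandom(CHUNK)
    with open(path, "wb") as f:
        for _ in range(gib * 1024 // 64):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())

def probe_writes(path: str, probes: int = 30) -> list[float]:
    """Time a series of small flushed writes; return seconds per write."""
    times = []
    buf = b"\0" * 4096
    with open(path, "wb") as f:
        for _ in range(probes):
            t0 = time.perf_counter()
            f.write(buf)
            f.flush()
            os.fsync(f.fileno())        # force the write to reach the drive
            times.append(time.perf_counter() - t0)
            time.sleep(1)               # roughly one probe per second
    return times

if __name__ == "__main__":
    # Run this on the SSD under test, on a mount with the 'discard' option
    # (or follow the delete with fstrim) so the freed space actually gets TRIMed.
    make_file("victim.bin", gib=10)     # ~10 GiB of data to be trimmed
    os.remove("victim.bin")             # the delete frees (and TRIMs) that space
    for i, t in enumerate(probe_writes("probe.bin")):
        flag = "  <-- stall" if t > 1.0 else ""
        print(f"probe {i:2d}: {t*1000:7.1f} ms{flag}")
```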
An additional issue related to TRIM performance is the time taken to partition and quick format a previously filled SSD. After running a full HDTach pass (which writes the SSD completely), the 240GB model takes one minute, the 480GB two minutes, and the 960GB four minutes. Those don't seem like long periods of time, but four minutes to accomplish something that nearly every other SSD (and even HDD for that matter) accomplishes in a few seconds may be unacceptable to the enthusiasts this SSD is marketed towards. These stall times also scale proportionally with the amount of data deleted. As an example, deleting 20GB of files results in subsequent writes hanging for ~5 seconds (regardless of capacity). A rough 'TRIM rate' estimate of 4GB/sec seems to apply regardless of capacity, and any attempted writes will hang for that period of time, which is what makes this such a bad issue for the Vector 180 to present. Delete 100GB of files and all writes will stall for ~30 seconds. In cases where this is your OS drive, you may see other OS operations (launching apps, etc) hang during that same period of time.
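That rough 4GB/sec figure also lines up with the quick format times above, since a quick format of a previously filled drive effectively TRIMs the entire capacity. Here is a quick back-of-the-envelope check, using only the numbers reported on this page:

```python
# Rough 'TRIM rate' estimate from this page, applied to the reported cases.
TRIM_RATE_GB_PER_S = 4.0

cases = {
    "20GB of deleted files": 20,        # observed ~5 second hang
    "100GB of deleted files": 100,      # observed ~30 second hang
    "quick format, 240GB drive": 240,   # observed ~1 minute
    "quick format, 480GB drive": 480,   # observed ~2 minutes
    "quick format, 960GB drive": 960,   # observed ~4 minutes
}
for label, gb in cases.items():
    print(f"{label:27s} -> predicted stall ~{gb / TRIM_RATE_GB_PER_S:5.0f} s")
```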
Write Hitching:
In addition to the pauses after deletions, I noted some very odd behavior during sustained writes to the Vector 180. All three drives exhibited a periodic stutter during writes. This occurred regardless of the type of write (sequential, random, etc). It also occurred during the 30% write workload in our Iometer Database test, causing dips at the same point in that test sequence. Investigating further, I found that the pauses occur every 20 seconds. We noted this same periodicity on all drive capacities, but the same proportionality seen in our TRIM results also existed: pauses were shorter on the 240GB model, ~2x longer for the 480GB, and 2x longer again on the 960GB. Here is an example:
Note the reported write speed at the interval where this screenshot was taken. For at least one second, the Vector was writing at 0 MB/sec. Here is a similar look, but this time showing what a typical user would see while trying to write files to the SSD:
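For anyone wanting to see this on their own drive, a simple per-interval throughput logger is enough to make the hitching visible. Below is a minimal sketch (my own, not the benchmark behind the screenshots above) that writes sequentially and prints MB/sec roughly once per second; a hitch shows up as an interval at or near 0 MB/sec, or as one abnormally long interval, since a blocked write delays the report.

```python
import os
import time

def log_write_throughput(path: str, duration_s: int = 120,
                         chunk_mib: int = 16) -> None:
    """Write sequentially to `path` and print throughput roughly every second."""
    buf = os.urandom(chunk_mib * 1024 * 1024)
    written = 0
    start = time.perf_counter()
    interval_start = start
    with open(path, "wb", buffering=0) as f:     # unbuffered raw file I/O
        while time.perf_counter() - start < duration_s:
            f.write(buf)
            written += len(buf)
            now = time.perf_counter()
            if now - interval_start >= 1.0:
                mb_s = written / (now - interval_start) / 1e6
                print(f"{now - start:6.1f}s  {mb_s:8.1f} MB/s")
                written = 0
                interval_start = now

if __name__ == "__main__":
    # Point this at a file on the SSD under test.
    log_write_throughput("hitch_test.bin")
```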
We brought these issues to the attention of OCZ, and they issued us a new firmware (1.01) that helped improve TRIM and format speeds (the above results are the improved figures), but that firmware did not correct or improve the 20-second write hitches. When pressed further, OCZ issued a statement, which I will leave you with here:
“Thank you for your inquiry in regards to the I/O behavior on the Vector 180 Series. What is being observed is a characteristic of the design of the drive itself and is a result of the firmware performing updates to its metadata mapping table and flushing the entire table out of DRAM and onto the NAND flash, during which I/O throughput is impacted for very brief periods. Our metadata management is done on a frequent basis to prevent failure modes related to bricked drives as a result of metadata corruption, which can potentially happen on other non PFM+ enabled SSDs as a result of unexpected power loss. This is observed to a greater extent on the larger drives (960GB) where there is more metadata to manage. While this phenomenon is observable in synthetic benchmarks, there is virtually no impact to typical client grade end-user applications and during real world use. With the Barefoot 3 based Vector 180 Series design, we strove to deliver the optimal balance of performance and reliability for our valued enthusiast and workstation customers.”
I'm not sure I agree with OCZ's claim that these effects are not observable in typical client use. Copying files to an SSD is a fairly regular thing for enthusiasts to do, and many would have that copy running while performing other SSD-intensive tasks (which would also be delayed should they fall within one of those 20-second stalls).
Further, if this flushing is necessary as a part of their PFM+ technology, for OCZ to ensure their SSDs do not "brick", then I'm left wondering why all of the other SSD makers out there are able to avoid this sort of wholesale flushing and yet their SSDs do not fail when power is unexpectedly removed. If this type of metadata corruption on power loss were such a common occurrence, every power outage would see at least a handful of SSD-equipped desktops no longer able to boot. To me it appears that PFM+ is a band-aid for a design problem inherent in the Barefoot 3 M00, and that the only way to prevent the Vector 180 from bricking was to compromise by halting all writes every 20 seconds while it saves its metadata, a task that every other SSD maker manages to accomplish without interrupting writes from the host system.
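To illustrate the difference, here is a toy model with made-up numbers (purely illustrative; it does not represent OCZ's firmware or any other vendor's actual implementation): flushing an entire DRAM-resident mapping table takes time that scales with drive capacity, while journaling only the entries dirtied since the last checkpoint stays small no matter how large the drive is.

```python
# Toy model only -- all constants below are assumptions for illustration.
ENTRY_BYTES = 4                 # assumed bytes per mapping-table entry
PAGE_BYTES = 8 * 1024           # assumed NAND page size the table tracks
NAND_WRITE_MBPS = 400           # assumed sustained metadata write speed

def full_flush_seconds(capacity_gb: int) -> float:
    """Time to write the entire mapping table for a drive of this capacity."""
    entries = capacity_gb * 1_000_000_000 // PAGE_BYTES
    return entries * ENTRY_BYTES / (NAND_WRITE_MBPS * 1e6)

def journal_flush_seconds(dirty_entries: int) -> float:
    """Time to write only the entries changed since the last checkpoint."""
    return dirty_entries * ENTRY_BYTES / (NAND_WRITE_MBPS * 1e6)

for cap in (240, 480, 960):
    print(f"{cap:4d} GB: full table flush ~{full_flush_seconds(cap):5.2f} s, "
          f"journal of 100k dirty entries ~{journal_flush_seconds(100_000)*1000:5.1f} ms")
```

With these assumed numbers the full flush roughly doubles with each capacity step, which mirrors the pause scaling we measured, while an incremental journal stays in the millisecond range regardless of drive size.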
Man, Allyn, you really reestablished my faith in you. Historically, you have been very easy on OCZ SSDs, often giving them the benefit of the doubt on problems present at review time, assuming they would eventually be fixed by firmware updates and lower pricing. Just saying, with the conclusions you reached, this is a drive I will definitely be steering clear of.
thanks
As the saying goes: "Fool me once…"
“There’s an old saying in Tennessee — I know it’s in Texas, probably in Tennessee — that says, fool me once, shame on — shame on you. Fool me — you can’t get fooled again.”
There’s some genuine investigative reporting going on there in the fifth page of this review and it’s very refreshing. Nicely done Mr. Malventano.
In my view page 5 basically blows the lid off of OCZ and the reliability of their Barefoot controller. Despite reporting from most outlets, for years now drives based on this technology have suffered massive failure rates due to sudden power loss. Here we have definitive evidence of those flaws and the lengths OCZ is going to in order to work around them (note, I didn’t say ‘fix’ them).
The fact that they were willing to go to the extra cost of adding the power loss module in addition to crippling the sustained performance of their flagship drive in order to flush the cache out of DRAM speaks VOLUMES about how bad their reliability was before. You don’t go to such extreme, potentially kiss-of-death measures without a good boot up your ass pushing you headlong toward them. In this case said boot was constructed purely out of OCZ’s fear that releasing yet ANOTHER poorly constructed drive would finally put their reputation out of its misery for good and kill any chance at future sales.
OCZ has cornered themselves in a no win scenario:
1) They don’t bother making the drive reliable, and in doing so save the cost of the power loss module and keep the sustained speed of the Vector 180 high. The drive reviews well with no craters in performance, and the few customers OCZ has left buy another doomed Barefoot SSD that’s practically guaranteed to brick on them within a few months. As a result they lose those customers for good, along with their company.
or
2) They go to the cost of adding the power loss module and cripple the drive’s performance to ensure that the drive is reliable. The drive reviews horribly and no one buys it.
This is their position. Kiss of death indeed.
Ultimately, I think it speaks to how complicated controller development is, and that if you don’t have a huge company with millions in R&D funding at your disposal, it’s probably best if you don’t throw your hat into that ring. It’s a shame but it seems to be the way high tech works. (Global oligopoly, here we come.)
All things considered, it’s nice that this is finally all out in the open. Thanks Allyn.
Somehow I’m not surprised.
You tested the Vector 180 with the new 1.01 firmware, but was the Radeon R7 also updated to 1.01, as OCZ recommends? Does it show the same write hitching?
The R7 results in this piece were based on the initial firmware. We're going to take a closer look at all other M00 based drives (with updates applied) now that we've uncovered this behavior.
Beta testing on consumers. Fail products, fail company, fail fail fail! All computer parts should handle power failures the same way: without requiring a 2-3 week RMA. OCZ should not exist anymore.
>Blah-blah-blah walloftextsomethingsomething blah-blah-somethingwalloftext-blah-somethingsomethingwalloftext
>aaand…it’s crap.
You should have said so right away, d00ds.
“Write hitching”. I first saw that and all I could think of was the old stuttering JMicron controllers… on older OCZ drives no less. Bad memories.
Now you show compelling evidence on why you might want to flat out avoid drives with Barefoot controllers.
Love these in-depth articles. Awesome job as usual.
Even though JMicron was always slow as hell (and still is even these days), AT LEAST IT DIDN’T FAIL RIGHT OUT OF ITS ASS like that SandForce trash does all the time. JMicron’s stuff is slow, true, but it’s also one of the more reliable controllers out there.
As a data point, no JMicron controller we ever tested halted all writes for more than one second, and it certainly didn't do so every 20 seconds.
How long is it going to take to forget-
“Friends DON’T let friends OCZ”
Unless you’re buying their “extremely-rare-now-since-they’re-not-doing-them-anymore” PSUs, that is.