Latency Weighted Percentile – Intro and Individual Results

Intro

Our exclusive Latency Distribution / Latency Percentile testing was a long time in the making and was first introduced in my 950 Pro review (longer explanation at that link). To put it briefly, what contributes most to the 'feel' of storage device speed is its latency. Simple average and maximum latencies don't paint nearly the full picture when it comes to the true performance of a given SSD. Stutters of only a few IO's out of the thousands delivered per second can simply be 'lost in the average', even if that average is plotted every second. The only true solution is to track the latency of each and every IO, no small feat when the fastest SSDs can deliver hundreds of thousands (or millions) of IO's over the course of a run.

Latency Distribution (V1)

Here the data has been converted into what is essentially a spectrum analyzer for IO latency. The more IO's completing at lower latencies (towards the left of the 'spectrum'), the better. While this view is handy for seeing exactly where latencies fall for a given device, the results are generally hard to read and digest, so the data is further translated into a percentile:

Latency Percentile Plot (V1)

For those unfamiliar with this plot, the ideal result is a vertical line as far to the left as possible. Real world storage devices under load will tend to slant or slope, and some will 'turn' prior to hitting 100%, indicating that some of the IO's are taking longer (the point where the line curves back upwards indicates the latency of those remaining IO's).
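For the curious, here is a minimal Python sketch (not our actual tooling) of how raw per-IO latencies could be bucketed into that distribution view and then rolled up into a V1-style percentile curve. The sample data, bucket count, and figures below are purely hypothetical:

```python
import numpy as np

def latency_distribution(latencies_us, bins=200):
    """Bucket per-IO latencies (microseconds) into a histogram,
    the 'spectrum analyzer' view."""
    counts, edges = np.histogram(latencies_us, bins=bins)
    return counts, edges

def io_percentile_curve(latencies_us):
    """V1-style percentile: the fraction of all IOs that completed
    at or below each observed latency."""
    lat = np.sort(latencies_us)
    pct = np.arange(1, len(lat) + 1) / len(lat) * 100.0
    return lat, pct

# Hypothetical run: 10,000 IOs near 100 us plus a handful of slow outliers.
rng = np.random.default_rng(0)
lats = np.concatenate([rng.normal(100, 10, 9990), rng.normal(5000, 500, 10)])

counts, edges = latency_distribution(lats)
i = counts.argmax()
print(f"busiest bucket: {edges[i]:.0f}-{edges[i + 1]:.0f} us")

lat, pct = io_percentile_curve(lats)
print(f"99.9th percentile latency: {np.interp(99.9, pct, lat):.0f} us")
```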

This new testing has come a long way since it was first introduced. The most recent and significant change corrects a glaring issue common to all IO percentile plots, caused by a bad assumption similar to the one that comes with using averages. V1 Percentiles were calculated from the percentage of total IOs, in line with what the rest of the industry has settled on. You might have seen enterprise SSD ratings claiming 99.99th percentile latency figures (or some other variation, e.g. 99.9% / 99.999%). As an example, a 99.99th percentile rating of 6ms would mean that 99.99% of all IOs completed in <= 6ms.

There is a flaw inherent in the above rating method. Using the 99.99% <= 6ms example above, imagine an SSD that completely stalled for one second in the middle of a 6-second run. For the other five seconds of the test, it performed at 200k IOPS. The resulting data would reflect one million total IO's and (assuming QD=1) a single IO taking a full second. The average IOPS would still be a decent 167k, but that nasty stutter would be diluted, effectively 'lost in the average'. The same goes for 99.99% ("four nines") latency, which would miss that single IO. Despite hanging the entire system for 17% of the run, that single IO would not get caught unless you calculated out to 99.9999% ("six nines"), which nobody rates for.

The industry has settled on calculating this way mainly out of necessity, given the limits of most latency measurement tools. Most tools employ a coarse bucket scheme, meaning 99.99% values must be interpolated. Fortunately, our data gathering technique gives us far greater resolution into the data, meaning not only can we minimize interpolation, we can do something previously impossible: get away from IO-based percentages entirely. Doing so means we must time-weight our Latency Percentile results by summing not just the IO's, but the time those IOs took to complete. When calculated this way, our hypothetical example above would show low latency only up to the 83% mark, where its result would ride that 83% line all the way to the one-second mark on the plot. With these percentiles now based on total time and not the unweighted sum of the IO's, we can more easily identify those intermittent stalls.
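To make the difference concrete, here is a hedged Python sketch (again, not our actual analysis code) that runs both calculations over the hypothetical stalled drive described above: one million IOs at 5 microseconds each (five seconds at 200k IOPS, QD=1) plus a single IO that takes a full second:

```python
import numpy as np

def time_weighted_percentile(latencies_s):
    """V2-style percentile: each IO is weighted by its own completion
    time, so the curve reflects the share of total time spent waiting."""
    lat = np.sort(latencies_s)
    pct = np.cumsum(lat) / lat.sum() * 100.0
    return lat, pct

# Hypothetical drive from the text: 5 s at 200k IOPS (5 us per IO at QD=1)
# followed by a single one-second stall.
lats = np.full(1_000_001, 5e-6)
lats[-1] = 1.0

# IO-count (V1) view: the stall is invisible at "four nines".
print(f"99.99th percentile (IO-count): {np.percentile(lats, 99.99) * 1e6:.1f} us")

# Time-weighted (V2) view: the fast IOs only account for ~83% of the run,
# so the curve rides the 83% line out to the one-second mark.
lat, pct = time_weighted_percentile(lats)
print(f"share of time at 5 us: {pct[-2]:.1f}%")
```

Under the time-weighted view, that single stalled IO owns the last ~17% of the curve, exactly the share of the run it wasted.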

Latency Percentile – Time Weighted (V2)

I've created the above with the new weighted method, using the same source data as the earlier V1 plot. That data came from reads, which don't suffer from the same inconsistent latencies seen in SSD writes. Even with more consistent results, we can see a difference in the plotted data. The RevoDrive 350 (red line) doesn't quite make it past 99% as quickly as it did in the V1 plot, and some of the faster SSDs taper off a bit earlier as well. The three HDDs also saw an impact, as longer seeks take up more of the total time of the run. I'm going to change gears and get into the results now, but I will revisit this at the end of the drive-to-drive comparisons on the next page and show just how much of a difference the weighting made on writes.

Latency Percentile – Individual Results

The workload chosen for these tests consists of completely filling an SSD with sequential data, then applying 4k random writes to an 8GB span. This is not the same as 'full span writes', which is more of an enterprise workload. Instead, we are emulating more of a consumer type of workload, where only 8GB of the drive is randomly written (typically by system log files, registry, MFT, directory structure, etc). The following is a random read Queue Depth sweep (1-32) of that same area, which tests how quickly the OS would retrieve those previously written directory structures and registry files.

Reads

I was going to use zoomed scales for all of these read runs, but I could not do so with the HyperX Predator. Its QD32 run had an interesting oddity: 111 out of the 11.5 million total IO's in that run (0.001%) fell at the 1/2 second mark (500ms). That is a significant issue, but if we were only charting unweighted IO's here, that plot line would have climbed to 99.999% way back at the 0.4ms mark, camouflaging the issue. Our weighting shows a line riding at 98% out to 0.5s, which is accurate, as ~2% of the time spent on this test was servicing those IO's.
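As a rough sanity check on that weighting, only the figures quoted above are real; the ~0.25ms average latency assumed for the remaining IOs below is a stand-in for illustration. Those 111 half-second IOs contribute 55.5 seconds of summed service time, which works out to roughly 2% of the run:

```python
# Sanity check using the figures quoted above. The 0.25 ms average latency
# assumed for the remaining IOs is hypothetical, for illustration only.
slow_ios, slow_lat_s = 111, 0.5
total_ios = 11_500_000
assumed_avg_lat_s = 0.25e-3

slow_time = slow_ios * slow_lat_s                       # 55.5 s
fast_time = (total_ios - slow_ios) * assumed_avg_lat_s  # ~2875 s
print(f"share of IO count:     {slow_ios / total_ios:.4%}")
print(f"share of service time: {slow_time / (slow_time + fast_time):.1%}")
```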

All SSDs prior to this point are NVMe-based. The following two are AHCI-based. Note that the QD=1 latency shifts to the right a bit due to the added overhead inherent to that protocol. This will be more apparent in the direct comparisons done on the next page.


Now onto what this same 8GB span looked like during writes:

Writes

While reads were not much to write home about (pun intended), writes are a completely different story when it comes to SSDs. A keen eye will note that each controller / flash pairing appears to have its own unique write signature, and those signatures differ significantly from each other. This is the case even for SSDs that turn in similar IOPS.

The RD400 showed an odd 'plateau' at the ~63% mark, which extended out all the way into ~40ms territory. Here is what that looked like in Task Manager:

Despite that issue, it did 'ramp up' to that 63% mark much more cleanly than the 950 Pro (below), enabling it to turn in higher average IOPS figures. Remember: roughly 1/3 of the time spent writing was waiting on IO's that were taking nearly 50ms to complete (HDD territory).

The 950 Pro is a solid performer in writes, but keep on reading – it is not the clear winner here.

The Kingston HyperX Predator performs fairly well here, but tail latencies do stretch out past 10ms.

The Intel SSD 750 is just a powerhouse when it comes to random write performance. This SSD is rooted in enterprise applications, where Intel was *extremely* gung-ho about keeping write latency as low and as consistent as possible, and they clearly succeeded, as demonstrated here. QD=1 latency is so low that I almost need to expand the low end of the chart!

There's no easy way to say this. The M6e is a hot mess when it comes to our latency testing. The IOPS figures are low as it is, and the latency profiles clearly demonstrate why.

Now look at that. This is *not* an NVMe SSD. It is a *SATA* SSD. Just look at how consistent that thing is! The throughput is almost immediately bottlenecked by its much slower interface, which explains the clean and even shifts to the right from QD=4 up. IOPS also remains identical starting at that same QD=4. The reason the line can shift to the right yet result in the same IOPS is that each doubling of QD means there are double the IO's "in flight", and each IO must then wait twice as long to be serviced. The log scale is why those shifts appear evenly spaced. As a descendant of a long line of SATA controllers, the 850 EVO is clearly the epitome of consistency optimization here, despite the slower bus.
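That relationship is just Little's Law at work: once the interface pins IOPS, average latency is simply queue depth divided by IOPS. A quick hedged sketch (the 90k IOPS cap below is a placeholder, not a measured figure for this drive):

```python
# Little's Law under saturation: mean latency = queue depth / IOPS.
# The 90,000 IOPS cap is a hypothetical placeholder, not a measured result.
saturated_iops = 90_000
for qd in (4, 8, 16, 32):
    latency_ms = qd / saturated_iops * 1000
    print(f"QD={qd:2d}: ~{latency_ms:.2f} ms per IO")
```

Each doubling of QD doubles the per-IO latency, which on a log-scaled latency axis shows up as those evenly spaced shifts to the right.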
