Latency Distribution and Latency Percentile
For a very long time now, I have hated the idea of plotting average latencies for SSDs. The reason is that you could have an SSD with a great average while a group of its IOs suffer horribly high latencies. You’d think the answer is to simply use the maximum latency figures instead, but that unfairly biases the results against an SSD that had *just one* IO run long during the test (which could be nothing more than an unlucky context switch on the test host system itself). The only way to properly solve this problem is to start tracking every IO during the test.
Iometer 1.1.0 default latency bins
Intel took a crack at this by adding ‘latency bins’ to Iometer. A new build was posted just before the March 2014 press event held at Folsom, where we were also briefed on these updates. Latency bins are ranges of latency into which each IO is ‘sorted’. Intel’s purpose for this Iometer change was to help demonstrate the performance consistency of their SSD 730 Series. This added some granularity to Iometer’s output (we were no longer stuck with just average and maximum latency), but the problem was that the bins were necessarily very coarse. They were just good enough to demonstrate that Intel’s SSDs were not throwing excessively long IOs when pushed past saturation, but that was about all they were good for. Adding more buckets easily overloads Iometer’s bin sorting routine (*every* IO latency must be sorted as it comes in, and you can’t add a bunch of code to a loop that may execute over 200,000 times per second). With each bin covering such a wide range of latencies, two different SSDs could have all of their IOs fall into the same bucket. Worse, the second SSD’s results might land just on the other side of the line, falling into the next bin and appearing far worse than the first. So we definitely need more bins, or some other way of doing things. If we can increase the resolution of the capture, the resulting data can be used to build a clean histogram, and we can then plot the latency-specific performance of a given storage device.
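To give a concrete sense of how cheap per-IO sorting can stay even at hundreds of thousands of IOs per second, here is a minimal sketch in Python. The bin layout, range, and names are my own assumptions for illustration, not Iometer’s (or our capture tool’s) actual code; the point is that assigning a fine-grained logarithmic bin is one logarithm and a clamp, constant work per IO:

```python
import math

# Assumed layout: 50 bins per decade across six decades (10us to 10s).
BINS_PER_DECADE = 50
MIN_LATENCY_S = 1e-5            # left edge of the capture range (10 microseconds)
NUM_DECADES = 6
NUM_BINS = BINS_PER_DECADE * NUM_DECADES   # 300 bins across the range

histogram = [0] * (NUM_BINS + 1)

def record_io(latency_s):
    """Sort one completed IO into its log-scaled latency bin in O(1)."""
    # log10 of the ratio to the range floor, scaled to bins per decade
    idx = int(BINS_PER_DECADE * math.log10(latency_s / MIN_LATENCY_S))
    idx = max(0, min(idx, NUM_BINS))       # clamp outliers into the edge bins
    histogram[idx] += 1

record_io(0.000123)   # a 123us IO lands in its own fine-grained bin
```

Because the per-IO cost never grows with the bin count, the resolution can be raised far beyond Iometer’s ~20 bins without bogging down the hot loop.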
Since the Folsom event, I’ve been working out a better way to get what I wanted. No tool out there could do it, so I would just have to roll my own. The only way around the coarse-bin issue was to create a capture system that could give us effectively infinite resolution on the IO latencies pouring in from the devices under test. Let’s start with an example of that output:
Latency Distribution on linear vertical scale (click to enlarge)
The X axis above represents the Latency Distribution. The scale is logarithmic, spreading latencies across six decades (every major mark is 10x greater than the previous), making the marks 10µs, 100µs, 1ms, 10ms, 100ms, 1s, and 10s. This type of data would normally be presented as a histogram bar chart, but we have sufficient resolution (50 bins per decade) that we can plot the data as an unsmoothed line. The 50/decade figure was simply chosen to make the plotting job easier on Excel, but it is more than sufficient for our purposes here, and significantly higher and more evenly spread than the ~20 total bins provided in the new Iometer. The resolution chosen for the above chart represents more than 300 bins!
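For the curious, here is the inverse mapping, i.e. how a bin index converts back to a latency for plotting on the X axis. The 10µs floor and 50 bins/decade are my reading of the chart, not a published spec:

```python
# Sketch of the X-axis mapping: bin index -> latency, assuming 50 bins
# per decade with the axis starting at 10 microseconds.
BINS_PER_DECADE = 50
MIN_LATENCY_S = 1e-5   # 10 microseconds, the left edge of the axis

def bin_to_latency(idx):
    """Latency (in seconds) at the left edge of bin `idx`."""
    return MIN_LATENCY_S * 10 ** (idx / BINS_PER_DECADE)

# Six decades at 50 bins each spans the full axis:
# bin 0 -> 10us, bin 50 -> 100us, bin 300 -> ~10s
```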
The vertical scale represents the number of IOs that fall at a particular latency (per second of the test). The IOPS of a storage device is equivalent to the area under its curve. Showing this axis linearly makes more sense, but sometimes we must shift to log scale when including devices with relatively low IOPS next to others with very high IOPS:
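The “area under the curve” idea in concrete terms: if the histogram holds one second’s worth of IO completions, summing every bin count recovers the IOPS figure. The bin counts below are invented toy numbers, purely for illustration:

```python
# Toy histogram: one second's worth of IO completions sorted into 300
# latency bins (numbers are made up for illustration).
histogram = [0] * 300
histogram[120] = 20000   # a tall, tight peak: most IOs near one latency
histogram[121] = 15000
histogram[250] = 3       # a few stragglers far out on the tail

iops = sum(histogram)    # area under the distribution curve
print(iops)              # 35003
```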
Latency Distribution on logarithmic vertical scale (click to enlarge)
With log scale, the three HDD results which were previously stuck on the axis line can now be seen. The thing to keep in mind when looking at the heights on the log scale is that the higher parts of the peak are more significant than the lower parts when figuring the latency of the majority of IOs. Don’t rule out the lower parts entirely though (why this is important will be seen below). We shouldn’t dwell on the Latency Distribution, as the real benefit in obtaining the above results is the much clearer picture you can derive from them:
Latency Percentile (click to enlarge)
If the Latency Distribution was overwhelming, this Latency Percentile should make things a bit clearer. The plot lines represent the running area under the curve of the previous plot (normalized to 100%). Each line climbs from 0% up to 100% as it accounts for every IO and its respective latency. It makes the latency profiles of these devices painfully clear, but it is important to remember that the lines do not represent or indicate the IOPS of the device. I have included IOPS as part of the legend to help keep things in perspective.
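A sketch of how the Percentile plot falls out of the Distribution data: a running (cumulative) sum of the bin counts, rescaled so the final point lands at 100%. The bin counts below are invented for illustration:

```python
# Toy histogram of IOs per latency bin (invented numbers).
histogram = [0, 10, 30, 40, 15, 5]

total = sum(histogram)
percentile = []
running = 0
for count in histogram:
    running += count
    percentile.append(100.0 * running / total)

print(percentile)   # [0.0, 10.0, 40.0, 80.0, 95.0, 100.0]
```

Each entry answers “what fraction of all IOs completed at this latency or faster?”, which is exactly what the percentile curves above show.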
Here is the breakdown of the results, starting with the slowest:
- The three HDDs are obviously the slowest of the bunch here. Latencies range anywhere from 0.01s (10ms) to nearly a second. The Latency Percentile lets us see the clear distinction between the three disk speeds (5400 vs 7200 vs 10k RPM). Spinning the disk faster shifts the curve to the left.
- The trusty old SATA G.Skill FlashSSD is actually a rebranded first-gen Samsung SLC SSD. It saturates far earlier than QD32, and it is very slow, but as we can see by the near-vertical line that almost looks like one of the gridlines, man is that thing consistent. I’ve been using these as the OS drives in our storage testbeds for just this reason. Note that this SSD gives us ~30x the IOPS of the HDDs, but the faster SSDs here turn in 10-30x greater IOPS at QD=32.
- Next up is the RevoDrive 350. Why is this monster of a PCIe SSD in the list *behind* a pair of *SATA* SSDs? It’s not so much the VCA controller’s fault as it is the very long IO pipeline of the SandForce controllers it is pushing. Finally we are able to see just how much higher latency SandForce (even a RAID of them, in this case) runs compared to other SATA SSDs.
- Next is the Kingston HyperX Predator, which is extremely close to the result of the Intel SSD 730 (see the zoomed version below to more easily see the spread).
- Next is the Intel SSD 730. To keep the scale in check here, we are now at 1/10th the latency of the FlashSSD and 1/100th to 1/1000th that of the HDDs! The 730 was great when it launched, and can still outperform the previous PCIe SSDs in this list, but it is now eclipsed by:
- The Samsung 850 PRO, which outmaneuvers the SSD 730 thanks to its faster controller and faster V-NAND flash.
- The two 950 PRO capacities turned in similar IOPS and very close average latencies, but with the help of our new data we are able to tell where the differences lie. We see that the first 55% of IOs track nearly identically, but then the smaller capacity starts to taper off. This is likely because the 256GB model has half the die count of the 512GB model. With 32 IOs stacked up in the queue, the model with fewer dies has a greater chance of some IOs piling up behind a given die, which means that some IOs will have to wait just a little longer to be serviced. This leads to a longer taper towards 100%. If you go back and look at the first chart, you may now be able to pick out the difference there as well.
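The die-count argument can be sketched as a toy queueing experiment: scatter 32 queued IOs across the dies at random and look at how deep the most-loaded die gets. The die counts used here are illustrative guesses, not Samsung’s actual configurations:

```python
import random

random.seed(42)   # repeatable toy experiment

def worst_pileup(num_dies, qd=32, trials=10000):
    """Average depth of the most-loaded die with `qd` IOs queued at random."""
    total = 0
    for _ in range(trials):
        load = [0] * num_dies
        for _ in range(qd):
            load[random.randrange(num_dies)] += 1
        total += max(load)
    return total / trials

# Fewer dies -> deeper worst-case pile-ups -> a longer tail toward 100%.
print(worst_pileup(16))   # hypothetical "smaller capacity" die count
print(worst_pileup(32))   # hypothetical "larger capacity" die count
```

The drive with fewer dies consistently shows a deeper worst-case queue behind a single die, which is exactly the longer taper visible in the percentile curve.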
Here is a final chart expanding out the faster SSDs:
Latency Percentile – Zoomed (click to enlarge)
Here are the devices tested, laid out in order of performance:
We have a lot more of this data to comb through (varying queue depths, percentages of drive fill, etc.), and in future reviews we will be shifting away from the off-the-shelf benchmarks and more towards our fully custom solutions and results. The above results were from fully preconditioned and randomly written SSDs, but future consumer pieces will incorporate partially filled / fragmented SSDs. QD32 100% read was chosen as a workload representing heavy consumer-level random reads (the boot process of a fully loaded system, heavy content loads in games, simultaneous app launching, etc.). Some of these SSDs can scale higher at greater queue depths, but that is unrealistic even for power user machines. Feedback on the above is welcome in the comments and will be taken into consideration as I further develop this testing.