Mixed Burst R/W Throughput, Load Times, and Latency Percentile
In an attempt to better represent the true performance of hybrid (SLC+TLC) SSDs and to include some general trace-style testing, I’m trying out a new test methodology. First, all tested SSDs are sequentially filled to 100%. Then the first 8GB span is pre-conditioned with a 4KB random write workload, resulting in the condition called for in many of Intel’s client SSD testing guides. The idea is that most of the data on an SSD is sequential in nature (installed applications, MP3, video, etc), while some portions of the SSD have been written to in a random fashion (MFT, directory structure, log file updates, other randomly written files, etc). The 8GB figure is reasonably practical since 4KB random writes across the whole drive are not a workload that client SSDs are optimized for (that is reserved for enterprise). We may try larger spans in the future, but for now we’re sticking with the 8GB random write area.
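As a rough illustration of the pre-conditioning step, the sketch below generates 4KB-aligned random write offsets confined to the first 8GB of a drive. The function name, the fixed seed, and the reading of 8GB as 8 GiB are assumptions made for illustration; this is not the tool actually used for these tests.

```python
import random

BLOCK_SIZE = 4 * 1024         # 4KB random writes
SPAN_BYTES = 8 * 1024**3      # first 8GB of the drive (assumed to mean 8 GiB)

def random_write_offsets(count, seed=0):
    """Return `count` 4KB-aligned byte offsets inside the first 8GB span."""
    rng = random.Random(seed)                  # fixed seed so a run can be repeated
    blocks_in_span = SPAN_BYTES // BLOCK_SIZE  # number of 4KB slots in the span
    return [rng.randrange(blocks_in_span) * BLOCK_SIZE for _ in range(count)]

# Print a handful of offsets; every one is a multiple of 4096 and below 8 GiB.
print(random_write_offsets(5))
```

Everything beyond the 8GB span stays in its sequentially filled state, which is what separates this condition from a full-drive random fill.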
Using that condition as a base for our workload, we now needed a workload! I wanted to start with some background activity, so I captured a BitTorrent download:
This download was over a saturated 300 Mbit link. While the average download speed was reported as 30 MB/s, the application’s own internal caching meant the writes to disk were more ‘bursty’ in nature. We’re trying to adapt this workload to one that will allow SLC+TLC (caching) SSDs some time to unload their cache between write bursts, so I came to a simple pattern of 40 MB written every 2 seconds. These accesses are more random than sequential, so we will apply it to the designated 8GB span of our pre-conditioned SSD.
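To get a feel for how much breathing room this pattern gives a caching SSD, here is a minimal sketch that splits each 2-second period into busy and idle time for a given write speed. The speeds in the loop are hypothetical inputs chosen for illustration, not measured results, and the helper name is my own.

```python
BURST_MB = 40      # data written per burst
PERIOD_S = 2.0     # one burst every 2 seconds

def burst_timing(write_speed_mb_s):
    """Return (busy_seconds, idle_seconds) for one 2-second period.

    write_speed_mb_s is a hypothetical sustained write speed for the drive.
    """
    busy = BURST_MB / write_speed_mb_s
    idle = max(0.0, PERIOD_S - busy)   # no idle time if the drive cannot keep up
    return busy, idle

for speed in (100, 400, 800):          # illustrative speeds only
    busy, idle = burst_timing(speed)
    print(f"{speed} MB/s: busy {busy:.2f} s, idle {idle:.2f} s")
```

A drive that absorbs each burst quickly spends most of the period idle, and that idle window is when an SLC cache can be flushed to TLC.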
Now for the more important part. Since the above ‘download workload’ is a background task that would likely go unnoticed by the user, we also need a workload that the user *would* be sensitive to. The times when someone really notices their SSD speed are when they are waiting for it to complete a task, and the most common tasks are application and game/level loads. I observed a round of different tasks and came to a 200MB figure for the typical amount of data requested when launching a modern application. Larger games can pull in as much as 2GB (or more), varying with game and level, so we will repeat the 200MB request 10 times during the recorded portion of the run. We will assume 64KB sequential access for this portion of the workload.
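The read side of the workload boils down to simple arithmetic, sketched below. Treating 200MB as 200 MiB and the `start_offset` parameter are assumptions for illustration; the text above does not say where on the drive each burst begins.

```python
REQUEST_BYTES = 64 * 1024       # 64KB sequential requests
BURST_BYTES = 200 * 1024**2     # 200MB per launch (assumed to mean 200 MiB)
BURSTS = 10                     # repeated 10 times during the recorded run

def read_burst_offsets(start_offset):
    """Yield the byte offset of each 64KB request in one sequential burst.

    start_offset is a hypothetical starting position on the drive.
    """
    for i in range(BURST_BYTES // REQUEST_BYTES):
        yield start_offset + i * REQUEST_BYTES

requests_per_burst = BURST_BYTES // REQUEST_BYTES
total_bytes = BURSTS * BURST_BYTES
print(requests_per_burst, "requests per burst,", total_bytes // 1024**2, "MB in total")
```

Under these assumptions each simulated launch is a few thousand back-to-back 64KB reads, and the ten bursts together add up to roughly the 2GB figure quoted for larger games.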
Assuming a max Queue Depth of 4 (reasonable for typical desktop apps), we end up with something that looks like this when applied to a couple of SSDs:
The OCZ Trion 150 (left) is able to keep up with the writes (dashed line) throughout the 60 seconds pictured, but note that the read requests occasionally catch it off guard. Apparently if some SSDs are busy with a relatively small stream of incoming writes, read performance can suffer, which is exactly the sort of thing we are looking for here.
When we applied the same workload to the 4TB 850 EVO (right), we see an extremely consistent and speedy response to all IOs, regardless of whether they are writes or reads. The 200MB read bursts are so fast that they all occur within the same second, and none of them spill over due to other delays caused by the simultaneous writes taking place.
Now that we have a reasonably practical workload, let’s see what happens when we run it on a small batch of SSDs:
From our Latency Percentile data, we are able to derive the total service time for both reads and writes, and independently show the throughputs seen for both. Remember that these workloads are being applied simultaneously, so as to simulate launching apps or games during a 30 MB/s download. The above figures are not simple averages; they represent only the speed *during* each burst. Idle time is not counted.
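One way such a burst-only throughput figure could be derived is sketched below. This is an assumed method, not necessarily the one behind the chart: it sums per-IO latency from a histogram and divides by the queue depth, which only holds if the queue stays full during a burst. The histogram values are invented for the example.

```python
def busy_seconds(latency_histogram, queue_depth=4):
    """Estimate wall-clock busy time from (latency_seconds, io_count) buckets.

    Assumption: the queue stays full, so `queue_depth` IOs overlap and the
    summed per-IO latency is divided by the queue depth.
    """
    total_latency = sum(latency * count for latency, count in latency_histogram)
    return total_latency / queue_depth

def burst_throughput_mb_s(total_mb, latency_histogram, queue_depth=4):
    """Throughput counted only while IOs are in flight (idle time excluded)."""
    return total_mb / busy_seconds(latency_histogram, queue_depth)

# Invented histogram: 3200 reads of 64KB, most at 0.5 ms and a few at 2 ms.
example = [(0.0005, 3000), (0.002, 200)]
print(round(burst_throughput_mb_s(200, example), 1), "MB/s during the burst")
```

Because the idle gaps between bursts never enter the sum, a drive is scored on how fast it services requests once they arrive, not on how much of the run it spent waiting for work.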
Looking at the chart, we can see dips in write performance for the smallest capacity 840 and 750 EVO parts, but note how those lower speeds also result in a lower *read* speed. This is due to the SSD having to work harder to handle the incoming writes, so there is less time available to deal with the parallel read requests. The MX300 shows good write performance but reads are still hindered by the mixed workload. I included a 500GB 850 EVO (second entry) as a comparison point to the new 4TB model (top entry), and we can see nearly identical results from both parts here. The 500GB model was the capacity point where write speeds remained at SATA saturation even once the cache was filled. Good to see the 4TB model having no issue handling so much more capacity under this mixed workload.
The bottom two entries are fire-breathing NVMe parts, and while they are indeed faster than SATA parts, here we have a (simulated) real-world workload showing that they are not as speedy as one might think. The Intel SSD 750 is an absolute monster at writes, gobbling those 40 MB bursts at a rate of over 800 MB/s (less than 0.05 seconds each!), but even with all of that extra free time to handle the reads, Intel’s 18-channel controller doesn’t match the Samsung 950 Pro, which demonstrates far superior low queue depth read performance.
Now we are going to focus only on reads, and present some different data. I’ve added up the total service time seen during the 10x 200MB reads that take place during the recorded portion of the test. These figures represent how long you would be sitting there waiting for 2GB of data to be read, but remember this is happening while a download (or another similar background task) is simultaneously writing to the SSD. The 4TB 850 EVO turns in the quickest SATA time of all compared units. Most other similar units run mid-pack at around 5 seconds. The 120GB units must work harder with fewer dies, and they turn in times in the mid 6’s, while the MX300 brings up the rear. The NVMe parts clearly shine, but take care to notice that the SSD 750 is not even twice as fast as the fastest SATA part. The 950 Pro takes the crown, outmaneuvering Intel’s ‘consumerized’ enterprise SSD, but it is still only 2.4x faster than the 850 EVOs.
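To relate these wait times back to throughput, a small sketch follows. The 5-second input matches the "around 5 seconds" mid-pack figure above; the pair of times passed to `speedup` is hypothetical and only shows how a ratio like 2.4x comes about. Function names are my own.

```python
TOTAL_READ_MB = 10 * 200    # ten 200MB read bursts

def effective_read_mb_s(total_service_seconds):
    """Average read speed over the time actually spent servicing reads."""
    return TOTAL_READ_MB / total_service_seconds

def speedup(slower_seconds, faster_seconds):
    """How many times faster the second drive finished the same reads."""
    return slower_seconds / faster_seconds

print(effective_read_mb_s(5.0))   # 400.0 MB/s for a mid-pack SATA drive
print(speedup(4.8, 2.0))          # hypothetical times that give a 2.4x ratio
```

Seen this way, a mid-pack SATA drive sits well under the interface limit while also absorbing the background writes, which is the gap this test is meant to expose.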
For those curious (and who enjoy reading subway maps), here is the Latency Percentile data that the above charts were derived from: