High Resolution Quality of Service (QoS) 4KB Random
Required reading (some additional context for those unfamiliar with our Percentile testing):
- Introduction of Latency Distribution / Latency Percentile (now called IO Percentile)
- Introduction of Latency Weighted Percentile (now called Latency Percentile)
- Intro to PACED workloads – 'It's not how fast you go, it's how well you go fast!'
I'd considered laying out the typical Latency Percentile and IO Percentile data before going into the QoS, but honestly, it's just a slaughter across the board, so I'll cut straight to the chase:
Quality of Service (QoS)
QoS is specified in percentages (99.9%, 99.99%, 99.999%) and is spoken of in 'nines' ('three nines', 'four nines', 'five nines'). It corresponds to the latency at which 99.x% of all recorded IOs in a run have completed. Enterprise IT managers and system builders care about the varying levels of 9's because long tail latencies can lead to timeouts for time-sensitive operations, and adding 9's is how they quantify more stringent QoS requirements. Note that these comparative results are derived from IO Percentile data and *not* from Latency Percentile data.
If you have a hard time wrapping your head around the 9's thing, it may be easier to flip things around and think about it from the standpoint of the remaining longest-latency IOs that haven't been accounted for as the plot progresses. As an example, the 99.9% line near the center of the vertical axis represents the top 10% of the top 1% (0.1%) of all recorded IOs, where 'top' means the IOs with the longest latencies.
These plots are tricky to make, as they effectively use an inverse log scale. Each major increment up from the zero axis covers 90% of the IOs remaining beyond the previous increment, and the next increment covers 90% of what is still left after that, making it an asymptotic scale that will never reach 100%. The plots below essentially take the top portion of the IO Percentile results and spread them out, exponentially zooming in on the results as they approach 100%.
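For those who want to check the math themselves, here is a minimal Python sketch (using numpy on made-up latency samples, not data from this review) showing how the 'nines' figures fall out of a pile of raw per-IO latencies, and how the vertical axis of these plots can be read as a count of nines:

```python
import numpy as np

# Synthetic example only: one million fake IO latencies in microseconds,
# drawn from a log-normal just to get a long tail (not real drive data).
rng = np.random.default_rng(42)
latencies_us = rng.lognormal(mean=np.log(9.0), sigma=0.35, size=1_000_000)

# 'Three nines' = 99.9%, 'four nines' = 99.99%, 'five nines' = 99.999%.
for nines in (3, 4, 5):
    pct = 100.0 * (1.0 - 10.0 ** -nines)      # 99.9, 99.99, 99.999
    lat = np.percentile(latencies_us, pct)
    print(f"{pct:.3f}% ({nines} nines): {lat:.1f} us")

# The vertical axis of the QoS plots can be read as the number of nines,
# i.e. -log10(1 - percentile/100), which is why each major increment
# zooms in on 90% of what remains and the scale never reaches 100%.
```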
Read
Note that we have shifted the scale here to extend down to 1 microsecond, as the P4800X rides the 10us figure throughout these tests. Ideally we want the QoS plot to be a vertical line, and this is an extremely impressive result. I didn't take these out to QD=256 since the P4800X saturates by QD=16 in all workloads; further plot lines would simply shift further to the right.
Here is an easier-to-read numerical chart plotting the exact points where the QoS plot crosses the various latencies. Note the 50% mark (upper left), where the P4800X comes in at a median (50th percentile) latency of less than 10us!
Alright, I've plotted QD=1, 2, and 4 for the P4800X (blue), P3700 (green), and 9100 MAX (gold). Remember this is a log scale, so a competing product coming in a full major increment to the right is 10x slower.
70/30 mix
With reads making up the majority of this mix, the P4800X results are nearly identical to 100% read, while the competing products take a right turn into even longer latencies due to the increased write demand. The P4800X doesn't seem to care in the least about the added writes and continues to dominate.
Write
Now we see some elbows in the plot. Latencies are still great overall, but clearly the controller is doing some extra work, likely to provide wear leveling, etc.
Figures are still well within spec, though average (typical) latency has crept just over 10us. 100% writes is not a 'typical' workload, so I consider Intel's "Typical: <10us" claim to fall more into the 70/30 bracket covered earlier.
Finally someone comes to the party! Well, sorta. The 9100 MAX was able to beat the P4800X when pushing into the higher consistency metrics, but take note of the legend – it is only doing it at less than half of the overall IOPS (because the majority of its IOs are at a much higher latency).
One more comparison before we move on. Intel showed us a nifty QoS comparison between the P4800X and the P3700:
This chart ramps up IOPS while showing how QoS responds along the way. Where have I seen that before??? It's like those IOs are PACED or something 🙂
I've kept Intel's colors but added the Micron 9100 MAX (gold). Micron can reach higher IOPS loading at 70/30 before it saturates, but the P4800X's maximum (99.999%) latency remains lower than the averages of the 9100 and P3700, while the P4800X's average latency is a full order of magnitude (10x) lower. I've added a few labels on the average plot lines to denote the QD associated with each product at that level of load. The NAND products have to push into virtually unattainable queue depths to reach performance levels that the P4800X simply breezes through.
Thanks for the review (pre-consumer) of Optane, which I had been waiting on for a while now. First none and now two, one on another site that I respect. Big thanks for the latency graphing from 1 clock cycle to a floppy drive. Very informative, and something I was wondering about after getting the picture of Intel placing the idea that it could be a go-between for storage and DIMMs. You test at very high queue depths but seem to state that such testing for a web server is not the best idea. Isn't it true that a web server is the only place where high queue depths are to be seen? If so, and queue depths normally seen are much lower, where would one expect to see such high queue depths – or is it, as you seem to say, just a test for testing's sake?
Thanks for the article. I will have to wait for you to test again when you get one in your hands; you'll likely find that consumers are at the door of another exponential shift, like the one where SSDs became boot drives when the price came down. We will more than likely start placing our OSes on Optane drives in our SSD systems to gain additional quickness.
When they become available, PCPer “must” see what it will take to boot a computer in a second with an Optane boot drive. Ten seconds is possible with an SSD. Nuff said.
Regarding high QDs, there are some rare instances, and it is possible that a web server could be hitting the flash so hard that it reaches such a high QD, but if that happens I'd argue that the person spec'ing out that machine did not give it nearly enough RAM cache (and go figure, this new tech can actually help there as well, since it can supplement / increase effective RAM capacity if configured appropriately).
Regarding why I'm still testing NVMe parts to QD=256, it's mostly due to NVMe NAND part specs for some products stretching out that high. I have to at least match the workloads and depths that appear in product specs in order to confirm / verify performance to those levels.
I'm glad you saw benefit in the 'bridging the gap' charts. Fortunately, my floppy drive still works and I was able to find a good disk! :) I had to go through three Zip disks before finding one without the 'click of death'!
Holy smokes!
Hey great work here A, as usual.
Ditto that, Allyn: you are THE BEST!
> In the future, a properly tuned driver could easily yield results matching our ‘poll’ figures but without the excessive CPU overhead incurred by our current method of constantly asking the device for an answer.
Allyn,
The question that arose for me from your statement above is this: with so many multi-core CPUs proliferating, would it help at all if a sysadmin could “lock” one or more cores to the task of processing the driver for this device? The OS would then effectively “quarantine”, i.e. isolate, that dedicated core from scheduling any other normally executing tasks. Each modern core also has large integrated caches, e.g. L2 cache. As such, it occurred to me that the driver for this device would migrate its way into the L2 cache of such a “dedicated” core and help reduce overall latency. Is this worth consideration, or am I out to lunch here?
Again, G-R-E-A-T review.
Locking a core to storage purposes would sort of help, except you would then have to communicate across cores with each request, which may just be robbing Peter to pay Paul. The best solution is likely a hybrid between polling and IRQ, or polling that has waits pre-tuned to the device to minimize needlessly spinning the core. Server builders will likely not want to waste so many resources constantly polling the storage anyway, so the more efficient the better here.
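To make the 'hybrid' idea a bit more concrete, here is a toy Python sketch (purely illustrative, and not how any actual NVMe driver is written): the device is simulated with a timer, and the waiter sleeps for a pre-tuned chunk of the expected completion time before it starts spinning, so the core only burns cycles on the short, uncertain tail:

```python
import threading
import time

def hybrid_wait(completion: threading.Event, tuned_sleep_s: float) -> None:
    """Sleep through the predictable part of the IO, then spin-poll the rest.

    Conceptual sketch only: a real driver would poll the device's completion
    queue in kernel space, and tuned_sleep_s would come from per-device
    characterization rather than a hard-coded guess.
    """
    time.sleep(tuned_sleep_s)        # yield the core for most of the expected latency
    while not completion.is_set():   # spin only for the remaining tail
        pass

# Simulate a device that completes an IO after roughly 10 microseconds.
done = threading.Event()
threading.Timer(10e-6, done.set).start()
hybrid_wait(done, tuned_sleep_s=7e-6)
print("IO complete")
```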
for example:
http://www.tech-recipes.com/rx/37272/set-a-programs-affinity-in-windows-7-for-better-performance/
“Whether you want to squeak out some extra Windows 7 performance on your multi-core processor or run older programs flawlessly, you can set programs to run on certain cores in your processor. In certain situations this process can dramatically speed up your computer’s performance.”
I did some experimentation with setting affinity on the server, and I was able to get latency improvements similar to polling, but there were other consequences, such as not being able to reach the same IOPS levels per thread (typical IO requests can be processed by the kernel faster if the various related processes are allowed to span multiple threads). There's room for improvement here, but it's not as simple as an affinity tweak, is all.
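For anyone wanting to try this kind of pinning themselves, here is a minimal sketch using the third-party psutil package (my choice for illustration; Task Manager or any other affinity tool accomplishes the same thing) to restrict the current process to a single core:

```python
# pip install psutil
import psutil

target_core = psutil.cpu_count(logical=True) - 1    # pick the last logical CPU
p = psutil.Process()                                 # the current process
print("allowed cores before:", p.cpu_affinity())
p.cpu_affinity([target_core])                        # pin scheduling to that one core
print("allowed cores after: ", p.cpu_affinity())
```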
PCPER.com announces the first ever CLONE AUCTION:
This auction will offer exact CLONES of Allyn Malventano,
complete with his entire computing experience intact.
Minimum starting bid is $1M USD. CASH ONLY.
Truly, Allyn, you are a treasure to the entire PC community.
THANKS!
I’m glad there was at least one comparison with the 960 Pro, which is the most interesting graph in the article. I just wish there were more comparisons.
Your additional answers are coming soon!
Speaking of comparisons, I am now very curious to know if Intel plans to develop an M.2 Optane SSD that uses all x4 PCIe 3.0 lanes instead of x2 PCIe 3.0 lanes.
Also, we need to take out a life insurance policy on Allyn, because we want him around to do his expert comparisons when the 2.5″ U.2 Optane SSD becomes available.
If Intel ultimately commits to manufacturing Optane in all of the following form factors, we should expect it to be nothing short of disruptive (pricing aside, for now):
(a) AIC (add-in card)
(b) M.2 NVMe
(c) U.2 2.5″
(d) DIMM
I would love to know that a modern OS can be hosted by the P4800X and all future successors!
PCIe 4.0 here we go!
Hello, Allyn!
Could you tell me how you managed to tweak FIO to perform polling for the Optane P4800X under Windows?
I've only read how to do it under Linux.
Thanks a lot in advance!
Regards,
Nick