Saturated IOPS Performance – 4KB, 8KB Random, 128K Sequential
I'm carrying over the IOPS vs. % Read charts from my P3608 review. The IOPS vs. Latency plots also used in that review have been superseded by the far superior percentile method (on the next page). With sweeps of the R/W mix in 10% increments and all queue depths covered, there's a lot of data on each chart, so here I have listed the charts sequentially but matched the scales of each pair for easier A/B comparison.
Note that since we are plotting a Read/Write percentage spread, we no longer need to include other specific workloads (OLTP, database, etc.), as those workloads are included as part of the charts below. For reference, here is the IO distribution of typical purpose-specific workloads:
- Database / OLTP: 8KB 67/33 (or 70/30)
- Email Server: 8KB 50/50
- File Server: 80/20 of the following:
- 10% 512B, 5% 1KB, 5% 2KB *
- 60% 4KB, 2% 8KB, 4% 16KB, 4% 32KB, 10% 64KB
- Web Server: 100/0 (read only) of the following:
- 22% 512B, 15% 1KB, 8% 2KB *
- 23% 4KB, 15% 8KB, 2% 16KB, 6% 32KB, 7% 64KB, 1% 128KB, 1% 512KB
* We have discontinued the File Server and Web Server tests currently used by many other sites, as they employ legacy workloads that are 16 years old (yes, dating to the year 2000) and are simply no longer representative of modern technology. Specifically, modern enterprise SSDs are no longer optimized for <4KB random transfers, yet the outdated Web Server workload applies nearly half (45%) of its IOs at those 'wrong' sizes. While that makes for an interesting spread in the results, showing artificial penalties for SSDs optimized for 4KB, those results are just no longer meaningful in modern-day enterprise use.
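As a sanity check on that 45% figure, here is a short Python sketch over the Web Server distribution listed above (the dictionary below simply encodes the percentages from the list; the average IO size it prints is derived, not a quoted spec):

```python
# IO size distribution of the legacy (year-2000) Web Server workload
# listed above, as {size_in_bytes: percent_of_IOs}
web_server = {
    512: 22, 1024: 15, 2048: 8,
    4096: 23, 8192: 15, 16384: 2,
    32768: 6, 65536: 7, 131072: 1, 524288: 1,
}

# fraction of IOs issued at sub-4KB sizes (the 'wrong' sizes above)
sub_4k = sum(pct for size, pct in web_server.items() if size < 4096)

# weighted average IO size for the mix
avg_bytes = sum(size * pct / 100 for size, pct in web_server.items())

print(f"{sub_4k}% of IOs are below 4KB")        # → 45% of IOs are below 4KB
print(f"average IO size: {avg_bytes:.0f} bytes")
```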
4KB Random
Alright, starting out with 4KB random performance, we see a very linear ramp up to some very impressive numbers. Anything over QD=8 turns into that saturation blob along the top line. I've never seen anything ramp up IOPS so quickly. Let's put this in a bit more perspective by adding in the Intel P3700 and Micron 9100 MAX:
Holy crap! The P4800X just walks all over the P3700 (green) and 9100 MAX (gold) at these lower queue depths – even its QD=1 performance is higher than the other two at QD=4, nearly across the board! Let's look out to the longer queue depths to see if they can catch up:
Ok, so singling out reads, writes, and a 70/30 mix, only the Micron 9100 MAX is able to beat the P4800X in 4KB random performance, but in order to do so it must operate at a QD of nearly 128 to reach the same level seen by the P4800X at 1/10th the Queue Depth!
8KB Random
8KB random performance is very much the same story as it was with 4KB, the only exception being the P3700 gaining a tad more ground but still falling short overall.
128KB Sequential
I've marked the 'meager' 2GB/s / 2.2GB/s figures obtained from the P4800X specification leak, ghosted here as I'm not considering them final specs (and we were not provided the specification for this review). Note that the P4800X takes just a single step at QD=1 before reaching its saturation throughput at QD=2. Insanity! (The second data line you see is actually all of the other QD results overlapped.)
Even though Micron's 9100 MAX can reach higher sequentials, it requires very high queue depths to do so. While the P4800X may not climb as high as the others, it gets there at the lower queue depths, which is where it really counts.
Thanks for the review (pre-consumer) of Optane, which I had been waiting on for a while now. First none, and now two – one on another site that I respect. Big thanks for the latency graphing from 1 clock cycle to a floppy drive. Very informative, and something I was wondering about after getting the picture of Intel placing the idea that it could be a go-between for storage and DIMMs. You test at very high queue depths but seem to state that some testing for a web server is not the best idea. Isn't it true that a web server is the only place where high queue depths are to be seen? If so, and queue depths normally seen are much lower, where would one expect to see such high queue depths – or is it as you seem to say, it's just a test to test?
Thanks for the article. I will have to wait for you to test again when you get one in your hands, and likely find that consumers are at the door of another exponential shift like the one where SSDs were used as boot drives when the price came down. We will more than likely start placing our OSes on Optane drives in our SSD systems to gain additional quickness.
When they become available, PCPer "must" see what it will take to boot a computer in a second with an Optane boot drive. With an SSD, 10 seconds is possible. Nuff said.
Regarding high QDs, there are some rare instances, and it is possible that a web server could be hitting the flash so hard that it hits such a high QD, but if that happens I'd argue that the person speccing out that machine did not give it nearly enough RAM cache (and go figure, this new tech can actually help there as well, since it can supplement / increase effective RAM capacity if configured appropriately).
Regarding why I'm still testing NVMe parts to QD=256, it's mostly due to NVMe NAND part specs for some products stretching out that high. I have to at least match the workloads and depths that appear in product specs in order to confirm / verify performance to those levels.
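To make the 'match the product spec' point concrete, here is a minimal fio jobfile sketch for a spec-level QD=256 4KB random read run (the device path, runtime, and job name are placeholder assumptions, not the exact configuration used in this review):

```ini
; hypothetical fio jobfile approximating a QD=256 spec-level test
[global]
ioengine=libaio        ; Linux asynchronous IO engine
direct=1               ; bypass the page cache
runtime=60
time_based

[qd256-randread]
filename=/dev/nvme0n1  ; placeholder device path
rw=randread
bs=4k
iodepth=256            ; match the queue depth quoted in NAND product specs
numjobs=1
```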
I'm glad you saw benefit in the bridging the gap charts. Fortunately, my floppy drive still works and I was able to find a good disk! :). I had to go through three zip disks before finding one without the 'click of death'!
Holy smokes!
Hey great work here A, as usual.
Ditto that, Allyn: you are THE BEST!
> In the future, a properly tuned driver could easily yield results matching our ‘poll’ figures but without the excessive CPU overhead incurred by our current method of constantly asking the device for an answer.
Allyn,
The question that arose for me from your statement above is this: with so many multi-core CPUs proliferating, would it help at all if a sysadmin could "lock" one or more cores to the task of processing the driver for this device? The OS would then effectively "quarantine", i.e. isolate, that dedicated core from scheduling any other normally executing tasks. Each modern core also has large integrated caches, e.g. L2 cache. As such, it occurred to me that the driver for this device would migrate its way into the L2 cache of such a "dedicated" core and help reduce overall latency.

Is this worth consideration, or am I out to lunch here?
Again, G-R-E-A-T review.
Locking a core to storage purposes would sort of help, except you would then have to communicate across cores with each request, which may just be robbing Peter to pay Paul. The best solution is likely a hybrid between polling and IRQ, or polling that has waits pre-tuned to the device to minimize needlessly spinning the core. Server builders will likely not want to waste so many resources constantly polling the storage anyway, so the more efficient the better here.
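That hybrid idea can be sketched in a few lines of Python (hypothetical names; a simulated completion event stands in for the device, and the spin window is the value you would pre-tune per device):

```python
import threading
import time

def hybrid_wait(done: threading.Event, spin_window_s: float) -> bool:
    """Busy-poll for roughly the device's expected completion latency,
    then fall back to a blocking, interrupt-style wait."""
    deadline = time.monotonic() + spin_window_s
    while time.monotonic() < deadline:   # polling phase: lowest latency
        if done.is_set():
            return True                  # completed inside the tuned window
    return done.wait(timeout=1.0)        # IRQ-style fallback: core goes idle

# Simulate an IO that completes after ~20 microseconds
done = threading.Event()
threading.Timer(0.00002, done.set).start()
print(hybrid_wait(done, spin_window_s=0.0001))
```

The tuning trade-off is exactly the one described above: a spin window sized to the device's typical latency catches most completions without an interrupt, while anything slower falls through to the blocking path instead of needlessly spinning the core.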
for example:
http://www.tech-recipes.com/rx/37272/set-a-programs-affinity-in-windows-7-for-better-performance/
“Whether you want to squeak out some extra Windows 7 performance on your multi-core processor or run older programs flawlessly, you can set programs to run on certain cores in your processor. In certain situations this process can dramatically speed up your computer’s performance.”
I did some experimentation with setting of affinity on the server, and I was able to get latency improvements similar to polling, but there were other consequences such as not being able to reach the same IOPS levels per thread (typical IO requests can be processed by the kernel faster if the various related processes are allowed to span multiple threads). Room for improvement here but not as simple as an affinity tweak is all.
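For readers wanting to try a basic version of that affinity experiment themselves, pinning a process to one core is a one-liner on Linux (`sched_setaffinity` is Linux-only; on Windows you would use Task Manager or the method in the article linked above):

```python
import os

def pin_to_core(core: int) -> set:
    """Pin the calling process to a single CPU core (Linux-only)."""
    os.sched_setaffinity(0, {core})   # 0 = the current process
    return os.sched_getaffinity(0)    # read back the effective mask

if hasattr(os, "sched_setaffinity"):  # guard for non-Linux platforms
    print(pin_to_core(0))             # → {0}
```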
PCPER.com announces the first ever CLONE AUCTION:
This auction will offer exact CLONES of Allyn Malventano,
complete with his entire computing experience intact.
Minimum starting bid is $1M USD. CASH ONLY.
Truly, Allyn, you are a treasure to the entire PC community.
THANKS!
I’m glad there was at least one comparison with the 960 pro, which is the most interesting graph in the article. I just wish there were more comparisons.
Your additional answers are coming soon!
Speaking of comparisons, I am now very curious to know if Intel plans to develop an M.2 Optane SSD that uses all x4 PCIe 3.0 lanes instead of x2 PCIe 3.0 lanes.
Also, we need to take out a life insurance policy on Allyn, because we want him around to do his expert comparisons when the 2.5″ U.2 Optane SSD becomes available.
If Intel ultimately commits to manufacturing Optane in all of the following form factors, we should expect it to be nothing short of disruptive (pricing aside, for now):
(a) AIC (add-in card)
(b) M.2 NVMe
(c) U.2 2.5″
(d) DIMM
I would love to know that a modern OS can be hosted by the P4800X and all future successors!
PCIe 4.0 here we go!
Hello, Allyn!
Could you tell me, how did you manage to tweak FIO to perform polling for Optane P4800X under Windows?
I've read how to do it under Linux only.
Thanks a lot in advance!
Regards,
Nick