Bridging the Gap and Polling vs. IRQ
Bridging the Gap
XPoint sits in the middle of this 'gap', but the gap is way larger than this slide demonstrates!
Some other charts I put together for this piece try to visualize just what a 10x reduction in latency means to computing as a whole. First, let's start with where CPU and RAM sat in relation to older formats, up to and including spinning rust:
The various caches are on the left (starting with a single CPU clock tick!), and spinning media sits on the right. Yes, this storage nut actually busted out a Zip drive and a floppy drive just to give you fine folks a point of reference for moving to HDDs of various speeds (all three are also on this chart). Note that HUGE gap in latency between RAM and the fastest possible HDD? That, my friends, is the pain we all used to have to endure. 100,000x of pain.
I've now added in NAND SSDs. SATA (dark grey) has been around for the bulk of NAND flash's maturation, which is why we see that area starting 10x quicker than HDDs but stretching to 100x as the technology matured. PCIe NVMe parts (brown) are a bit quicker, but the gains are not huge here because, at the end of the day, those parts contain the same NAND chips that still take some time to respond to requests. Even with the newest fire-breathing NAND SSD, we still have a 1,000x latency gap to RAM.
The P4800X puts us on that dark blue line, shifting yet another 10x and closing the gap even further than the Samsung 960 PRO that made up the brown line preceding it.
For a simple estimation of where typical storage API calls through the Windows kernel can take us today, I ran a test on the fastest RAM disk software I could find. That got us nearly another 10x and fully bridges the gap. This is around the latency we should be able to see from an XPoint DIMM, if not even quicker once operating systems are better equipped to handle such fast NV storage.
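For anyone curious to ballpark that kind of number themselves, here is a minimal sketch of such a measurement: a single unbuffered 4KB read timed with QueryPerformanceCounter. The R:\testfile.bin path is just a placeholder for wherever your RAM disk and a test file (at least 4KB) live, and a real run would loop and average many thousands of reads rather than trusting one sample:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Page-aligned buffer to satisfy FILE_FLAG_NO_BUFFERING alignment rules. */
    void *buf = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!buf) return 1;

    /* Placeholder path: point this at a file on your RAM disk. */
    HANDLE h = CreateFileA("R:\\testfile.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE) { puts("open failed"); return 1; }

    LARGE_INTEGER freq, t0, t1;
    DWORD bytes;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    ReadFile(h, buf, 4096, &bytes, NULL);   /* one synchronous 4KB IO */
    QueryPerformanceCounter(&t1);

    printf("4KB read: %.2f us\n",
           (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart);
    CloseHandle(h);
    return 0;
}
```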
Polling vs. IRQ
Alright, time to come clean. As it turns out, Windows is not able to reach 10us latencies when performing IO requests via the typical method. This is not specific to Windows, as XPoint devices have driven both Intel and Micron to release 'poll mode' drivers for Linux in order to help things on that side of the fence. The catch to polling the device will be intimately familiar to anyone who used to deal with PIO (Programmed I/O) vs. DMA (Direct Memory Access) modes in the early Windows days. As with polling, PIO accesses the device directly and asks (repeatedly) if data is ready, which, as you can imagine, is murder on the CPU thread performing the request.

DMA was born to solve this issue and was a blessing to multi-threaded systems. Using DMA, devices could place ready data into RAM directly and then issue an interrupt request (IRQ) to inform the CPU that the data was ready. The CPU was free to do other things (or sit idle) until it was interrupted by the request completion, saving on resources but adding a small bit of latency to the tail end of each request.
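To make that trade-off concrete, here is a minimal, portable C model of the two completion styles. Nothing here is real NVMe machinery: a second thread stands in for the device, a shared flag for the completion entry, and a condition variable for the "interrupt."

```c
/* Toy model of polled vs. interrupt-driven IO completion.
 * Compile with: cc -pthread poll_vs_irq.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int done;              /* stands in for a completion queue entry */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  irq = PTHREAD_COND_INITIALIZER;

/* The "device": finishes the IO after a while, then raises the "IRQ". */
static void *device(void *arg)
{
    usleep(10);                        /* pretend media access time */
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_signal(&irq);         /* the interrupt */
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main(void)
{
    pthread_t dev;

    /* Polled completion: spin on the flag. Lowest latency, but this core
     * does nothing else while it waits (akin to the old PIO mode). */
    done = 0;
    pthread_create(&dev, NULL, device, NULL);
    while (!done)
        ;                              /* burn cycles until data is ready */
    pthread_join(dev, NULL);
    puts("polled completion");

    /* Interrupt-driven completion: sleep until signaled. The core is free
     * for other work, but the wakeup and context switch add latency to
     * the tail of every request. */
    done = 0;
    pthread_create(&dev, NULL, device, NULL);
    pthread_mutex_lock(&mtx);
    while (!done)
        pthread_cond_wait(&irq, &mtx);
    pthread_mutex_unlock(&mtx);
    pthread_join(dev, NULL);
    puts("interrupt-driven completion");
    return 0;
}
```

The polled path sees the completion the instant the flag flips, at the cost of a core spinning the whole time; the interrupt path frees the core but pays for the wakeup on every single IO.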
Unfortunately, with Optane, that 'small bit of latency' becomes very significant and consumes a fair percentage of the total latency of each request. Remember that chart I showed at the beginning of this article?
Notice how that red 'software' bar was so large? Here's what that looks like in practice:
Reads
Writes
The primary spot this issue hits the P4800X is in very low QD requests, and while it does hurt performance significantly, pushing latencies beyond the rated '<10us' spec, it is still a very fast product. Consistency was only minimally impacted, but the IRQ servicing added another 4-5us to each IO. Who would have thought it would take something like this to shine a huge light on how long Windows takes to context switch a CPU thread and service a storage-related interrupt?
Do note that the 'poll' results obtained for this article were still using the Windows kernel and Microsoft 'InBox' NVMe driver, but our IO completion routine was altered in such a way as to avoid interrupt requests from being generated during those requests. In the future, a properly tuned driver could easily yield results matching our 'poll' figures but without the excessive CPU overhead incurred by our modified method of constantly asking the device for an answer.
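That driver-level change does not fit in a short listing, but a user-mode analogue of the same idea looks like the sketch below: instead of sleeping on the overlapped event, the thread repeatedly asks whether the IO has finished. The C:\test.bin path is a placeholder (any file of at least 4KB works), and note that under the hood the kernel is still completing this IO via an interrupt; the sketch only illustrates the poll-versus-wait decision from the requesting thread's side:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Page-aligned buffer to satisfy FILE_FLAG_NO_BUFFERING. */
    void *buf = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    HANDLE h = CreateFileA("C:\\test.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) { puts("open failed"); return 1; }

    DWORD bytes;
    ReadFile(h, buf, 4096, NULL, &ov);   /* returns while the IO is in flight */

    /* Busy-check instead of sleeping: bWait = FALSE means "just tell me". */
    while (!GetOverlappedResult(h, &ov, &bytes, FALSE)) {
        if (GetLastError() != ERROR_IO_INCOMPLETE) { puts("IO error"); return 1; }
        /* Spin here; the thread never sleeps, so there is no wakeup
         * latency. The conventional path would instead call
         * WaitForSingleObject(ov.hEvent, INFINITE), which sleeps the
         * thread and pays the context switch on wake. */
    }
    printf("read %lu bytes\n", bytes);
    CloseHandle(h);
    CloseHandle(ov.hEvent);
    return 0;
}
```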
Thanks for the review (pre-consumer) of Optane, which I had been waiting on for a while now. First none and now two, one on another site that I respect. Big thanks for the latency graphing from 1 clock cycle to a floppy drive. Very informative, and something I was wondering about after getting the picture of Intel positioning it as a go-between for storage and DIMMs. You test at very high queue depths but seem to state that some testing for a web server is not the best idea. Isn't it true that a web server is the only place where high queue depths are to be seen? If so, and queue depths normally seen are much lower, where would one expect to see such high queue depths – or is it as you seem to say, it's just a test to test?
Thanks for the article. I will have to wait for you to test again when you get one in your hands, and likely find that consumers are at the door of another exponential shift like the one where SSDs started being used as boot drives when the price came down. We will more than likely start placing our OSes on Optane drives in our SSD systems to gain additional quickness.
When they become available, PCPer “must” see what it will take to boot a computer in a second with an Optane boot drive. SSD makes 10 seconds possible. Nuff said.
Regarding high QDs, there are some rare instances, and it is possible that a web server could be hitting the flash so hard that it reaches such a high QD, but if that happens I'd argue that the person speccing out that machine did not give it nearly enough RAM cache (and go figure, this new tech can actually help there as well, since it can supplement / increase effective RAM capacity if configured appropriately).
Regarding why I'm still testing NVMe parts to QD=256, it's mostly due to NVMe NAND part specs for some products stretching out that high. I have to at least match the workloads and depths that appear in product specs in order to confirm / verify performance to those levels.
I'm glad you saw benefit in the bridging the gap charts. Fortunately, my floppy drive still works and I was able to find a good disk! :) I had to go through three Zip disks before finding one without the 'click of death'!
Holy smokes!
Hey great work here A, as usual.
Ditto that, Allyn: you are THE BEST!
> In the future, a properly tuned driver could easily yield results matching our ‘poll’ figures but without the excessive CPU overhead incurred by our current method of constantly asking the device for an answer.
Allyn,
The question that arose for me from your statement above is this: with so many multi-core CPUs proliferating, would it help at all if a sysadmin could “lock” one or more cores to the task of processing the driver for this device? The OS would then effectively “quarantine” (i.e. isolate) that dedicated core from scheduling any other normally executing tasks.

Each modern core also has large integrated caches, e.g. L2 cache. As such, it occurred to me that the driver for this device would migrate its way into the L2 cache of such a “dedicated” core and help reduce overall latency.

Is this worth consideration, or am I out to lunch here?
Again, G-R-E-A-T review.
Locking a core to storage purposes would sort of help, except you would then have to communicate across cores with each request, which may just be robbing Peter to pay Paul. The best solution is likely a hybrid between polling and IRQ, or polling that has waits pre-tuned to the device to minimize needlessly spinning the core. Server builders will likely not want to waste so many resources constantly polling the storage anyway, so the more efficient the better here.
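As a rough sketch of that hybrid idea (every name here is a hypothetical helper, and the 3/4 split is an arbitrary tuning choice, not anything from a shipping driver): sleep through most of the window the device is known to need, then busy-poll only the tail.

```c
#include <stdbool.h>

/* Hypothetical helpers for illustration: io_done() checks the completion
 * queue, sleep_us() yields the core back to the scheduler, cpu_relax()
 * is a pause hint (e.g. _mm_pause() on x86). */
extern bool io_done(void);
extern void sleep_us(unsigned us);
extern void cpu_relax(void);

/* Hybrid completion wait, tuned to a device that reliably answers in
 * about expected_us microseconds. */
void hybrid_wait(unsigned expected_us)
{
    /* Phase 1: the IO cannot be done yet, so give the core away. */
    sleep_us(expected_us * 3 / 4);

    /* Phase 2: busy-poll the remaining ~25% of the window, so the
     * completion is seen immediately with no IRQ or wakeup latency. */
    while (!io_done())
        cpu_relax();
}
```

Linux's 'hybrid' NVMe polling works along these lines, sleeping for a fraction of the mean completion time before it begins spinning.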
for
for example:
http://www.tech-recipes.com/rx/37272/set-a-programs-affinity-in-windows-7-for-better-performance/
“Whether you want to squeak out some extra Windows 7 performance on your multi-core processor or run older programs flawlessly, you can set programs to run on certain cores in your processor. In certain situations this process can dramatically speed up your computer’s performance.”
I did some experimentation with setting affinity on the server, and I was able to get latency improvements similar to polling, but there were other consequences, such as not being able to reach the same IOPS levels per thread (typical IO requests can be processed by the kernel faster if the various related processes are allowed to span multiple threads). There's room for improvement here, but it's not as simple as an affinity tweak, is all.
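For readers who want to try that affinity experiment themselves, the relevant Win32 call is below; the choice of core 2 is arbitrary:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Pin the current thread to core 2 (bit 2 of the mask). The call
     * returns the previous affinity mask, or 0 on failure. */
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 2);
    if (prev == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    /* Issue IO from here. Per the caveat above, kernel completion work
     * can still land on other cores, so pinning alone recovers only part
     * of the latency and can cap per-thread IOPS. */
    return 0;
}
```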
PCPER.com announces the first ever CLONE AUCTION:
This auction will offer exact CLONES of Allyn Malventano,
complete with his entire computing experience intact.
Minimum starting bid is $1M USD. CASH ONLY.
Truly, Allyn, you are a treasure to the entire PC community.
THANKS!
I’m glad there was at least one comparison with the 960 pro, which is the most interesting graph in the article. I just wish there were more comparisons.
Your additional answers are coming soon!
Speaking of comparisons, I am now very curious to know if Intel plans to develop an M.2 Optane SSD that uses all x4 PCIe 3.0 lanes instead of x2 PCIe 3.0 lanes.
Also, we need to take out a life insurance policy on Allyn, because we want him around to do his expert comparisons when the 2.5″ U.2 Optane SSD becomes available.
If Intel ultimately commits to manufacturing Optane in all of the following form factors, we should expect it to be nothing short of disruptive (pricing aside, for now):
(a) AIC (add-in card)
(b) M.2 NVMe
(c) U.2 2.5″
(d) DIMM
I would love to know that a modern OS can be hosted by the P4800X and all future successors!
PCIe 4.0 here we go!
Hello, Allyn!
Could you tell me how you managed to tweak FIO to perform polling for the Optane P4800X under Windows?
I've only read how to do it under Linux.
Thanks a lot in advance!
Regards,
Nick