Introduction and Specifications
XPoint Tested. Finally!
XPoint. Optane. QuantX. We've been hearing these terms thrown around for two years now. All refer to a form of 3D stackable non-volatile memory that promised 10x the density of DRAM and 1000x the speed and endurance of NAND. These were bold statements, and over the following months, we would see them misunderstood and misconstrued by many in the industry. These misconceptions were further amplified by some poor demo choices on the part of Intel (countered, thankfully, by some better choices made by Micron). Fortunately, cooler heads prevailed as Jim Handy and other industry analysts helped explain that a 1000x improvement at the die level does not translate to the same improvement at the device level, especially when the first round of devices must comply with what will soon become a legacy method of connecting a persistent storage device to a PC.
Did I just suggest that PCIe 3.0 and the NVMe protocol, developed just for high-speed storage, are already legacy tech? Well, sorta.
That 'Future NVM' bar at the bottom of the chart was a 2-year-old prototype iteration of what is now Optane. Note that while NVMe was able to shrink down the yellow bar a bit, as you introduce faster and faster storage, the rest of the equation (meaning software, including the OS kernel) starts to have a larger and larger impact on limiting the ultimate speed of the device.
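To put rough numbers on that point, here is a minimal sketch of how a fixed software cost takes up a growing share of each request as the media gets faster. The microsecond figures are made up purely for illustration and are not taken from the chart or from any measurement; `software_share` is a hypothetical helper name.

```python
def software_share(media_us: float, software_us: float) -> float:
    """Fraction of total request latency spent in software (driver, kernel, etc.)."""
    return software_us / (media_us + software_us)


# Hypothetical, fixed software overhead per request (NOT a measured value).
SOFTWARE_US = 5.0

if __name__ == "__main__":
    # Hypothetical media latencies, shrinking as the storage gets faster.
    for media_us in (95.0, 45.0, 5.0):
        share = software_share(media_us, SOFTWARE_US)
        print(f"media {media_us:5.1f} us -> software is {share:.0%} of total latency")
```

The software cost never changes in this model, yet it goes from a rounding error to half of the total once the media latency drops to the same order of magnitude.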
NAND Flash simplified schematic (via Wikipedia)
Before getting into the first retail product to push all of these links in the storage chain to the limit, let's explain how XPoint works and what makes it faster. Taking random writes as an example, NAND Flash (above) must program cells in pages and erase cells in blocks. As modern flash has increased in capacity, the sizes of those pages and blocks have scaled up roughly proportionally. At present we are at pages larger than 4KB and block sizes in the megabytes. When it comes to randomly writing to an already full section of flash, simply changing the contents of one byte on one page requires the clearing and rewriting of the entire block. The ratio of what the flash had to rewrite to accomplish that operation to what you actually wanted to write is called the write amplification factor. It's something that must be dealt with when it comes to flash memory management, but for XPoint it is a completely different story:
XPoint is bit addressable. The 'cross' structure means you can select very small groups of data via Wordlines, with the ultimate selection resolving down to a single bit.
Since the programmed element effectively acts as a resistor, its output is read directly and quickly. Even better, none of that write amplification nonsense mentioned above applies here at all. There are no pages or blocks. If you want to write a byte, go ahead. Better still, the bits can be changed regardless of their former state, meaning no erase or clear cycle must take place before writing; you just overwrite what was previously stored. Is that 1000x faster / 1000x more write endurance than NAND thing starting to make more sense now?
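The single-byte example can be worked through as a minimal sketch. The 4 MiB block size is an assumption chosen only to match "block sizes in the megabytes", the function names are hypothetical, and the NAND case is the worst case described above (a full block with no spare area, so the whole block is cleared and rewritten). Real controllers use spare area and garbage collection to do much better than this.

```python
# Assumed erase block size, for illustration only; real parts vary.
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB


def nand_worst_case_bytes_written(host_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes rewritten when an update lands in an already full block.

    Simplification: the write is assumed to start at a block boundary, and
    every block it touches is cleared and rewritten in full.
    """
    blocks_touched = -(-host_bytes // block_size)  # ceiling division
    return blocks_touched * block_size


def in_place_bytes_written(host_bytes: int) -> int:
    """Bytes written by a medium that overwrites in place, with no erase step."""
    return host_bytes


def write_amplification(media_bytes: int, host_bytes: int) -> float:
    """Bytes the media had to write per byte the host asked to write."""
    return media_bytes / host_bytes


if __name__ == "__main__":
    host = 1  # change a single byte
    print("NAND worst case:", write_amplification(nand_worst_case_bytes_written(host), host))
    print("In-place media: ", write_amplification(in_place_bytes_written(host), host))
```

Under these assumptions, changing one byte costs a full block of writes on the NAND side and exactly one byte on the in-place side, which is where the endurance argument comes from.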
Ok, with all of the background out of the way, let's get into the meat of the story. I present the P4800X:
Yes I know, don't tell me, that's not a photo that I took. Turns out that Intel only has enough of these to currently sample those who are actually developing software that can take full advantage of the new tech (like VMware). When I was at Intel's Folsom campus a few weeks back, I was shown a server loaded with a P4800X and a P3700 for comparison. For the past few weeks, I have had remote access to this server and have tested the P4800X with extreme prejudice.
(Editor's Note: It is worth pointing out that this testing method is not ideal and is not something we would have recommended or suggested to Intel. However, with the alternative being not testing the product at all, we decided it was worth telling the story of Optane to our readers regardless of the testing process involved. As a sanity check, since Intel had a P3700 in the remote system, we double checked its performance against a local system with the exact same processor and P3700, and results were within a 1% margin, giving us as good an indication as any that nothing "funny" was going on with the test system.)
Despite repeated requests, Intel was unwilling to share the complete datasheet with us. Above are the specs listed on their abbreviated product brief.
Thanks for the review (pre-consumer) of Optane, which I had been waiting on for a while now. First none and now two, one on another site that I respect. Big thanks for the latency graphing from 1 clock cycle to a floppy drive. Very informative, and something I was wondering about after getting the picture from Intel that it could be a go-between for storage and DIMMs. You test at very high queue depths but seem to state that some testing for a web server is not the best idea. Isn't it true that a web server is the only place where high queue depths are to be seen? If so, and the queue depths normally seen are much lower, where would one expect to see such high queue depths? Or is it, as you seem to say, just a test to test?
Thanks for the article. I will have to wait for you to test again when you get one in your hands and likely find that consumers are at the door of another exponential shift, like the one where SSDs were used as boot drives when the price came down. We will more than likely start placing our OSes on Optane drives in our SSD systems to gain additional quickness.
When they become available, PCPer “must” see what it will take to boot a computer in a second with an Optane boot drive. With an SSD, 10 seconds is possible. Nuff said.
Regarding high QDs, there are some rare instances, and it is possible that a web server could be hitting the flash so hard that it reaches such a high QD, but if that happens I'd argue that the person speccing out that machine did not give it nearly enough RAM cache (and go figure, this new tech can actually help there as well since it can supplement / increase effective RAM capacity if configured appropriately).
Regarding why I'm still testing NVMe parts to QD=256, it's mostly due to NVMe NAND part specs for some products stretching out that high. I have to at least match the workloads and depths that appear in product specs in order to confirm / verify performance to those levels.
I'm glad you saw benefit in the 'bridging the gap' charts. Fortunately, my floppy drive still works and I was able to find a good disk! :) I had to go through three zip disks before finding one without the 'click of death'!
Hey great work here A, as usual.
Ditto that, Allyn: you are THE BEST!
> In the future, a properly tuned driver could easily yield results matching our ‘poll’ figures but without the excessive CPU overhead incurred by our current method of constantly asking the device for an answer.
The question that arose for me from your statement above:
With so many multi-core CPUs proliferating,
would it help at all if a sysadmin could
“lock” one or more cores to the task
of processing the driver for this device?
The OS would then effectively “quarantine”
i.e. isolate that dedicated core from scheduling
any other normally executing tasks.
Each modern core also has large integrated caches,
e.g. L2 cache.
As such, it occurred to me that the driver
for this device would migrate its way
into the L2 cache of such a “dedicated” core
and help reduce overall latency.
Is this worth consideration, or am I out to lunch here?
Again, G-R-E-A-T review.
Locking a core to storage purposes would sort of help, except you would then have to communicate across cores with each request, which may just be robbing Peter to pay Paul. The best solution is likely a hybrid between polling and IRQ, or polling that has waits pre-tuned to the device to minimize needlessly spinning the core. Server builders will likely not want to waste so many resources constantly polling the storage anyway, so the more efficient the better here.
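The "polling with pre-tuned waits" idea can be sketched as a simple timing model. This is not how any real driver is implemented; `pure_poll`, `hybrid_poll`, and the `sleep_fraction` parameter are hypothetical names, and the model ignores the cost of waking from sleep. It only shows the trade between spinning CPU time and added latency.

```python
from typing import Tuple


def pure_poll(actual_us: float) -> Tuple[float, float]:
    """Spin from submission until completion.

    Returns (observed latency, CPU time spent spinning), both in microseconds.
    """
    return actual_us, actual_us


def hybrid_poll(actual_us: float, expected_us: float,
                sleep_fraction: float = 0.5) -> Tuple[float, float]:
    """Sleep for a pre-tuned fraction of the expected latency, then spin.

    If the device finishes while the core is still asleep, the completion is
    only noticed at wake-up, so the observed latency equals the sleep time
    and no spinning is needed.
    """
    sleep_us = expected_us * sleep_fraction
    if actual_us <= sleep_us:
        return sleep_us, 0.0
    return actual_us, actual_us - sleep_us


if __name__ == "__main__":
    # Hypothetical device that usually completes in about 10 us.
    print("pure poll:      ", pure_poll(10.0))
    print("hybrid, on time:", hybrid_poll(10.0, expected_us=10.0))
    print("hybrid, early:  ", hybrid_poll(4.0, expected_us=10.0))
```

When the wait is tuned well, the observed latency matches pure polling while the spin time is cut in half; when the device finishes earlier than expected, the request pays a small latency penalty instead.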
“Whether you want to squeak out some extra Windows 7 performance on your multi-core processor or run older programs flawlessly, you can set programs to run on certain cores in your processor. In certain situations this process can dramatically speed up your computer’s performance.”
I did some experimentation with setting of affinity on the server, and I was able to get latency improvements similar to polling, but there were other consequences such as not being able to reach the same IOPS levels per thread (typical IO requests can be processed by the kernel faster if the various related processes are allowed to span multiple threads). Room for improvement here but not as simple as an affinity tweak is all.
PCPER.com announces the first ever CLONE AUCTION:
This auction will offer exact CLONES of Allyn Malventano,
complete with his entire computing experience intact.
Minimum starting bid is $1M USD. CASH ONLY.
Truly, Allyn, you are a treasure to the entire PC community.
I’m glad there was at least one comparison with the 960 pro, which is the most interesting graph in the article. I just wish there were more comparisons.
Your additional answers are coming soon!
Speaking of comparisons, I am now very curious to know if Intel plans to develop an M.2 Optane SSD that uses all x4 PCIe 3.0 lanes instead of x2 PCIe 3.0 lanes.
Also, we need to take out a life insurance policy on Allyn, because we want him around to do his expert comparisons when the 2.5″ U.2 Optane SSD becomes available.
If Intel ultimately commits to manufacturing Optane in all of the following form factors, we should expect it to be nothing short of disruptive (pricing aside, for now):
(a) AIC (add-in card)
(b) M.2 NVMe
(c) U.2 2.5″
I would love to know that a modern OS can be hosted by the P4800X and all future successors!
PCIe 4.0 here we go!
Could you tell me how you managed to tweak FIO to perform polling for the Optane P4800X under Windows?
I’ve only read how to do it under Linux.
Thanks a lot in advance!