
Intel’s architecture day press release contains the following storage goodness mixed within all of the talk about 3D chip packaging:
“Memory and Storage: Intel discussed updates on Intel® Optane™ technology and the products based upon that technology. Intel® Optane™ DC persistent memory is a new product that converges memory-like performance with the data persistence and large capacity of storage. The revolutionary technology brings more data closer to the CPU for faster processing of bigger data sets like those used in AI and large databases. Its large capacity and data persistence reduces the need to make time-consuming trips to storage, which can improve workload performance. Intel Optane DC persistent memory delivers cache line (64B) reads to the CPU. On average, the idle read latency with Optane persistent memory is expected to be about 350 nanoseconds when applications direct the read operation to Optane persistent memory, or when the requested data is not cached in DRAM. For scale, an Optane DC SSD has an average idle read latency of about 10,000 nanoseconds (10 microseconds), a remarkable improvement. In cases where requested data is in DRAM, either cached by the CPU’s memory controller or directed by the application, memory sub-system responsiveness is expected to be identical to DRAM (<100 nanoseconds). The company also showed how SSDs based on Intel’s 1 Terabit QLC NAND die move more bulk data from HDDs to SSDs, allowing faster access to that data.”

Did you catch that? 3D XPoint memory in DIMM form factor is expected to have an access latency of 350 nanoseconds! That’s down from the 10 microseconds of PCIe-based Optane products like Optane Memory and the P4800X. I realize those are just numbers, and a nearly 30x latency improvement may be easier to appreciate visually, so here:

Above is an edit to my Bridging the Gap chart from the P4800X review, showing where this new tech would fall in purple. That’s all we have to go on for now, but these are certainly exciting times. Consider that non-volatile storage latencies have improved by nearly 100,000x over the last decade and are now within striking distance (less than 10x) of DRAM!

Before you get too excited, realize that Optane DIMMs will show up in enterprise servers first, as they require specialized configurations to treat DIMM slots as persistent storage rather than DRAM. That said, I’m sure the tech will eventually trickle down to desktops in some form or fashion. If you’re hungry for more details on what makes 3D XPoint tick, check out how 3D XPoint works in my prior article.
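The press release’s “applications direct the read operation to Optane persistent memory” wording refers to what Intel calls App Direct style access, where the persistent memory is mapped into an application’s address space and read with ordinary loads. As a rough illustration (my own sketch, not Intel’s API), here is a minimal C program that memory-maps a file on a hypothetical DAX-capable mount at /mnt/pmem and touches it one 64-byte cache line at a time; on a machine without persistent memory, the same code simply exercises the page cache.

/* Minimal sketch: cache-line-granularity reads from a memory-mapped
 * region. Assumes a DAX-capable filesystem mounted at /mnt/pmem
 * (hypothetical path); adjust to taste. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (64UL * 1024 * 1024)  /* 64 MiB mapping */
#define CACHE_LINE  64                    /* Optane DC PM serves 64B lines */

int main(void)
{
    int fd = open("/mnt/pmem/demo.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, REGION_SIZE) != 0) { perror("ftruncate"); return 1; }

    /* Map the region: loads now reach the media through the CPU caches
     * rather than going through the block I/O stack. */
    uint8_t *pm = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (pm == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch one byte per cache line: each access pulls a single 64B
     * line from the DIMM, which is where the ~350 ns figure applies. */
    volatile uint64_t sum = 0;
    for (size_t off = 0; off < REGION_SIZE; off += CACHE_LINE)
        sum += pm[off];

    printf("touched %lu cache lines, checksum %llu\n",
           REGION_SIZE / CACHE_LINE, (unsigned long long)sum);

    munmap(pm, REGION_SIZE);
    close(fd);
    return 0;
}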
Allyn,
Can we anticipate a capability to do a
fresh OS install to a region of NVDIMMs
like the ones you describe?
http://supremelaw.org/patents/bios.enhancements/provisional.application.1.htm
http://supremelaw.org/patents/bios.enhancements/provisional.application.2.htm
Such a region seems like an ideal place
to host an OS e.g. by enhancing a BIOS/UEFI
subsystem to support a “Format RAM” option.
The BEST WAY is to implement this option so that
it’s entirely transparent to Windows install logic:
Windows install would NOT even know that the
target C: partition is a set of Non-Volatile DIMMs.
Perhaps a group like JEDEC would consider
setting standards for hosting any OS
in such an NVDIMM “region”.
I suspect Optane DIMMs (in current enterprise form) are meant to be managed by the software layer that is accessing them, similar to SMR HDDs. I suppose it would be possible to create a block storage device driver and then install an OS to it, but that's not the intended use case at present, and OS hosting duties are served reasonably well on modern NAND / Optane PCIe devices. We're already approaching diminishing performance returns at those levels anyway. Taking real advantage of what you describe would require a complete restructuring of how an OS accesses storage – otherwise, you are just throwing away much of the latency benefits in the translation to relatively high overhead methods of accessing the media.
“Taking real advantage of what you describe would require a complete restructuring of how an OS accesses storage”

Agreed. That’s also probably evident from the latency percentile graph above, which shows RamDisk latency significantly worse than DRAM. Since RamDisks use DRAM, this shows the extra latency involved in doing storage I/O versus just reading main memory/RAM.
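To put a rough number on that overhead, here is a quick C sketch (an illustration only, not the methodology behind the chart) that pulls the same file through both paths: 4 KiB at a time via pread(), i.e. through the filesystem and block layer, versus the same 4 KiB chunks memcpy’d straight out of a memory mapping. The file name and sizes are arbitrary.

/* Illustration only: the same data read through the storage stack
 * (pread) versus copied straight out of a memory mapping. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE (256UL * 1024 * 1024)  /* 256 MiB scratch file */
#define IO_SIZE   4096                   /* one 4 KiB "I/O" per iteration */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    int fd = open("scratch.bin", O_RDWR | O_CREAT, 0644);  /* arbitrary scratch file */
    if (fd < 0 || ftruncate(fd, FILE_SIZE) != 0) { perror("setup"); return 1; }

    char *buf = malloc(IO_SIZE);
    uint8_t *map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (buf == NULL || map == MAP_FAILED) { perror("alloc/mmap"); return 1; }

    /* Path 1: each 4 KiB goes through a syscall, the VFS, and the block
     * layer before it lands in buf. */
    double t0 = now_sec();
    for (off_t off = 0; off < (off_t)FILE_SIZE; off += IO_SIZE)
        if (pread(fd, buf, IO_SIZE, off) != IO_SIZE) { perror("pread"); return 1; }
    double t_pread = now_sec() - t0;

    /* Path 2: the same chunks copied with plain loads/stores from the
     * mapping (page cache / DRAM), with no per-I/O software stack. */
    volatile unsigned char sink = 0;
    t0 = now_sec();
    for (size_t off = 0; off < FILE_SIZE; off += IO_SIZE) {
        memcpy(buf, map + off, IO_SIZE);
        sink += (unsigned char)buf[0];   /* keep the copy from being optimized out */
    }
    double t_memcpy = now_sec() - t0;

    printf("pread path:  %.3f s\nmemcpy path: %.3f s (sink=%u)\n",
           t_pread, t_memcpy, (unsigned)sink);

    munmap(map, FILE_SIZE);
    free(buf);
    close(fd);
    return 0;
}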
How about this sequence:
(1) configure a ramdisk in the uppermost 64GB of DRAM;
(2) run “Migrate OS” using Partition Wizard;
(3) re-boot into the migrated OS resident in the ramdisk.
The only other BIG changes are modifications
to the motherboard BIOS/UEFI subsystem,
to detect and boot from this ramdisk OS, and
a general-purpose device driver like the
one that supports RamDisk Plus from http://www.superspeed.com
(my favorite, as you know).
Wendell describes 128GB in a Threadripper system here:
https://www.youtube.com/watch?v=HDLhdKmV3Vo
The original Provisional Patent Application assumed
volatile DRAM, which had its own special problems
of course e.g. at SHUTDOWN.
It seems to me that the availability of non-volatile
memory on a DDR4 bus is the really BIG CHANGE
that obtains with Optane DIMMs.
MAX HEADROOM with 4 x M.2 SSDs in RAID-0
installed on the ASRock Ultra Quad AIC is 15,753.6 MBps.
By comparison, DDR4-3200 x 8 = 25,600 MBps, and
even faster DDR4 kits have been announced by G.SKILL et al.:
DDR4-4000 x 8 = 32,000 MBps.
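For anyone checking that arithmetic: a DDR4 channel is 64 bits (8 bytes) wide, so its theoretical peak is simply the transfer rate in MT/s times 8 bytes, and those figures are per channel; a quad-channel Threadripper board would, in theory, multiply them by four. A trivial C sketch of the comparison, using the RAID-0 number quoted above:

/* Back-of-the-envelope check of the bandwidth figures above:
 * peak DDR4 channel bandwidth = transfer rate (MT/s) x 8 bytes. */
#include <stdio.h>

int main(void)
{
    const double raid0_mbps = 15753.6;        /* 4 x M.2 RAID-0 figure quoted above */
    const int ddr4_rates[] = { 3200, 4000 };  /* MT/s */
    const int bytes_per_transfer = 8;         /* 64-bit channel = 8 bytes */
    const int n = sizeof ddr4_rates / sizeof ddr4_rates[0];

    for (int i = 0; i < n; i++) {
        double mbps = (double)ddr4_rates[i] * bytes_per_transfer;
        printf("DDR4-%d: %8.1f MBps per channel (%.2fx the RAID-0 array)\n",
               ddr4_rates[i], mbps, mbps / raid0_mbps);
    }
    return 0;
}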
However, I don’t really see the need for a
“complete restructuring of how an OS accesses storage”.
Honestly, without doing the appropriate experiments,
I suspect that the latter characterization may be
closer to a “straw man”.
Commercial device drivers are already available
in software like DATARAM: http://www.dataram.com/
In the interests of computer science (if nothing else),
I would certainly like to see this experiment
performed on a TR system like Wendell’s.
Of course, we’d need to have access to the BIOS/UEFI
code, in order to compile and flash an experimental version
that recognizes the new location of the bootstrap loader, etc.
Maybe we could submit a proposal to Ryan after he
starts working at Intel; it certainly has the
resources necessary to do this experiment. And,
such an experiment seems to fit his job description
there.
If not Intel, then maybe AMD?
Thanks for listening! /s/ Paul
On a TR system with LOTSA DDR4, LOTSA possibilities
come to my mind e.g.:
(a) it may be possible to dedicate one or more CPU cores
to the ramdisk device driver: that way, the raw code
would “migrate” into each core’s internal caches
for extra computational speed (see the sketch after this list);
(b) starting with 128GB, the lowest 4GB of DRAM addresses
could serve as “initialization” RAM; after doing the
“Migrate OS” step, the entire 124GB remaining could be
formatted as a single C: partition (which is a very
common practice on many PCs);
(c) after the “Migrate OS” step is completed successfully,
there are 2 copies of the OS i.e. “mirrored” —
one of which is hosted on conventional Nand Flash SSDs and
one of which is hosted in the ramdisk C: partition;
(d) if/when the ramdisk OS develops problems e.g. virus,
then simply boot from the conventional SSDs and restore a
valid drive image to the ramdisk C: partition.
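Regarding (a), here is a minimal user-mode Windows C sketch of the core-pinning idea: a worker thread (standing in for the ramdisk driver’s service thread) sets its own affinity to core 0 so its code and hot data tend to stay resident in that core’s caches. An actual kernel-mode ramdisk driver would set its affinity from kernel code instead; this just shows the mechanism.

/* User-mode illustration of dedicating a core to a worker thread. */
#include <stdio.h>
#include <windows.h>

static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    /* Pin this thread to core 0 (affinity mask bit 0). */
    if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    /* A real ramdisk driver would run its copy/service loop here. */
    printf("worker pinned to core 0\n");
    return 0;
}

int main(void)
{
    HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    if (h == NULL) { printf("CreateThread failed: %lu\n", GetLastError()); return 1; }
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}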
Our main workstation now has 4 copies of our Windows OS,
restored to the primary NTFS partition on 4 different drives. With this setup, it has been very easy to
change the boot device in the BIOS whenever we need
to restore a drive image to the main C: partition.
We developed the latter setup chiefly because
the CD-ROM software for restoring a drive image
is terribly slow and time-consuming.
Re: (b) starting with 128GB, the lowest 4GB of DRAM addresses
could serve as “initialization” RAM;
That is an extreme case:
the amount of conventional RAM
that is NOT assigned to the ramdisk
is a design decision dictated
by the intended use case(s).
Clearly, the amount of RAM
assigned to the ramdisk and
the amount of RAM NOT assigned
to the ramdisk are in a
“zero sum” relationship.
FYI: a summary page published by computer scientists
at North Carolina State University, Computer Science Dept.:
http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc17poster2.pdf
This research article is a little dated, but
it does cover a number of related issues with NVM:
https://www.researchgate.net/publication/270282794_Opportunities_for_Nonvolatile_Memory_Systems_in_Extreme-Scale_High_Performance_Computing
I believe Microsoft has designed Server 2019 to use these as Cache Drives for Storage Spaces Direct. They called it Persistent Memory or NVDIMM-N. And they talked about it at this year’s Ignite.
https://myignite.techcommunity.microsoft.com/sessions/65882
Allyn, your chart shows DRAM at ~17M IOPS.
Off the top of your head, how much does
that latency measure vary, in your experience?
The blue line appears to be rather constant
i.e. no tapering off at the top of that chart
(between 90% and 100%: compare the RAMDISK
green line).
The '~' figures are approximations based on the documented latency of the parts at the far left of the chart. I was basing them on the number of clock cycles needed to access the various levels of cache, etc.
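For context on how those approximations turn into IOPS (the latencies below are rough, typical published values, not measurements from the chart): an access latency converts to a single-threaded ops/sec ceiling as 1e9 divided by the latency in nanoseconds, which is how ~60 ns DRAM works out to roughly 17M IOPS.

/* Rough conversion from access latency to a single-threaded IOPS ceiling. */
#include <stdio.h>

int main(void)
{
    const struct { const char *name; double latency_ns; } levels[] = {
        { "L1 cache",                 1.0 },   /* ~4 cycles at ~4 GHz */
        { "L2 cache",                 4.0 },
        { "DRAM",                    60.0 },   /* ~60 ns -> ~17M ops/sec */
        { "Optane DIMM (claimed)",  350.0 },
        { "Optane SSD",           10000.0 },
    };
    const int n = sizeof levels / sizeof levels[0];

    for (int i = 0; i < n; i++)
        printf("%-22s %8.0f ns -> %10.2f M ops/sec\n",
               levels[i].name, levels[i].latency_ns,
               1e9 / levels[i].latency_ns / 1e6);
    return 0;
}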