You may have seen a wave of Micron 3D NAND news posts these past few days, and while many are repeating 11-month-old news with talk of 10TB 2.5" and 3.5TB M.2 form factor SSDs, I'm here to dive into the bigger implications of what the upcoming (and future) generations of Intel / Micron flash will mean for SSD performance and pricing.
Remember that as per-die capacities climb, the only way to get a high performance and high capacity SSD on the cheap in the future will be to actually buy those higher capacity models. With such a large per-die capacity, smaller SSDs (like 128GB / 256GB) will suffer significantly slower write speeds. Taking this upcoming Micron flash as an example, a 128GB SSD would contain only four flash memory dies, and as I wrote back in 2014, such an SSD would likely see HDD-level sequential write speeds of 160MB/sec. Other SSD manufacturers already recognize this issue and are taking steps to correct it. At Storage Visions 2016, Samsung briefed me on the upcoming SSD 750 Series, which will use planar 16nm NAND to produce 120GB and 250GB capacities. The smaller die capacities of those models will deliver respectable write performance while letting Samsung discontinue their 120GB 850 EVO as that line transitions to higher capacity 48-layer VNAND. Getting back to this Micron announcement, we have some new info that bears analysis, and that pertains to the now announced page and block sizes:
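To put rough numbers on that die-count math (a back-of-the-envelope sketch only; the ~40MB/sec effective program throughput per die is an assumed figure chosen to line up with the 160MB/sec estimate above, not a published Micron spec):

```python
# Back-of-the-envelope: sequential write ceiling of small SSDs built from
# 256Gb (32GB) dies. Per-die throughput is an assumed illustrative figure.
DIE_CAPACITY_GB = 32          # one 256Gb die
PER_DIE_WRITE_MBPS = 40       # assumed effective program throughput per die

for ssd_capacity_gb in (128, 256, 512, 1024):
    dies = ssd_capacity_gb // DIE_CAPACITY_GB
    # With enough channels, dies program concurrently, so the peak
    # sequential write rate scales roughly with die count.
    print(f"{ssd_capacity_gb:>4}GB SSD: {dies:>2} dies -> "
          f"~{dies * PER_DIE_WRITE_MBPS}MB/s peak sequential write")
```

The 128GB case lands at four dies and roughly 160MB/sec, which is why small SSDs built on huge dies end up in HDD territory.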
- 256Gb MLC: 16KB Page / 16MB Block / 1024 Pages per Block
- 384Gb TLC: 16KB Page / 24MB Block / 1536 Pages per Block
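As a quick sanity check that those figures hang together (a trivial sketch; the constants are just the ones from the list above):

```python
# Sanity check: Pages-per-Block x Page size should equal the Block size.
KB, MB = 1024, 1024**2

for name, page_kb, pages_per_block, block_mb in (
    ("256Gb MLC", 16, 1024, 16),
    ("384Gb TLC", 16, 1536, 24),
):
    assert page_kb * KB * pages_per_block == block_mb * MB
    print(f"{name}: {page_kb}KB x {pages_per_block} pages = {block_mb}MB block")
```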
To understand what these numbers mean, using the MLC line above, imagine a 16MB CD-RW (the Block) that can hold 1024 individual 16KB 'sessions' (the Pages). Each 16KB Page can be added individually over time, and just as files on a CD-RW could be 'modified' by writing a new copy to the remaining space, flash does so by writing a new Page and ignoring the out-of-date copy. The rub comes when that CD-RW (Block) is completely full. The process at that point is actually very similar, in that the Block must be completely emptied before the erase command (which wipes the entire Block) is issued. The data has to go somewhere, which typically means writing it to empty Blocks elsewhere on the SSD (and in worst case scenarios, those too may need clearing first), and all of this moving and erasing takes time for the die to accomplish. Just as wiping a CD-RW took much longer than writing a single file to it, erasing a Block typically takes 3-4x as long as programming a Page.
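Here is that Block/Page behavior as a toy model in Python (purely illustrative; the tiny 8-page block and the class/method names are invented for the demo):

```python
# Toy model: pages program individually, updates leave stale copies behind,
# and reclaiming a full block means relocating live pages before the erase.

class Block:
    def __init__(self, pages_per_block=1024):
        self.capacity = pages_per_block
        self.pages = []          # (logical address, still live?) in program order

    def program(self, lba):
        if len(self.pages) >= self.capacity:
            raise RuntimeError("block full: relocate live pages, then erase")
        # No overwrite-in-place: any old copy of this address just goes stale.
        self.pages = [(a, live and a != lba) for a, live in self.pages]
        self.pages.append((lba, True))

    def live(self):
        return [a for a, is_live in self.pages if is_live]

    def erase(self):
        # Wipes the whole block; typically ~3-4x a page program in time.
        self.pages = []

blk = Block(pages_per_block=8)       # tiny block just for the demo
for lba in (0, 1, 2, 0, 1, 0):       # repeated updates to addresses 0 and 1
    blk.program(lba)
print(f"pages consumed: {len(blk.pages)}, live data: {blk.live()}")
# -> pages consumed: 6, live data: [2, 1, 0] -- half the block is stale copies
```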
With that explained, what's significant here are the growing Page and Block sizes of this higher capacity flash. Modern OS file systems have a minimum bulk access size of 4KB, and Windows versions since Vista align their partitions by rounding up to the next 1MB increment from the start of the disk. These changes are what enabled HDDs to transition to Advanced Format, which made data storage more efficient by bringing the increment up from the 512 Byte sector to 4KB. While most storage devices still use 512B addressing, 4KB can be assumed to be the minimum random access seen most of the time. Wrapping this all together, the Page size (minimum read or write) of this new flash is 16KB, which is 4x the accepted 4KB minimum OS transfer size. This means that power users heavy on their page file, running VMs, or performing any other random-write-heavy operations over time will see amplified wear on this flash. The additional shuffling of data that must take place for each 4KB write also translates to lower host random write speeds when compared to lower capacity flash with smaller Page sizes closer to that 4KB figure.
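A quick sketch of that sub-page penalty (simplified on purpose: it ignores the write coalescing that controllers do in DRAM buffers, which softens this in practice):

```python
# How much NAND traffic does a small host write generate when the minimum
# program unit is a 16KB page? A sub-page update still burns a whole page.
PAGE_KB = 16

for host_write_kb in (4, 8, 16, 64):
    pages_programmed = -(-host_write_kb // PAGE_KB)   # ceiling division
    nand_kb = pages_programmed * PAGE_KB
    print(f"{host_write_kb:>3}KB host write -> {nand_kb:>3}KB programmed "
          f"(amplification {nand_kb / host_write_kb:.1f}x)")
```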
A rendition of 3D IMFT Floating Gate flash, with inset pulling back some of the tunnel oxide layer to show the location of the floating gate. Pic courtesy Schiltron.
Fortunately for Micron, their choice to carry Floating Gate technology into their 3D flash has netted them some impressive endurance benefits over competing Charge Trap Flash. One such benefit is a claimed 30,000 P/E (Program / Erase) cycle endurance rating. Planar NAND had dropped to the 3,000 range at its smallest shrinks, mainly because the shrunken floating gate could only store so few electrons, amplifying the (negative) effects of electron leakage. Even back in the 50nm days, MLC ran at ~10,000 cycle endurance, so 30,000 is no small feat here. The key is that applying the same Floating Gate tech that was so good at controlling leakage in planar NAND to a 3D structure that can store far more electrons enables excellent endurance, which may actually exceed that of Samsung's Charge Trap Flash equipped 3D VNAND. This should effectively negate the endurance hit from the larger Page sizes discussed above, but the potential small random write performance hit still stands, with a possible remedy being to crank up the over-provisioning of SSDs (AKA throwing flash at the problem). Higher OP means fewer active Pages per Block and a reduction in the data shuffling forced by smaller writes.
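To see why throwing flash at the problem works, here is a toy greedy garbage-collection simulation (the block geometry, write counts, and OP points are all made up for illustration; it demonstrates the trend only, not any real controller's behavior):

```python
import random

def simulate_wa(op_fraction, blocks=128, ppb=64, host_writes=200_000, seed=0):
    """Toy greedy-GC flash model under uniform random single-page writes.
    Returns write amplification = physical page programs / host writes."""
    rng = random.Random(seed)
    logical = int(blocks * ppb * (1 - op_fraction))  # user-visible pages
    loc = [None] * logical            # logical page -> block holding its copy
    live = [set() for _ in range(blocks)]
    free = list(range(blocks))
    active, used = free.pop(), 0      # block currently being programmed
    physical = 0

    def program(lp):
        nonlocal active, used, physical
        if used == ppb:               # active block full: take a fresh one
            active, used = free.pop(), 0
        if loc[lp] is not None:
            live[loc[lp]].discard(lp)     # old copy goes stale
        live[active].add(lp)
        loc[lp] = active
        used += 1
        physical += 1

    for _ in range(host_writes):
        if not free:                  # out of erased blocks: greedy GC
            victim = min((b for b in range(blocks) if b != active),
                         key=lambda b: len(live[b]))
            survivors = list(live[victim])
            live[victim].clear()
            free.append(victim)       # 'erase' (survivors held in controller RAM)
            for lp in survivors:      # relocations are extra physical writes
                program(lp)
        program(rng.randrange(logical))
    return physical / host_writes

for op in (0.07, 0.12, 0.28, 0.50):
    print(f"OP {op:4.0%}: write amplification ~{simulate_wa(op):.2f}")
```

More spare area means GC victims carry fewer live pages, so each erase reclaims more space per page copied, which is exactly the "less data shuffling" effect described above.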
A 25nm flash memory die. Note the support logic (CMOS) along the upper left edge.
Another thing helping Micron here is that their Floating Gate design also enables shifting 75% of the CMOS circuitry to a layer *underneath* the flash storage array. This logic is typically part of what you see 'off to the side' of a flash memory die. Layering CMOS logic in such a way is likely thanks to Intel's partnership and CPU development knowledge. Moving this support circuitry to the bottom layer of the die means less die area dedicated to non-storage, more dies per wafer, and ultimately a lower cost per chip/GB.
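For a rough feel of the economics, here is a sketch using entirely assumed die dimensions (Micron has not published the array or CMOS areas) and a common gross-die-per-wafer approximation:

```python
import math

# Assumed, illustrative areas: a hypothetical storage array plus peripheral
# CMOS logic. Only the relative change matters, not the absolute numbers.
ARRAY_MM2, CMOS_MM2 = 140.0, 30.0

def gross_dies(die_area_mm2, wafer_d_mm=300):
    # Common gross-die approximation: wafer area term minus an edge-loss term.
    return int(math.pi * wafer_d_mm ** 2 / (4 * die_area_mm2)
               - math.pi * wafer_d_mm / math.sqrt(2 * die_area_mm2))

before = gross_dies(ARRAY_MM2 + CMOS_MM2)          # all CMOS beside the array
after = gross_dies(ARRAY_MM2 + CMOS_MM2 * 0.25)    # 75% of CMOS tucked underneath
print(f"dies per wafer: {before} -> {after} ({after / before - 1:+.0%})")
```

With these made-up areas the shift yields on the order of 15% more dies per wafer, which flows straight into cost per GB.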
Samsung's Charge Trap Flash, shown in both planar and 3D VNAND forms.
One final thing before we go. If we know anything about how the Intel / Micron duo functions, it is that once they get that freight train rolling, it leads to relatively rapid advances. In this case, the changeover to 3D has taken them a while to perfect, but once production gains steam, we can expect to see some *big* advances. Since Samsung launched their 3D VNAND, their gains have been mostly iterative in nature (24, 32, and most recently 48 layers). I'm not yet at liberty to say how the second generation of IMFT 3D NAND will achieve it, but I can say that the next iteration after this 32-layer 256Gb (MLC) / 384Gb (TLC) per die will *double* to 512Gb / 768Gb (you are free to do the math on what that means for layer count). Remember back in the day when Intel launched new SSDs at a fraction of the cost/GB of the previous generation? That might just be happening again within the next year or two.
Nice write up! It’s very exciting to see solid state storage technology maturing and becoming ubiquitous. Spinning platters may still have the lowest cost per GB but it looks like that may end soon.
Soon? Probably not. They’ve still got more than an order of magnitude to go, and HDDs still have some advancements on the horizon as well.
As long as this affects worldwide prices directly and I'll be able to buy a 512GB SSD for just $52 in the next four years – I'm fine with this. Otherwise, Samsung did it first, so…"FIRST!" (c) Samsung
How important is the write speed to most consumer applications? It doesn't seem like lower write speed will be that big of an issue, except for some specialized applications. Also, has there been any more info about Intel's X-point technology? If that comes out in a reasonable time frame and performs as claimed, then it may take the small SSD market. It would be more expensive, but the performance could be significantly higher. It isn't going to make as big of a difference as the jump from hard drive to flash, but people seem willing to pay more for high performance SSDs even though the humanly noticeable differences are small. Small flash based devices may only be for very low budget solutions where buyers may not care that much about write speed. They may still be used for a lot of OEM systems, unfortunately.
The need to keep write speeds on the high end is more for those cases where a sequential write is overwriting previously randomly written data, or areas that are frequently randomly written. In those cases, we get only a fraction of the max rated (sequential) write speeds. I've seen this happen when a consumer SSD was left to see a trickle of small writes over a long period of time (normal Windows operation) and then showed horrible sequential write speeds to that previously fragmented flash. Larger capacity SSDs have more dies that can be simultaneously busy with write operations, so the load is spread thinner and we get higher overall write speeds even in those worst case scenarios. You don't have to be a power user to get into this state, and due to the time / pacing required to replicate it, it's difficult to test for.
We have no further details on XPoint yet, but a few of those dies would make for an easy way to buffer chunks of small writes while keeping a bunch of very high capacity TLC dies as the low cost flash that can still be read from quickly.
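Something like this hypothetical policy, sketched in Python (the class name, size cutoffs, and flush chunk are all invented for illustration; no such product exists yet):

```python
# Hypothetical hybrid layout: small random writes land in a fast persistent
# buffer (e.g. a few XPoint dies) and are later flushed to high-capacity TLC
# in large, sequential, TLC-friendly chunks.

SMALL_WRITE_KB = 16        # assumed cutoff for "small" writes
FLUSH_CHUNK_KB = 512       # assumed flush granularity to the TLC tier

class HybridDrive:
    def __init__(self):
        self.buffer = []            # (lba, size_kb) queued in the fast tier
        self.buffered_kb = 0

    def write(self, lba, size_kb):
        if size_kb <= SMALL_WRITE_KB:
            self.buffer.append((lba, size_kb))    # absorb in the fast tier
            self.buffered_kb += size_kb
            if self.buffered_kb >= FLUSH_CHUNK_KB:
                self.flush()
        else:
            self.write_tlc([(lba, size_kb)])      # big writes go straight through

    def flush(self):
        # Coalesce many small writes into one sequential burst to TLC.
        self.write_tlc(sorted(self.buffer))
        self.buffer, self.buffered_kb = [], 0

    def write_tlc(self, extents):
        print(f"TLC burst: {sum(s for _, s in extents)}KB "
              f"across {len(extents)} extents")

drive = HybridDrive()
for i in range(70):
    drive.write(lba=i * 8, size_kb=8)   # trickle of 8KB random writes
```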
I guess an x-point/flash hybrid device could be a good combination, even for larger devices. I haven't really considered the hybrid flash hard drives to be a good idea though; it seems much better to set up the flash and the hard drive separately. With a hybrid flash/x-point device, the x-point could handle frequently written data without using up flash program/erase cycles. It is unclear if we will see such a device though. We may just have small, low-end flash drives, with x-point hybrid devices being a high-end enthusiast option; it depends on how much x-point ends up costing.
They seem to be headed more towards making hybrid DRAM modules of some type rather than hybrid flash devices. I am not sure what would be best to provide expandable memory to something like an HBM APU though. I have wondered if it would be best to make a socketed HMC or PCI-e like interface using an m.2 like form factor for adding both DRAM and non-volatile storage. With possibly many gigabytes of memory on package, off-package DRAM gets pushed further out in the memory hierarchy, so it may end up being treated more like an SSD used for swap. HMC doesn't seem to be designed to go through a slot or socket connector though. Perhaps they could mount such memory right next to the processor socket, in the area under the heat sink, since an m.2 style form factor mounts flat to the board. This would keep the traces very short, which may allow HMC style connections. Hopefully we will not have too many DOA standards like SATA Express seems to be. It has been a long time since we have had a major change in memory organization, and I think it is overdue.
Wild speculation, I know, but some of the tech coming out now represents the most interesting changes we have had in a long time. The last real change was when they moved the memory controller directly onto the processor die, and that seems minor in comparison.
I'd like to see DRAM modules getting a large amount of XPoint to host the OS paging file; that should reduce the need for slower NAND for that OS functionality. Maybe the XPoint on a DIMM could host other things as well, but for the OS page file alone it could enable much faster operation when system RAM becomes fully loaded and page faults start to have a negative influence on system performance. If you have ever worked on graphics workloads, you know the system memory can easily be filled to overflowing by the graphics application's requests for more memory, with the system starting to thrash and very little left in the way of responsiveness!
Having the nonvolatile XPoint on the DRAM module, with some background ability to swap the XPoint-hosted page files in and out of DRAM, would be helpful if it avoided adding to the main memory channel bandwidth usage between the processor and the DRAM. That would allow the processor/OS to dispatch paging requests and do other work while the XPoint/DRAM subsystems took responsibility for moving the data between XPoint and DRAM without taxing the processor/DRAM memory channel's available bandwidth.
Certainly, using XPoint as a larger on-SSD cache in front of the slower NAND will probably be one of the first uses for the newer Micron/Intel XPoint NVM, and it would give the SSD's controller more time for better NAND management, keeping the SSD from becoming too fragmented while still maintaining better R/W speeds. A sufficiently large XPoint cache on an SSD would allow for much faster random reads and writes, with the controller able to quickly buffer the writes while prioritizing read requests, deferring the writing to NAND until the SSD has available resources so as not to degrade read performance. With a large enough on-SSD XPoint cache relative to the system's needs, the NAND could be better managed, with the XPoint taking much of the load and wear off the NAND, which has neither the endurance nor the speed of XPoint.