Enterprise SSD Testing and Jargon
While enterprise SSDs typically share controller and flash memory architecture with consumer products shipping from the same company, there are some important differences to take note of. Enterprise units are typically equipped with higher grade / more stringently binned flash memory components. Additional flash is installed beyond the user-accessible capacity (overprovisioning), allowing for improved random write performance and greater endurance. Controller firmware is developed, optimized, and tuned for the types of workloads the drive is expected to see in use. Enterprise parts also go through more rigorous quality control testing.
When thinking through how you would test an enterprise SSD, you must first cast off the idea of running consumer-style benchmarks, which are typically performed on a partially filled drive and only apply their workload to a fraction of the available space. That is not what an enterprise SSD is designed for. It is also worth considering should you want to purchase an enterprise SSD for a system that would only ever see consumer-style workloads: the firmware tuning of enterprise parts may actually result in poorer performance in some consumer workloads. While consumer SSDs lean towards combining bursts of random writes into large sequential blocks, such operations cannot be sustained indefinitely without sacrificing long-term performance. Enterprise SSDs take the ‘slow and steady’ approach when subjected to random writes, foregoing heavy write-combining operations in the interest of maintaining more consistent IOPS and lower latencies over time. Lower sustained write latencies are vital to the datacenters employing these devices.
Transfer Size
If you have ever combed through the various reviews of a given enterprise SSD, you will first note how ‘generic’ the data is. You won’t see specific applications used very often; instead you will see only a handful of small workloads applied. These workloads are common to the specifications seen across the industry, and typically consist of 4KB and 8KB transfer sizes for random operations and 128KB transfers for sequential operations. 4KB and 8KB cover the vast majority of OLTP (on-line transaction processing) and database (typically 8KB) usage scenarios. 128KB became the de facto sequential transfer size because it meshes neatly with the maximum IO size that many OS kernels will issue to a storage device. Little known fact: Windows kernels will not issue transfer sizes larger than 128KB to a storage device. If an application makes a single 1MB request (QD=1) through the Windows API, that request is broken up by the kernel into eight 128KB sequential requests that are issued to the storage device simultaneously (QD=8, or up to the Queue Depth limit for that device). I’m sorry to break it to you, but that means any benchmark apps you might have seen reporting results at block sizes >128KB were actually causing the kernel to issue 128KB requests at inflated queue depths.
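To make that splitting behavior concrete, here is a minimal Python sketch. It is not actual kernel code; the 128KB cap and the 1MB request size are simply parameters chosen to mirror the example above.

```python
# Minimal sketch: splitting one large IO request into kernel-sized
# chunks and noting the effective queue depth that results. The
# 128 KiB cap mirrors the behavior described above; it is an assumed
# parameter here, not something queried from a real OS.

MAX_KERNEL_IO = 128 * 1024  # bytes per request the kernel will issue

def split_request(offset: int, length: int, max_io: int = MAX_KERNEL_IO):
    """Break one large request into (offset, length) chunks."""
    chunks = []
    while length > 0:
        chunk = min(length, max_io)
        chunks.append((offset, chunk))
        offset += chunk
        length -= chunk
    return chunks

# A single 1 MiB request at QD=1 from the application's point of view...
chunks = split_request(offset=0, length=1024 * 1024)
# ...becomes eight 128 KiB requests in flight at once (QD=8 at the device).
print(len(chunks), "requests of", chunks[0][1] // 1024, "KiB each")
```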
Queue Depth
Alright, now with transfer sizes out of the way, we come to another extremely important factor in testing these devices: Queue Depth (QD). Command queueing dates back to the early SCSI and ATA (pre-SATA) days. Hard Disk Drives that supported queueing (Native Command Queueing, or NCQ, in the SATA era) could coordinate with the host system, receive a short list of pending IO requests, and even fulfill those requests out of the order received. This made access to the relatively slow disk much more efficient, as the drive knew what was coming, as opposed to the old method of issuing IO requests one at a time. With optimized algorithms in the HDD firmware, NCQ can show boosts of up to 200% in random IOPS when compared to the same drive operating without a queue.

Fast forward to the introduction of SSDs. Instead of optimizing the read pattern of an HDD head pack, queueing remained useful because an SSD controller can leverage the queue to address multiple flash dies across multiple internal data channels simultaneously, greatly improving the possible throughput (especially with smaller random transfers). ATA / SATA / AHCI devices are limited to the legacy limit of 32 items in the queue (QD=32), but that is more than sufficient to saturate the now relatively limited maximum bandwidth of 6 Gbit/sec. Early PCIe SSDs with proprietary drivers could queue more deeply, and the NVMe specification was engineered to allow queue depths as high as 65,536 (2^16), with support for the same number of simultaneous queues! Having multiple queues is a powerful feature: present-day NVMe drivers typically assign one queue to each processor thread, avoiding the resource contention and context switching that would occur if all cores and threads had to share a single large queue. Realize that there are only so many flash dies and so much communication bandwidth available on a given SSD, so we won’t see SSDs operating anywhere near these new, higher queueing limits any time soon.
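For a feel of why deeper queues help at all, here is a rough back-of-the-envelope Python sketch using Little's Law (throughput ≈ outstanding IOs / average latency). The 100 µs figure is an assumed, illustrative latency rather than a measurement of any particular drive, and real drives flatten out once their dies and channels are saturated.

```python
# Rough sketch of queue depth scaling via Little's Law:
#   IOPS ≈ outstanding_ios / average_latency
# The latency below is an assumed example value, not a measured result.

avg_latency_s = 100e-6  # assumed average completion time per small random IO

for qd in (1, 4, 32, 128):
    # With qd requests in flight, the controller can overlap work across
    # flash dies and channels, so throughput scales with QD until the
    # drive's internal parallelism is exhausted.
    iops = qd / avg_latency_s
    print(f"QD={qd:<4} -> up to {iops:,.0f} IOPS (saturation ignored)")
```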
% Read / Write
Alright, so we have transfer sizes and queue depths, but we are not done. Another important variable is the percentage of reads vs. writes being applied to the device. A typical figure thrown around for databases is 70/30, meaning just under 3/4 of the workload consists of read operations. Other specs imply the ratio in their name (4KB random write = 0/100, or 0% reads). Another figure typically on this line is ‘100%’, as in ‘100% 4KB random write’. In this context, ‘100%’ is not talking about a read or write percentage; it is referring to the fact that 100% of the drive span is being accessed during the test. The span of the drive represents the range of Logical Block Addresses (LBAs) presented to the host by the SSD. Remember that SSDs are overprovisioned and have more flash installed than they make available to the host. This is one of the tricks that enables an enterprise SSD to maintain higher sustained performance compared to a consumer SSD. Consumer SSDs typically have 5-7% OP, while enterprise SSDs tend to have higher values based on their intended purpose. ‘ECO’ units designed primarily for reads may run closer to consumer levels of OP, while other units designed to handle sustained small random writes could run at 50% or higher OP. Some enterprise SSDs come with special tools that enable the system builder to dial in their own OP value based on the intended workload and the desired balance of performance and endurance.
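Both of those knobs come down to simple arithmetic. The Python sketch below shows the usual way OP is calculated and what a 70/30 mix applied over 100% of the span means in practice; the capacities and drive size are example values, not the specs of any particular product.

```python
import random

# Example values only: the capacities and the 70/30 split below are
# illustrative, not the specifications of any particular drive.

def overprovisioning_pct(raw_flash_gb: float, user_capacity_gb: float) -> float:
    """OP% = flash installed beyond the host-visible capacity, relative to it."""
    return (raw_flash_gb - user_capacity_gb) / user_capacity_gb * 100

print(overprovisioning_pct(1100, 1024))  # ~7%  (consumer-class OP)
print(overprovisioning_pct(1536, 1024))  # 50%  (write-focused enterprise OP)

# A 70/30 mix across 100% of the drive span: every LBA in the
# user-visible range is a candidate target, and 70% of requests are reads.
TOTAL_4K_BLOCKS = 400 * 1000**3 // 4096  # example 400GB drive, 4KB blocks

def next_io():
    op = "read" if random.random() < 0.70 else "write"
    return op, random.randrange(TOTAL_4K_BLOCKS)

print(next_io())
```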
Latency
Latency is not a variable we put into our testing, but it is our most important result. IOPS alone does not tell the whole story, as many datacenter workloads are very sensitive to the latency of each IO request. Imagine a system that must first read one piece of data, perform some mathematical work on it, and then save the result back to the flash. This sequential operation spends much of its time waiting on the storage subsystem, and latencies represent the time spent waiting on each of those IO requests. The revised testing and results covered in today's article are based on both average latency (next page) and a fine-grained analysis of Latency Percentiles under PACED workloads (two pages ahead).
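To show why we lean on percentiles rather than a lone average, here is a small Python sketch run against synthetic latencies (made-up numbers: mostly fast completions plus a handful of long stalls). The average barely flinches at the stalls, while the 99.99th percentile makes them obvious.

```python
import random

# Synthetic data: ~100,000 fast completions near 100 us plus 100 rare
# multi-millisecond stalls. These are invented numbers for illustration,
# not results from the drive under review.
random.seed(0)
latencies_us = [random.gauss(100, 10) for _ in range(100_000)]
latencies_us += [random.uniform(2_000, 10_000) for _ in range(100)]

def percentile(samples, pct):
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

average = sum(latencies_us) / len(latencies_us)
print(f"average : {average:.0f} us")                         # barely moved by the stalls
print(f"99.9th  : {percentile(latencies_us, 99.9):.0f} us")
print(f"99.99th : {percentile(latencies_us, 99.99):.0f} us") # exposes the stalls
```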
$0.50/GB is considered good? Was this article written in 2005?
For pci-e ssds, that is considered good.
Yeah, for SATA SSDs anything <$0.25/GB is pretty good, this is about twice that but you're also getting around twice the speeds.
Too expensive for me personally, but not unreasonable IMO.
Intel enterprise SSDs didn't launch until 2008, and did so at >$10/GB (>20x the cost).
That’s good progress, so they should begin to be viable around 2024
SSD market share has doubled for the past two years. It's expected to surpass HDD a lot sooner than 2024.
in 2005 SSDs would be more like $50/GB 🙂
For that terrible 0.7 DWPD / 5 years, I would take the 750 over this thing any day; performance-wise it's not even close to the P3700/750.
Performance is no comparison, obviously. The point of this drive is cost, which is a fraction of all parts you mentioned.
Allyn, thank you, I really like the depth of your reviews, I’m actually learning stuff!
I do not find any mention of capacitor for power loss writes. It’s a feature on which I place great importance.
Intel has among the highest, if not *the* highest power loss testing / qualification / reliability in the industry. It wasn't mentioned specifically because at this point it's just a given for their products. Here's a blurb from one of their product briefings:
They also bombard their drives with radiation (from an accelerator) until they hang, restart them, and ensure no data was corrupted. Their testing is pretty crazy, and that's why their products typically run higher in cost compared to others, but you get what you pay for.
Many think of inflight data protection only as a safety issue, but it is also a significant performance issue. Without inflight data protection, the use of inflight data must be turned off in the OS (it may be called something like 'write cache') to avoid data corruption in case of power failure, which in turn significantly lowers write speed.
So the point of inflight data protection or the lack of it should be hammered home in every review until it gets the warranted attention.
There are lots of layers of what would/could be considered 'in-flight'. Even with all caching disabled, the mere fact that writes are queued could be considered so, as they are technically buffered by the kernel. To strip all the way down to zero buffering would reduce the performance of *most* SSDs to painful levels, as you'd have to limit to QD=1 and disable all OS buffers.
This protection, as defined by SSD makers, is a guarantee that the data that has been received by the controller at the point of power loss will be retained and available at next power up. Host / OS-side buffers will naturally not be included here.
Very excited about P3520 especially in U.2 2.5″ format. This kind of pricing should really increase the viability (economically speaking) of big top-of-rack all flash arrays.
Not sure if you mentioned in the review but has Intel made any mention of dual-port U.2 version?
No mention of dual port for this one, but I'd guess once 3D rolls out to other models in their lineup, it will include dual port.
So, let me make sure I understand. This SSD is not tested against any other product, yet receives an Editor's Choice. I smell something.
What you smell is no other products competing at this low of a cost/GB. Other companies are welcome to sample us their competing products (we ask them often).
It was pretty well-explained why…
what about raid 0 on 4 of these
We are thinking of using the P3520 or P3500 in a Supermicro 48-bay NVMe server. The P3500 might be quicker, but these will probably already move the bottleneck to the interface… Will have a look to see if you benchmarked the P3500 before…
Going to try out three of the 1.2TB P3520’s for the hot tier in a three node hyperconverged environment. It’d be interesting to know what sort of benchmark would be relevant for comparison purposes on that kind of platform, since the workload mix could look like practically anything.
Yes it would, trying to set up benchmarks simulating that kind of environment is not simple. Let us know how it goes as it could be very interesting.