Enterprise SSD Testing and Jargon
While enterprise SSDs typically share controller and flash memory architecture with the consumer products shipping from the same company, there are some important differences to take note of. Enterprise units are typically equipped with higher grade / more stringently binned flash memory components. Additional flash is installed beyond what is made available to the host (overprovisioning), allowing for improved random write performance and greater endurance. Controller firmware is developed, optimized, and tuned for the types of workloads the drive is expected to see in service. Finally, enterprise parts go through more rigorous quality control testing.
When thinking through how you would test an enterprise SSD, you must first cast off the idea of running consumer-style benchmarks, which are typically performed on a partially filled drive and apply their workload to only a fraction of the available space. That is not what an enterprise SSD is designed for, and it is also worth considering should you want to purchase an enterprise SSD for a system that will only ever see consumer-style workloads – the firmware tuning of enterprise parts may actually result in poorer performance in some consumer workloads. Consumer SSDs lean towards combining bursts of random writes into large sequential blocks, but such operations cannot be sustained indefinitely without sacrificing long-term performance. Enterprise SSDs take the ‘slow and steady’ approach when subjected to random writes, foregoing heavy write-combining operations in the interest of maintaining more consistent IOPS and lower latencies over time. Lower sustained write latencies are vital to the datacenters employing these devices.
Transfer Size
If you have ever combed through the various reviews of a given enterprise SSD, you will first note how ‘generic’ the data is. You won’t see specific applications used very often – instead you will see only a handful of small workloads applied. These workloads are common to the specifications seen across the industry, and typically consist of 4KB and 8KB transfer sizes for random operations and 128KB for sequential operations. 4KB and 8KB cover the vast majority of OLTP (on-line transaction processing) and database (typically 8K) usage scenarios. 128KB became the de facto maximum transfer size because it meshes neatly with the maximum IO size that many OS kernels will issue to a storage device. Little known fact: Windows kernels will not issue transfer sizes larger than 128KB to a storage device. If an application makes a single 1MB request (QD=1) through the Windows API, that request is broken up by the kernel into eight 128KB sequential requests that are issued to the storage device simultaneously (QD=8, or up to the Queue Depth limit for that device). I’m sorry to break it to you, but that means any benchmark apps you might have seen reporting results at block sizes >128KB were actually causing the kernel to issue 128KB requests at inflated queue depths.
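To make that splitting behavior concrete, here is a minimal Python sketch. It is purely illustrative – this is not Windows kernel code, and the split_request helper is a hypothetical name used only for this example:

# Illustrative sketch only - not Windows kernel code. It shows how one large
# request can be decomposed into 128KB chunks that are dispatched together,
# inflating the effective queue depth seen by the SSD.

MAX_TRANSFER = 128 * 1024  # the 128KB per-request limit described above

def split_request(offset, length, max_transfer=MAX_TRANSFER):
    """Break one large IO into (offset, length) chunks no larger than max_transfer."""
    chunks = []
    while length > 0:
        chunk = min(length, max_transfer)
        chunks.append((offset, chunk))
        offset += chunk
        length -= chunk
    return chunks

# A single 1MB read at QD=1 from the application's point of view...
chunks = split_request(0, 1024 * 1024)
print(len(chunks))  # 8 chunks of 128KB
# ...reaches the device as 8 outstanding 128KB requests (QD=8, or capped
# at the queue depth limit of that device).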
Queue Depth
Alright, now with transfer sizes out of the way, we come to another extremely important factor in testing these devices: the Queue Depth (QD). Command queueing dates back to the early SCSI and ATA (pre-SATA) days. Hard Disk Drives that supported Native Command Queueing (NCQ) could coordinate with the host system, receive a short list of the pending IO requests, and even fulfill those requests out of the order received. This made access to the relatively slow disk much more efficient, as the drive knew what was coming, as opposed to the old method of issuing IO requests one at a time. With optimized algorithms in the HDD firmware, NCQ can show boosts of up to 200% in random IOPS when compared to the same drive operating without a queue.
Fast forward to the introduction of SSDs. While there is no longer an HDD head stack whose read pattern needs optimizing, queueing remains useful, as an SSD controller can leverage the queue to address multiple flash dies across multiple internal data channels simultaneously, greatly improving throughput (especially with smaller random transfers). ATA / SATA / AHCI devices are limited to the legacy limit of 32 items in the queue (QD=32), but that is more than sufficient to saturate the now relatively limited maximum bandwidth of 6Gbit/sec. PCIe (AHCI) devices can go higher, and the NVMe specification was engineered to allow queue depths as high as 65536 (2^16), with support for the same number of simultaneous queues! Having multiple queues is a powerful feature, as it helps to minimize excessive context switching across processor cores. Present-day NVMe drivers typically assign one queue to each processor thread, minimizing the resource / context switching that would occur if all cores and threads had to share a single large queue. Realize that there are only so many flash dies and so much communication bandwidth available on a given SSD, so we won’t see SSDs operating near these new, higher queueing limits any time soon.
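To put rough numbers on why queue depth matters (and why those new limits are overkill), here is a back-of-the-envelope Python sketch based on Little's Law. The 100 microsecond latency figure and the estimated_iops helper are assumptions for illustration, not measured results from this review:

# Back-of-the-envelope only (assumed latency, hypothetical helper name).
# Little's Law ties queue depth, latency, and throughput together:
#   IOPS ~= outstanding IOs (queue depth) / average latency (in seconds)

def estimated_iops(queue_depth, avg_latency_us):
    """Estimate IOPS assuming the device can keep the entire queue busy."""
    return queue_depth / (avg_latency_us / 1_000_000)

for qd in (1, 4, 32, 256):
    # 100 microseconds stands in for a hypothetical 4KB random read latency
    print(f"QD={qd:>3}: ~{estimated_iops(qd, 100):,.0f} IOPS")

# Scaling stops once every flash die and channel is already busy, which is
# why drives saturate long before NVMe's 65536-entry queues come into play.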
% Read / Write
Alright, so we have transfer sizes and queue depths, but we are not done. Another important variable is the percentage of reads vs. writes being applied to the device. A typical figure thrown around for databases is 70/30, meaning 70% of the workload consists of read operations. Other specs imply the ratio (4KB random write = 0/100, or 0% reads). Another figure typically on this line is ‘100%’, as in ‘100% 4KB random write’. In this context, ‘100%’ is not talking about a read or write percentage – it refers to the fact that 100% of the drive span is being accessed during the test. The span of the drive represents the range of Logical Block Addresses (LBAs) presented to the host by the SSD. Remember that SSDs are overprovisioned and have more flash installed than they make available to the host. This is one of the tricks that enables an enterprise SSD to maintain higher sustained performance as compared to a consumer SSD. Consumer SSDs typically have 5-7% OP, while enterprise SSDs tend to have higher values based on their intended purpose. ‘ECO’ units designed primarily for reads may run closer to consumer levels of OP, while units designed to handle sustained small random writes can run at 50% or higher OP. Some enterprise SSDs come with special tools that enable the system builder to dial in their own OP value based on the intended workload and desired performance and endurance.
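As a quick worked example of the overprovisioning math, here is a small Python sketch. The capacities below are hypothetical round numbers, not the specs of any particular drive:

# Quick arithmetic sketch - hypothetical capacities, not from the review.
# Overprovisioning is the raw flash held back beyond the user-visible LBA span.

def op_percent(raw_gb, user_gb):
    """Overprovisioning: extra raw flash relative to the user-visible capacity."""
    return (raw_gb - user_gb) / user_gb * 100

print(f"{op_percent(512, 480):.1f}% OP")  # ~6.7% - typical consumer drive
print(f"{op_percent(512, 400):.1f}% OP")  # ~28%  - mixed-workload enterprise part
print(f"{op_percent(512, 340):.1f}% OP")  # ~50%  - tuned for sustained small random writes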
Latency
Latency is not a variable we put into our testing, but it is our most important result. IOPS alone does not tell the whole story, as many datacenter workloads are very sensitive to the latency of each IO request. Imagine a system that must first read a piece of data, perform some mathematical work on it, and then save the result back to flash. This sequential operation spends much of its time waiting on the storage subsystem, and latency represents the amount of time waited for each of those IO requests. The testing and results covered in today's article are based on average latency; however, we are collecting more detailed data from our current tests and will be focusing on and charting that statistical data in future reviews.
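To illustrate why averages alone can hide what matters, here is a small Python sketch. The latency samples and the percentile helper below are made up for the sake of the example, not data from this review:

import statistics

# Made-up sample data: two drives with identical average latency can still
# have very different worst-case behavior, which percentile charts expose.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

steady = [100] * 95 + [120] * 5   # hypothetical per-IO latencies (microseconds)
spiky = [80] * 95 + [500] * 5

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name,
          f"avg={statistics.mean(samples):.0f}us",
          f"p99={percentile(samples, 99)}us",
          f"max={max(samples)}us")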
@Allyn Any chance that Intel will release a 800 GB version of the P3608, in order to lower it to a more affordable price point for the enthusiast?
More than likely they will not, as the P3608 is meant to get higher densities into smaller spaces. It would also limit each 'half' to only 400GB, which would offer performance close to that of the 800GB P3600 in the first place.
Regarding use by enthusiasts, I would highly recommend going the new 800GB SSD 750 route (or a pair of 400s in RST RAID). The 750 Series uses the same controller but has its enterprise temperature monitoring features disabled – those features interfered with many desktop-class BIOSes and caused memory contention / address conflict issues. The firmware is also more optimized for desktop / consumer workloads.
I’m not even mad I can’t afford one.
I’m just sitting here admiring the nice pictures. Making neat graphs like these should be performance art with tours of live shows. You rock, Allyn!
Thanks for the kudos! We're working hard on how we present this data, and will continue to improve on these charts.
Interesting review, but not exactly a PC part. Giving it a gold award seems a bit pointless. No PC enthusiast should buy this part, or really anything in the DC P3xxx line. It is interesting to know what is going on in the enterprise market, since that tech will filter down to the PC market eventually, if it is something that is actually useful there. I don’t know if devices like this will have a place in the PC market before they are displaced by other technology, though.
I realize that the site is called PC Perspective, but this is an enterprise review. A handful of sites cover both PC and enterprise storage devices. For the moment, we are doing it without spinning off another site or brand. With Intel's RST for Z170 NVMe devices and RSTe to bridge both halves of this device, you're correct that it may filter down to the PC market. Actually, the same RST tech can currently RAID SSD 750s (not RAIDed for that piece, but it is now possible).
OK – I really need your help. I have a 1.6TB P3608 and have it installed on an X99 chipset motherboard. I have tried every version of RSTe I can find and I can't for the life of me get the P3608 to detect in RSTe. The P3608 shows up fine in Disk Manager and I can even set up a RAID from within Disk Manager (albeit at the expense of being able to TRIM the array).
Can you please explain which version of RSTe driver and UI you used?
Any word on pricing? Not that I could ever afford one – I'm pretty sure it's more expensive than the rest of my PC. (PS, I know it's for data centers and not for a regular enthusiast, but damn I want it so bad.)
Nothing new when you look at the “ordinary” P3600. I was expecting a lame PLX chip, as it is a much cheaper way than actually making two SSDs work in tandem without a lane switcher on the same card. Sadly, no hardware RoC is available for NVMe ATM.
While the review is interesting from a raw performance standpoint, it is not relevant at all to the PC market, as the 3608 is purely server-grade, industrial storage that will never reach the enthusiast market – at least not in this shape. I'm more interested in what you hinted at above about the seriously more expensive P3700.
Allyn, have you tested that setup in RAID1/10 (if you have two)? I would be interested in how much of a hit NVMe takes on writes with this setup vs. classic NAND AHCI. R0 is a pointless exercise from my point of view. Redundancy over performance any day of the week.
We reviewed the P3700 before. I've run the workload on the P3608 and on both P3700's in a RAID-0. RSTe had no issue pegging all drives on sequentials (10 GB/sec reads), but you need to throw more cores at it for random IO as compared to addressing the drives individually. More detail on the level of overhead will come in the next piece covering RSTe, as there is a lot of data I need to compile for it. I might include RAID-10 data in that piece as well if the testbed is still assembled when I'm back at the office next week.
I see you recommend the Intel 750 800GB for the pro-sumers out there. Would you recommend it over the new Samsung 950 Pro 512GB coming out next month?
Those two different SSDs are going to have their own use cases. The 950 PRO will only be available in M.2 and at 512GB max (initially), while the SSD 750 is available in 800GB and 1.2TB. The 950 PRO should be a lower cost, but those without an M.2 slot will need an adapter. I think they will be close enough on performance that it will boil down more to fitment and cost.
Why would they use a PEX8718 chip? You don’t need 16x PCIe 3.0. 8X would suffice.
The chip has 16 PCIe lanes *total*, some of which need to connect to the controllers. This one is configured to send 8 lanes to the host and 4 lanes to each controller. 8+4+4 = 16.
Thanks for clarification. That makes more sense now.
Those graphs man, hard to wrap my head around some of them
Lots of data in a small space, but if you know what your specific workload is, I think they get the job done.
just Wow !
we’re putting together a unix rig to sit in a data center and just compute 24/7. as many cores as we can afford, dual gpu nvidia and 64 gigs ram.
programmer is deranged by 3608 for boot (and everything really) and i want to make sure lanes are sufficient.
any 2011v3 boards stand out for this use?