Five Intel SSD 750s Tested – Two Million IOPS and 10 GB/sec – Achievement Unlocked!

Two Million IOPS and 10 GB/sec of SSD Goodness!

It's been a while since we reviewed Intel's SSD 750 PCIe NVMe fire-breathing SSD, and since that launch we more recently had some giveaways and contests. We got the prizes in to be sent out to the winners, but before that happened, we had this stack of hardware sitting here. It just kept staring down at me (literally – this is the view from my chair):

That stack of 5 Intel SSD 750’s was burning itself into my periphery as I worked on an upcoming review of the new Seiki Pro 40” 4K display. A few feet in the other direction was our CPU testbed machine, an ASUS X99-Deluxe with a 40-lane Intel Core i7-5960X CPU installed. I just couldn't live with myself if we sent these prizes out without properly ‘testing’ them first, so then this happened:

This will not be a typical complete review, as this much hardware in parallel is not realistically comparable to even the craziest power user setup. It is more just a couple of hours of playing with an insane hardware configuration and exploring the various limits and bottlenecks we were sure to run into. We’ll do a few tests in some different configurations and let you know what we found out.

First order of business was to see if the hardware would even be properly recognized. Amazingly enough, it was. The X99-Deluxe had no issue seeing all five SSD 750’s and Windows 8.1 enumerated them without issue. We had the Intel NVMe driver installed from previous testing, and all five devices automatically used that driver instead of the Windows 8.1 In-Box driver. No issues there, so then it was on to some simple RAID testing. I set up a quick Windows RAID-0 via Disk Management, selecting 4K as the block size to match the 750’s ‘sweet spot’. Here is a quick ATTO test result from that configuration:

Don’t get me wrong, hitting over 4 GB/s is nothing to sneeze at, but these SSDs are rated for 900 MB/sec writes and 2.2 GB/sec reads, and we have five of them, so what gives?!?! Storage Spaces was even worse than the above result due to the way it allocates relatively large blocks of storage across devices in the pool. Well, the first answer is that software-based RAID is not the best in terms of performance gains when stacking multiple fast devices, and we are currently forced to use such a solution when working with PCIe devices since they link directly to the CPU with no other RAID-capable in-between. This is a luxury that only SATA and SAS devices can currently employ – or is it?
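For a sense of how far off that ATTO result is, here is a quick back-of-the-envelope sketch in Python of what perfect linear scaling would look like. It uses only the rated per-drive figures quoted above; the function name is made up for illustration, and no real array scales this cleanly.

```python
# Rated per-drive figures quoted in the article, in MB/sec.
RATED_WRITE_MB_S = 900
RATED_READ_MB_S = 2200
DRIVE_COUNT = 5


def ideal_aggregate_gb_s(per_drive_mb_s: int, drives: int) -> float:
    """Ideal aggregate throughput in GB/sec, assuming perfect linear scaling."""
    return per_drive_mb_s * drives / 1000


ideal_write = ideal_aggregate_gb_s(RATED_WRITE_MB_S, DRIVE_COUNT)
ideal_read = ideal_aggregate_gb_s(RATED_READ_MB_S, DRIVE_COUNT)

print(f"ideal aggregate write: {ideal_write:.1f} GB/sec")
print(f"ideal aggregate read:  {ideal_read:.1f} GB/sec")
```

On paper that is 4.5 GB/sec of writes and 11 GB/sec of reads across five drives, so a read result only a little over 4 GB/sec leaves a lot on the table.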

While the Windows RAID solution adds a bunch of software layers that limit the ultimate performance of this sort of configuration, what if we had a piece of software that spoke to the SSDs individually, and in parallel? An application meant to demand the most of SSDs would theoretically be coded in such a way, and we can configure Iometer to do the same sort of thing by delegating workers across multiple storage devices. Each worker acts in its own thread, and this can simulate a multi-threaded application hitting multiple storage devices with multiple simultaneous IO requests.
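To make the idea concrete, here is a minimal Python sketch of that worker-per-device pattern. This is not Iometer and not a benchmark: small temporary files stand in for the SSDs, the worker counts and sizes are assumptions picked for illustration, and Python threads will not get anywhere near what a native tool can push. It only shows the structure of several independent workers each issuing 4K random reads against its own target.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096        # 4K IO size
FILE_BLOCKS = 256        # each stand-in "device" is a 1 MiB file
READS_PER_WORKER = 1000  # arbitrary, just enough to do some work


def make_target(directory: str, index: int) -> str:
    """Create a small file that stands in for one storage device."""
    path = os.path.join(directory, f"target{index}.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(BLOCK_SIZE * FILE_BLOCKS))
    return path


def worker(path: str, seed: int) -> int:
    """Issue block-aligned random 4K reads against one target; return bytes read."""
    rng = random.Random(seed)
    total = 0
    with open(path, "rb", buffering=0) as f:
        for _ in range(READS_PER_WORKER):
            f.seek(rng.randrange(FILE_BLOCKS) * BLOCK_SIZE)
            total += len(f.read(BLOCK_SIZE))
    return total


def run(num_targets: int = 4, workers_per_target: int = 4) -> int:
    """Run every worker in its own thread and return the total bytes read."""
    with tempfile.TemporaryDirectory() as directory:
        targets = [make_target(directory, i) for i in range(num_targets)]
        # Repeat the target list so each target gets workers_per_target workers.
        jobs = list(enumerate(targets * workers_per_target))
        with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
            return sum(pool.map(lambda job: worker(job[1], job[0]), jobs))


total_bytes = run()
print(f"read {total_bytes} bytes across all workers")
```

The point of the design is that no worker waits on another: each thread owns its own file handle and its own stream of requests, which is the same reason this approach sidesteps the software RAID layer.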

I had initially planned to run some Iometer tests with the Windows RAID established, but watching the system behave during the test file creation process revealed a configuration / bottleneck issue. Low speeds in the Iometer file creation process are expected as it is a low queue depth operation, but disk 5 was going half the speed of the others, with roughly double the active time, suggesting a slower link than the others. Time to hit the books:

Slot 2 was the 'slow' slot and was being enumerated by the OS as disk #5, which seemed odd at first but now made sense, since that slot is not connected directly to the CPU's PCIe 3.0 lanes. While we are only using 20 lanes worth of connectivity for our five PCIe 3.0 x4 devices, the X99-Deluxe is not designed around this sort of connectivity. It’s a game of design challenges and compromises when using a limited resource, and even though 40 PCIe lanes may seem like a lot, they can get gobbled up quickly by slots and integrated devices. The typical use case for that board would be to provide sets of 8 or 16 lanes to as many of the slots as possible, with priority in linking those lanes directly to the CPU (even if we are only installing x4 devices). This means we have slots 1 and 3-6 all going straight to the CPU, with ASUS choosing to take some of the leftover lanes from the Southbridge and diverting them to slot 2. There is still a catch, as even those slower PCIe 2.0 lanes are shared with one of the USB 3.0 controllers and one of the SATA Express ports. Using slot 2 kills those other functions unless overridden in the BIOS. End result: one of our SSD 750’s is communicating with the system via the Southbridge, which links to the CPU via DMI. While DMI can go ~2 GB/sec, the link is shared with other devices, and in this configuration the best we saw from the fifth SSD 750 was just over 800 MB/sec:

That’s actually a respectable figure considering we were jumping through so many hoops to get data to that slot. DMI appeared to be the limit here, with obvious added latency and possible bandwidth sharing with other devices hanging off of the DMI, negatively impacting the ultimate IOPS we can see from an SSD 750 connected via this path.
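To put rough numbers on those links, here is a small sketch of the theoretical per-direction bandwidth of an x4 link at each PCIe generation. The signaling rates and encodings are the standard published values (5 GT/s with 8b/10b for PCIe 2.0, 8 GT/s with 128b/130b for PCIe 3.0). The function name is hypothetical, and packet and protocol overhead is ignored, so real transfers will land lower than these ceilings.

```python
# Per-lane signaling rate (GT/s) and line-encoding efficiency by PCIe generation.
PCIE_SPECS = {
    2: (5.0, 8 / 10),     # PCIe 2.0: 8b/10b encoding
    3: (8.0, 128 / 130),  # PCIe 3.0: 128b/130b encoding
}


def link_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction link bandwidth in GB/sec, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_SPECS[gen]
    return rate_gt_s * efficiency / 8 * lanes  # divide by 8: bits to bytes


gen3_x4 = link_gb_s(3, 4)
gen2_x4 = link_gb_s(2, 4)

print(f"PCIe 3.0 x4: {gen3_x4:.2f} GB/sec")
print(f"PCIe 2.0 x4: {gen2_x4:.2f} GB/sec")
```

A PCIe 2.0 x4 link tops out at about 2 GB/sec before overhead, which is already below the 2.2 GB/sec rated read speed of the SSD 750, so the drive in slot 2 was never going to keep pace with the others even before counting the shared DMI hop.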

With that limit established, I tried swapping the GPU into slot 2 and giving the fifth SSD 750 the primary slot, but the system simply would not have it. This meant I would be scaling back the remainder of the random IO portion of the experiment to use just four of the five SSD 750’s, which actually came in handy as we only have 16 threads available with the i7-5960X’s hyperthreaded 8 cores. I whipped up a quick Iometer configuration, allocating one worker per SSD for sequential performance and four workers to each SSD 750 for random IO testing and let ‘er rip. Here are the results:

128K sequential writes, QD32 (per SSD), 4x SSD 750:

The conditioning pass took place at a steady 5 GB/sec across all five units. This speed did not drop even after 'lapping' the drives (writing over areas already written).

128K sequential reads, QD32 (per SSD), 4x SSD 750:

Just over 9.5 GB/sec, but can we go faster?

128K sequential reads, QD32 (per SSD), 5x SSD 750 (4x PCIe 3.0 x4 / 1x PCIe 2.0 x4):

While the fifth SSD cannot reach the full speed of the others, it still helps nudge us past the 10 GB/sec point.

4K random reads, QD128 (per SSD), 4x SSD 750:

Nearly 1.8 million 4K IOPS. Definitely good, but I still think this CPU is capable of just a bit more.

You may be wondering why I didn’t include random IO for all five SSDs. This was because the additional unit connected over a higher latency link, combined with diverting CPU threads to that slower device, slowed the overall result and made it far more inconsistent. With this testing out of the way, it looked like just under 1.8 million IOPS was the highest 4K random figure we would hit with four fully conditioned (i.e. all sectors written at least once) SSDs. We were pegging the CPU cores (all overclocked to 4.5 GHz) at the same time we were hitting the maximum throughput capability of the SSDs, so we couldn’t definitively make the call on which was a bottleneck. I believe that at this point *both* were contributing to the bottleneck, but I can say that we were able to squeeze just over 2 million IOPS out of this configuration before the SSDs were conditioned (fully written):

The above result only needed QD64 per SSD to achieve because the SSDs were unconditioned and were more able to exceed their rated 430,000 IOPS per SSD. Before conditioning, the SSD 750’s were simply returning null data (Deterministic Read Zero after Trim – DZAT) since we were trying to read data that had never been written. Since the 750’s did not have to do as much legwork to respond to the IO requests, this shifted the bottleneck more towards the CPU and Windows kernel’s ability to handle them. This means that based on what we’ve seen here, a Core i7-5960X overclocked to 4.5 GHz pegs at ~2 million IOPS to NVMe devices under Windows 8.1.
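For scale, here is a rough conversion of those IOPS figures into throughput and per-device load. This is a sketch with rounded inputs: 1.8 million and 2 million are approximations of the results above, 4K is assumed to mean 4096 bytes, GB is taken as 10^9 bytes, and the helper name is mine.

```python
IO_SIZE_BYTES = 4096   # assuming "4K" means 4096 bytes
SSD_COUNT = 4
LOGICAL_THREADS = 16   # 8 hyperthreaded cores
RATED_IOPS_PER_SSD = 430_000


def iops_to_gb_s(iops: int) -> float:
    """Convert an IOPS figure at 4K into GB/sec (10^9 bytes per GB)."""
    return iops * IO_SIZE_BYTES / 1e9


results = {"conditioned": 1_800_000, "unconditioned": 2_000_000}
summary = {}
for label, iops in results.items():
    summary[label] = {
        "gb_s": iops_to_gb_s(iops),
        "per_ssd": iops / SSD_COUNT,
        "per_thread": iops / LOGICAL_THREADS,
    }
    row = summary[label]
    print(f"{label}: {row['gb_s']:.2f} GB/sec, "
          f"{row['per_ssd']:,.0f} IOPS per SSD, "
          f"{row['per_thread']:,.0f} IOPS per thread")
```

At roughly 2 million IOPS, each SSD is answering about 500,000 requests per second, comfortably past the 430,000 rating, and each of the 16 logical threads is handling around 125,000 per second.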

Before we close, let’s look at another couple of fun pieces of data. Here’s what the CPU cores look like ramping up from QD=1 per worker to QD=32 per worker, exponentially (6 steps):

As you can see, the Windows scheduler is intelligent about allocating threads, spreading the load across the eight cores for as long as possible before it ramps up the second thread of each of those cores.
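The ramp itself is easy to reconstruct. Here is a minimal sketch, assuming the same layout as the random IO runs (four SSDs, four workers per SSD) and a simple doubling at each of the six steps; the variable names are mine.

```python
SSDS = 4
WORKERS_PER_SSD = 4
STEPS = 6

# Queue depth doubles at each step: 1, 2, 4, ... per worker.
per_worker_qd = [2 ** step for step in range(STEPS)]
per_ssd_qd = [qd * WORKERS_PER_SSD for qd in per_worker_qd]
total_outstanding = [qd * SSDS for qd in per_ssd_qd]

for worker_qd, ssd_qd, total in zip(per_worker_qd, per_ssd_qd, total_outstanding):
    print(f"QD {worker_qd:>2} per worker -> QD {ssd_qd:>3} per SSD "
          f"-> {total:>3} IOs in flight")
```

Under those assumptions the final step works out to QD128 per SSD, the same depth used for the 4K random read test, and the QD64 per SSD that was enough for the unconditioned drives corresponds to QD16 per worker.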

Finally, here’s what the set of four SSD 750’s looks like running our Web Server workload. Realize this is not an apples to apples comparison as we are on a higher clock CPU with the workload configured across 16 workers (we historically ran this test with only one worker). The charted line for the new data is shifted to the right as a 16x multiplier is effectively present in the resulting data:

The high number there is only 519k IOPS, but our Web Server test is far more demanding than a simple 4K random workload, with some of the IO falling below the 4K sweet spot for the SSD 750. While not as high as the figures seen earlier in this review, this quad of storage is certainly capable of some impressive figures when compared to the others in that same chart.

Finally, here are the sequential and random read results in chart form:

So there you have it. Hopefully you learned a thing or two from this exercise. I know I did! Interesting things to consider when planning out your next insane storage system!