Five Intel SSD 750s Tested – Two Million IOPS and 10 GB/sec – Achievement Unlocked!
Two Million IOPS and 10 GB/sec of SSD Goodness!
It's been a while since we reviewed Intel's SSD 750 PCIe NVMe fire-breathing SSD, and since that launch we have more recently run some giveaways and contests. We got the prizes in to be sent out to the winners, but before that happened, we had this stack of hardware sitting here. It just kept staring down at me (literally – this is the view from my chair):
That stack of five Intel SSD 750s was burning itself into my periphery as I worked on an upcoming review of the new Seiki Pro 40” 4K display. A few feet in the other direction sat our CPU testbed machine, an ASUS X99-Deluxe with a 40-lane Intel Core i7-5960X CPU installed. I just couldn't live with myself if we sent these prizes out without properly ‘testing’ them first, so then this happened:
This will not be a typical complete review, as this much hardware in parallel is not realistically comparable to even the craziest power user setup. It is more just a couple of hours of playing with an insane hardware configuration and exploring the various limits and bottlenecks we were sure to run into. We’ll do a few tests in some different configurations and let you know what we found out.
First order of business was to see if the hardware would even be properly recognized. Amazingly enough, it was. The X99-Deluxe had no issue seeing all five SSD 750s, and Windows 8.1 enumerated them without issue. We had the Intel NVMe driver installed from previous testing, and all five devices automatically used that driver instead of the Windows 8.1 in-box driver. No issues there, so it was on to some simple RAID testing. I set up a quick Windows RAID-0 via Disk Management, selecting 4K as the block size to match the 750’s ‘sweet spot’. Here is a quick ATTO test result from that configuration:
Don’t get me wrong, hitting over 4 GB/s is nothing to sneeze at, but these SSDs are rated for 900 MB/sec writes and 2.2 GB/sec reads, and we have five of them, so what gives?! Storage Spaces was even worse than the above result due to the way it allocates relatively large blocks of storage across devices in the pool. Well, the first answer is that software-based RAID is not the best in terms of performance gains when stacking multiple fast devices, and we are currently forced to use such a solution when working with PCIe devices, since they link directly to the CPU with no RAID-capable hardware in between. This is a luxury that only SATA and SAS devices can currently employ – or is it?
While the Windows RAID solution adds a bunch of software layers that limit the ultimate performance of this sort of configuration, what if we had a piece of software that spoke to the SSDs individually, and in parallel? An application meant to demand the most of SSDs would theoretically be coded in such a way, and we can configure Iometer to do the same sort of thing by delegating workers across multiple storage devices. Each worker acts in its own thread, and this can simulate a multi-threaded application hitting multiple storage devices with multiple simultaneous IO requests.
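To illustrate that worker-per-device idea (this is not Iometer itself, which is a native benchmark – just a toy Python sketch using temp files as stand-ins for raw devices), one thread per target, each issuing its own stream of 4K reads, looks something like this:

```python
# Toy sketch of Iometer's worker-per-target model: one thread per device,
# each issuing 4K reads independently. Temp files stand in for the SSDs.
import os
import tempfile
import threading

BLOCK = 4096          # 4K transfer size (the SSD 750's "sweet spot")
IOS_PER_WORKER = 256  # IOs each worker issues in this toy run
FILE_BLOCKS = 64      # size of each stand-in target, in blocks

def make_target():
    fd, path = tempfile.mkstemp()
    os.write(fd, os.urandom(BLOCK * FILE_BLOCKS))
    os.close(fd)
    return path

def worker(path, results, idx):
    done = 0
    with open(path, "rb") as f:
        for i in range(IOS_PER_WORKER):
            f.seek((i * BLOCK) % (BLOCK * FILE_BLOCKS))
            if f.read(BLOCK):
                done += 1
    results[idx] = done

targets = [make_target() for _ in range(4)]   # stand-ins for four SSDs
results = [0] * len(targets)
threads = [threading.Thread(target=worker, args=(p, results, i))
           for i, p in enumerate(targets)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))   # → 1024 total IOs (4 workers x 256 each)
for p in targets:
    os.unlink(p)
```

Each worker runs in its own thread against its own target, which is the same structure Iometer uses to keep multiple devices busy with simultaneous IO.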
I had initially planned to run some Iometer tests with the Windows RAID established, but watching the system behave during the test file creation process revealed a configuration / bottleneck issue. Low speeds in the Iometer file creation process are expected, as it is a low queue depth operation, but disk 5 was going half the speed of the others, with roughly double the active time, suggesting a slower link than the others. Time to hit the books:
Slot 2 was the 'slow' slot and was being enumerated by the OS as disk #5, which seemed odd at the time but now made sense, since it is not connected directly to the CPU's PCIe 3.0 lanes. While we are only using 20 lanes worth of connectivity for our five PCIe 3.0 x4 devices, the X99-Deluxe is not designed around this sort of connectivity. It’s a game of design challenges and compromises when allocating a limited resource, and even though 40 PCIe lanes may seem like a lot, they can get gobbled up quickly by slots and integrated devices. The typical use case for this board is to provide sets of 8 or 16 lanes to as many of the slots as possible, with priority on linking those lanes directly to the CPU (even if we are only installing x4 devices). This means slots 1 and 3-6 all go straight to the CPU, with ASUS choosing to take some of the leftover lanes from the Southbridge and divert them to slot 2. There is still a catch, as even those slower PCIe 2.0 lanes are shared with one of the USB 3.0 controllers and one of the SATA Express ports – using slot 2 kills those other functions unless overridden in the BIOS. End result: one of our SSD 750s communicates with the system via the Southbridge, which links to the CPU over DMI. While DMI can move ~2 GB/sec, the link is shared with other devices, and in this configuration the best we saw from the fifth SSD 750 was just over 800 MB/sec:
That’s actually a respectable figure considering we were jumping through so many hoops to get data to that slot. DMI appeared to be the limit here, with obvious added latency and possible bandwidth sharing with other devices hanging off of the DMI, negatively impacting the ultimate IOPS we can see from an SSD 750 connected via this path.
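For reference, the raw link ceilings in play here can be worked out from the line rates and encodings – a quick sketch (pre-protocol-overhead numbers only), using 8b/10b for PCIe 2.0 and 128b/130b for PCIe 3.0:

```python
# Raw PCIe link bandwidth from line rate and encoding overhead.
# PCIe 2.0: 5 GT/s, 8b/10b  -> 10 bits on the wire per byte of payload.
# PCIe 3.0: 8 GT/s, 128b/130b -> 130 bits per 16 bytes = 8.125 bits per byte.
def pcie_link_bandwidth(gen, lanes):
    """Raw link bandwidth in MB/s, before packet/protocol overhead."""
    rate_gt, bits_per_byte = {2: (5.0, 10.0), 3: (8.0, 130 / 16)}[gen]
    return rate_gt * 1000 / bits_per_byte * lanes

print(pcie_link_bandwidth(2, 4))   # → 2000.0 MB/s for a PCIe 2.0 x4 slot
print(pcie_link_bandwidth(3, 4))   # ≈ 3938.5 MB/s for a PCIe 3.0 x4 slot
# The ~800 MB/sec we saw from the fifth SSD is well under even a 2.0 x4
# link, pointing at the shared DMI uplink as the real ceiling here.
```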
With that limit established, I tried swapping the GPU into slot 2 and giving the fifth SSD 750 the primary slot, but the system simply would not have it. This meant I would be scaling back the remainder of the random IO portion of the experiment to use just four of the five SSD 750s, which actually came in handy, as we only have 16 threads available with the i7-5960X’s hyperthreaded 8 cores. I whipped up a quick Iometer configuration, allocating one worker per SSD for sequential performance and four workers per SSD 750 for random IO testing, and let ‘er rip. Here are the results:
128K sequential writes, QD32 (per SSD), 4x SSD 750:
The conditioning pass took place at a steady 5 GB/sec across all five units. This speed did not drop even after 'lapping' the drives (writing over areas already written).
128K sequential reads, QD32 (per SSD), 4x SSD 750:
Just over 9.5 GB/sec, but can we go faster?
128K sequential reads, QD32 (per SSD), 5x SSD 750 (4x PCIe 3.0 x4 / 1x PCIe 2.0 x4):
While the fifth SSD cannot reach the full speed of the others, it still helps nudge us past the 10 GB/sec point.
4K random reads, QD128 (per SSD), 4x SSD 750:
Nearly 1.8 million 4K IOPS. Definitely good, but I still think this CPU is capable of just a bit more.
You may be wondering why I didn’t include random IO for all five SSDs. This was because the additional unit connected over a higher latency link, combined with diverting CPU threads to that slower device, slowed the overall result and made it far more inconsistent. With this testing out of the way, it looked like just under 1.8 million IOPS was the highest 4K random figure we would hit with four fully conditioned (i.e. all sectors written at least once) SSDs. We were pegging the CPU cores (all overclocked to 4.5 GHz) at the same time we were hitting the maximum throughput capability of the SSDs, so we couldn’t definitively make the call on which was a bottleneck. I believe that at this point *both* were contributing to the bottleneck, but I can say that we were able to squeeze just over 2 million IOPS out of this configuration before the SSDs were conditioned (fully written):
The above result needed only QD64 per SSD to achieve because the drives were unconditioned and more able to exceed their rated 430,000 IOPS per SSD. Before conditioning, the SSD 750s were simply returning null data (Deterministic Read Zero after Trim – DZAT), since we were trying to read data that had never been written. Since the 750s did not have to do as much legwork to respond to the IO requests, this shifted the bottleneck more towards the CPU and the Windows kernel’s ability to handle them. This means that, based on what we’ve seen here, a Core i7-5960X overclocked to 4.5 GHz pegs at ~2 million IOPS to NVMe devices under Windows 8.1.
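As a quick sanity check on those figures (the rated IOPS number is the 430,000 per SSD quoted above; the observed values are from our runs):

```python
# Sanity-check the IOPS figures discussed above.
RATED_4K_READ_IOPS = 430_000      # rated 4K random read IOPS per SSD 750
DRIVES = 4                        # SSDs used for the random IO testing

aggregate_rated = DRIVES * RATED_4K_READ_IOPS
print(aggregate_rated)            # → 1720000, close to the ~1.8M measured
                                  #   on fully conditioned drives

# Express the 2 million unconditioned-IOPS peak as raw 4K payload throughput:
peak_gb_per_sec = 2_000_000 * 4096 / 1e9
print(round(peak_gb_per_sec, 1))  # → 8.2 GB/sec of 4K transfers
```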
Before we close, let’s look at another couple of fun pieces of data. Here’s what the CPU cores look like ramping up from QD=1 per worker to QD=32 per worker, exponentially (6 steps):
As you can see, the Windows scheduler is intelligent about allocating threads, spreading the load across the eight physical cores for as long as possible before ramping up the second thread of each of those cores.
Finally, here’s what the set of four SSD 750’s looks like running our Web Server workload. Realize this is not an apples-to-apples comparison, as we are on a higher clocked CPU with the workload configured across 16 workers (we historically ran this test with only one worker). The charted line for the new data is shifted to the right, as a 16x multiplier is effectively present in the resulting data:
The high number there is only 519k IOPS, but our Web Server test is far more demanding than a simple 4K random workload, with some of the IO falling below the 4K sweet spot for the SSD 750. While not as high as the figures seen earlier in this review, this quad of storage is certainly capable of some impressive figures when compared to the others in that same chart.
Finally, here are the sequential and random read results in chart form:
So there you have it. Hopefully you learned a thing or two from this exercise. I know I did! Interesting things to consider when planning out your next insane storage system!
I’m sure that you took on this … responsibility … to check out these cards with an occasional … “My precious!”.
Truly Epic! and yes definitely worth an achievement/unlock or three 🙂
You are consistently AMAZING, Allyn!
> That’s actually a respectable figure considering we were on a PCIe 2.0 x4 link, which works out to 250 MB/sec x4, minus overhead from 8b/10b encoding.
Could you clarify a point I may have missed:
Were some of those 750s running at PCIe 2.0 speed,
while others were running at PCIe 3.0 speed?
I believe PCIe 3.0 implements a 128b/130b “jumbo frame”
instead of the PCIe 2.0 8b/10b legacy frame, correct?
Also, I believe PCIe 2.0 clocks at 5G, PCIe 3.0 at 8G, correct?
PCIe 2.0: 5G / 10.0 = 500 MBps
PCIe 3.0: 8G / 8.125 = ~1,000 MBps
(130 bits / 16 bytes = 8.125 bits per byte, jumbo frame)
Good catch, and my math was off. I can only suspect it was negotiating at PCIe 2.0 x2. We have seen PCIe 2.0 x4 saturate right at 1500 MB/sec (source – first ATTO result). The bottleneck may have been elsewhere as well. Will revise accordingly.
Why do you have an NSF (National Science Foundation?) branded shelving unit and where can I get one?
All of our shelving came from Sam's Club. 🙂
This is the one pictured: http://www.samsclub.com/sams/22-bin-rack-22-bins-gray/prod11400192.ip?navAction=
Could you do one of them videos, to show windows loading and then loading a bunch of apps, etc?
Just want to see if it “feels” faster 😉
Can't hardware RAID them, so no Windows install across the set. We had Windows on a SATA SSD for these tests 🙂
Admiring benchmark scores is fun and all, but it doesn’t look like SSDs will ever reach practical prices.
It wasn't too long ago that HDDs were $1/GB, or more recently $0.40/GB, which is what SSDs can be had for today.
When operating systems are stored in high-speed non-volatile memory, it will be a cinch to create a drive image of same in very little time. If Allyn’s setup supported a bootable RAID array, it would have approximated very closely the type of memory-resident OS that was the subject of a provisional patent application we filed several years ago. He was also kind enough to respond to my private email message — by clarifying that software RAID arrays are not bootable. So, I wrote to ATTO and they replied that they have no current plans to develop an NVMe PCIe RAID controller compatible with 2.5″ model 750s. We’ll just need to wait on that one.
Are you sure you picked the winners already? I know I won, but didn’t get my email yet.
That's a lot of IOPS, lol. I'm hoping one day we support bootable RAID 0 via Intel storage so we can take advantage of the 4K random write boost.
I may be wrong about this analysis, but it seems to me that a RAID-5 array using 4 x 2.5″ Intel 750 SSDs should afford the same or better level of data integrity + the extra speed should be worth the extra cost.
Am I wrong to expect compatible PCIe NVMe add-on RAID controllers that should exploit the raw upstream bandwidth of PCIe 3.0? That standard uses 128b/130b jumbo frames and an 8 GHz clock, so each PCIe 3.0 lane moves ~1.0 GBps
(8 GHz / 8.125 bits per byte).
I would prefer an add-on controller to a PCIe 750: what happens if some component fails inside the PCIe version? Will ALL the data be lost? I wonder.
p.s. Is TRIM a feature of the NVMe command set?
Of course. 🙂
Jaw dropping performance, but hardly relevant in a day-to-day home/media PC. That kind of power belongs to the enterprise where it can be properly utilized. What we need next are CPUs with like 80 lanes and x16 PCI-Ex AIC cards with 4 drives. 🙂 Really more interested in classic SATA SSDs with massive capacity. It’s time to move all these TBs of RAID from clunky HDDs to SSDs… 5 TB SSD for $300, anyone?
Seriously tho. Excellent work Allyn as always.
BTW: did you test RAID 10 on the 4-worker setup? Very interested in that. It’s no secret that I’m very much allergic to RAID0.
The ATTO results were for straight RAID-0. RAID-10 would not have been any better (likely worse). Windows RAID just did not want to go that fast 🙂
Nothing worse than getting a new gadget that someone else has opened. Not that anyone would say no to one.
I would gladly suffer the upset of receiving one if u are giving one away.
I just got the 750 series 1.2 TB. After some reading, it’s faster than the 400 GB model. So five of these would be fun, to see how much faster they’d be than the five 400 GB models you tested.
On the system we were testing on, we would not see any further IOPS performance, as we were pegging all CPU cores. Might have seen a bit more on the sequentials, though.
For being such a big storage guy, I am not sure why you would try to use Windows RAID instead of Storage Spaces. Watch the video below if you want to see some actual performance. They only use 3 NVMe drives in the second demo and beat your performance while using less than 20% CPU.
Windows Server 2016 TP2 with SMB3, Storage Spaces, Micron NVMe and Mellanox 100GbE
Jump to the summary at 1:13:13
I see some of the effects on upstream bandwidth
that result from the PCIe 3.0 8G clock and
128b/130b jumbo frame, specifically:
100 Gbps RDMA
11 GBps from one NIC port
Clearly, if the obsolete PCIe 2.0 legacy frame
had been used (8b/10b), THEN
100 Gbps / 10 bits per byte = 10 GBps MAX HEADROOM
withOUT including controller overheads.
So, that was a very promising result
from only 3 x Micron PCIe NVMe SSDs.
Now, scale up using 4 x 2.5″ NVMe SSDs
with a PCIe RAID controller with an
x8 or x16 edge connector.
That much bandwidth certainly compensates
for the overhead expected to result
from higher RAID levels e.g. RAID-5 or RAID-6.
Thanks for the link to that presentation.
In the video they don’t mention what the hardware configuration is, but if they are showing Storage Spaces in a high performance environment, the chances are it’s at least a dual Xeon server, with way more CPU cores per CPU than the desktop test machine here.
I’m a huge fan of Storage Spaces, but trying to do a performance comparison without knowing the specs is kind of tough. At home I’ve a Dell Precision 7600 workstation with 2x Xeon E5-2687, which I primarily use as a Hyper-V server. I’m seriously thinking about getting a couple of the 750s to try in it, whereas I’m currently just using a couple of Samsung drives. I know these aren’t server optimised, but the workloads I’m running should still benefit dramatically.
The other thing with Storage Spaces is that I’ve seen it hit a quad core Xeon E3 (1230 or 1220 variant, need to verify) incredibly hard with de-dupe enabled and multiple large sustained reads on a 2 column SSD and HDD config, so I wasn’t surprised to see Allyn’s results with a single CPU, despite it having more CPU cores.
> That much bandwidth certainly compensates from the overhead

That much bandwidth certainly compensates *for* the overhead
Isn’t the Dell Precision 7600 using PCI-Express 2.0?
SFFWG Renames PCIe SSD SFF-8639 Connector to U.2
FYI: repeating my post at the related article reporting name change to U.2 :
Found this today:
Here are Newegg’s pages for those 2 cards:
Reported: 4.2 GB/s with Supermicro AOC-SLG3-2E4 and 2 x 2.5″ Intel 750 SSDs:
OK, both drives are in. I set up a Windows RAID 0 array.
This is pretty ridiculous. Xeon D-1540, 128 GB RAM, and a 4.2 GB/s two-disk storage array running off a 120W PicoPSU.
@dba – how much would that much throughput cost just a few years back?
AOC-SLG3-2E4 seems to be working.
This article has photos of the Intel A2U44X25NVMEDK kit:
Four Potential Solutions
The four solutions we are looking at today are:
ASUS Hyper Kit ($22 street price)
Supermicro AOC-SLG3-2E4R ($149 street price)
Supermicro AOC-SLG3-2E4 ($249 street price)
Intel A2U44X25NVMEDK ($500 street price)
Our hope is that someone releases a generic version of this kit, perhaps with a larger PLX switch than is found on the AOC-SLG3-2E4 card. That would allow users to bypass the tough to find U.2 cables and upgrade existing systems easily.
Bottom line: awesome kit, but unless you have one of the specific Intel systems that works with the kit, the PCIe x16 card is unlikely to work in your machine. We hope someone builds a generic version of this kit.
The state of adding 2.5″ drive support to existing systems is meager to say the least.
There are really two options if you want to add drives to existing systems: the ASUS Hyper Kit ($22 street price) or the Supermicro AOC-SLG3-2E4 ($249 street price).
The former carries a low initial purchase price while the latter allows for expanding to significantly higher densities than had been previously possible.
While M.2 SSDs in gum-stick form factors make excellent client SSDs, the prospect of using higher-end drives in servers that come with both power-loss protection and higher write endurance is very exciting. Hopefully we will see more options available soon.
For the majority of servers and workstations with few if any M.2 slots, the Supermicro AOC-SLG3-2E4 is the card to get.
Be very careful as the much less expensive R version is designed for specific systems.
We did not test for booting off of NVMe drives. At this point, in servers and workstations where there is generally more space, booting off of a SATA SSD or USB drive (for embedded systems) is much easier.
Windows has a built-in driver, Linux has supported NVMe out of the box for some time, even FreeBSD now supports NVMe. That makes NVMe extremely easy to work with versus custom PCIe storage such as Fusion-io.
Our advice: boot off of SATA / USB, then add NVMe as data/application drives as necessary.
So far, the ASUS X99-Deluxe can run an Intel NVMe SSD in each of its six PCI-E slots.