Performance Focus: 4x Optane Memory and 4x 960 PRO in VROC RAID-0
In the interest of speed, I'll be sticking with random and sequential reads only for the charts. Note that each array was fully sequentially written prior to testing (fresh-out-of-the-box SSDs that have never been written may 'cheat' and instantly return zeroes without ever touching the flash, so real-world results can only be obtained by reading from areas that have previously been written).
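For anyone wanting to replicate that preconditioning pass, here is a minimal sketch. The target path, size, and fill pattern are placeholders and not details of our actual setup; writing to a raw device or array instead of a file would need elevated privileges and the appropriate device path.

```python
# Minimal sequential-write preconditioning sketch (assumed parameters, not our exact method).
import os

TEST_PATH = "testfile.bin"     # hypothetical target; a raw device/array path also works with admin rights
SIZE_GB = 32                   # matches the 32GB Optane Memory capacity used here
BLOCK = 128 * 1024             # 128KB writes, the same transfer size as the sequential tests

pattern = b"\xA5" * BLOCK      # non-zero data so reads can't be short-circuited by the controller
with open(TEST_PATH, "wb", buffering=0) as f:
    for _ in range((SIZE_GB * 1024**3) // BLOCK):
        f.write(pattern)
    os.fsync(f.fileno())       # make sure the data actually reaches the media
```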
A matched set of four Optane Memory 32GB modules was used to evaluate IOPS and latency, while a set of four Samsung 960 PRO 512GB SSDs was used to evaluate maximum throughput.
4KB Random Read:
IOPS:
Jumping right into these results, we have two groupings of IOPS curves. The bottom set shows the IOPS response of an increasing number of 960 PRO SSDs added to a RAID-0. The top set represents the same, but with varying numbers of Optane Memory modules in place of the 960 PROs.
Note how the IOPS performance of Optane is far superior to one of the fastest NAND SSDs we've tested to date. Four 960 PROs can beat a single 32GB Optane Memory module, but only at QD=32, and only because the Optane part had saturated by QD=8, giving Samsung time to catch up. With Optanes in a RAID, all bets are off, though there was a peculiarity noted at the lower queue depths, where any RAID configuration seemed to lose nearly half of its performance advantage over the NAND arrays. This becomes clearer if we break down the results in a different way, focusing more closely on the lower queue depths:
Note how the far left dark blue (QD=1) bar starts off at nearly 100,000 IOPS, while the next three blue bars all fall closer to 50,000 IOPS. More on that shortly.
Latency:
Let's start by focusing on that lower left point. 10 microseconds is in line with the expected latency of Optane Memory (as observed in our prior detailed analysis of that part). Unfortunately, it appears that applying any form of VROC RAID adds roughly 6 microseconds of latency. We've actually seen that number before in our triple M.2 RAID testing of the Z170 platform, but I was hoping for less of a negative impact with this newer platform, especially since the VMD controller sits at the CPU / hardware level. Still, remember we are dealing with pre-release, well, everything here, so this is obviously subject to optimizations and improvements.
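That added latency also lines up with the QD=1 bars seen earlier. As a quick sanity check, assuming the usual rule of thumb that IOPS = queue depth / mean latency (an approximation, not a measured figure):

```python
# Rough sanity check: at QD=1, IOPS is simply the inverse of mean latency.
def qd1_iops(latency_us):
    return 1 / (latency_us * 1e-6)

print(f"bare Optane Memory @ ~10 us : {qd1_iops(10):>9,.0f} IOPS")   # ~100,000
print(f"+ ~6 us of VROC overhead    : {qd1_iops(16):>9,.0f} IOPS")   # ~62,500
```

Those two figures are in the same ballpark as the ~100,000 vs. ~50,000 IOPS split seen in the QD=1 bars above.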
One general note on the above chart before we move on. Note that as you add SSDs, the latency profile rotates clockwise, effectively flattening and reaching higher QDs before curving upwards (latency begins to spike due to increased controller/media loading).
The QD=1-4 bar chart makes the latency differences between Optane and 960 PRO painfully obvious.
128KB Sequential:
Note that we chose 128KB sequential because the kernel will break requests >128KB into multiple 128KB chunks issued in parallel (and at an effectively higher QD than desired).
Now we get to the fun part. The bottom cluster (starting from the ~1GB/s point and spanning outward) is the Optane parts. These only link at PCIe 3.0 x2 and are not meant to excel at sequential performance. Still, by QD=8 we see them spread out into an even stack of increasing throughputs nearing 6GB/s. A single 960 PRO, with its x4 link and a controller channel layout better optimized for sequentials, bisects the Optane throughputs, falling between the two- and three-module Optane configurations. The rest of the 960 PRO configurations handily beat the Optane parts in sequential performance.
QD=16 is about as high as we've seen in our trace recording of Windows bulk file copy operations, so I've ended the bar chart spread at that depth. QD=32 is a moot point here anyway, as all configurations reached saturation closer to QD=8.
And now the chart you all came here to see:
Here we are looking only at QD=32 for the Optane and 960 PRO spread from a single to a quad-SSD RAID-0. We would ideally expect linear scaling here, and that appears to be exactly what happened. Quad Optane Memory 32GB hit 5.6GB/s, while quad 960 PRO 512GB achieved over 13.2 GB/s! We've certainly come a long way from the DMI bottlenecked Z170/Z270 limit of 3.6GB/s.
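A quick back-of-the-envelope check on that scaling is below. The single-drive figures (roughly 1.4 GB/s per Optane Memory module and 3.3 GB/s per 960 PRO) and the ~985 MB/s of usable bandwidth per PCIe 3.0 lane are approximations; only the quad-array totals come from the measured results.

```python
# Scaling sanity check against the theoretical PCIe 3.0 x16 ceiling (assumed single-drive figures).
PCIE3_LANE_GBPS = 0.985        # ~985 MB/s usable per PCIe 3.0 lane after 128b/130b encoding

optane_single, optane_quad = 1.4, 5.6     # GB/s (single figure approximate, quad measured)
pro_single, pro_quad = 3.3, 13.2          # GB/s (single figure approximate, quad measured)

print(f"Optane Memory scaling : {optane_quad / optane_single:.1f}x from one to four modules")
print(f"960 PRO scaling       : {pro_quad / pro_single:.1f}x from one to four drives")
print(f"PCIe 3.0 x16 ceiling  : {16 * PCIE3_LANE_GBPS:.1f} GB/s (vs. 13.2 GB/s measured)")
print(f"DMI 3.0 (x4) ceiling  : {4 * PCIE3_LANE_GBPS:.1f} GB/s theoretical (~3.6 GB/s in practice)")
```

In other words, the quad 960 PRO array is already operating within striking distance of what an x16 slot can physically deliver.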
I downloaded the guide, and I think that in the review you might have missed a step to configure VROC. I see that you configured the hardware for connecting multiple drives to a configured set of lanes. In the guide, they set specific VMD ports through a specific OCuLink connection, whatever that is. They also configured the Volume Management Device as an OCuLink connection, and they did the same for every CPU the system had. I'm assuming that the ASUS board has the ability to do this with a PCIe 3.0 connection. Correct me if I'm wrong, but I'm assuming that any RAID array created in the RSTe GUI will run under the PCH connection if the VMD ports aren't linked to the PCIe 3.0 connection in the BIOS.
Does anyone know where I can find the VROC key, and what it costs? Intel says “contact your mainboard manufacturer” and Gigabyte (I have a GA-X299-UD4 with 2 x Samsung 960 PRO) says “contact your dealer”, but I'm the dealer and I can't find the key!
Thank you!
Hi, I have a couple questions about bandwidth if someone can answer them for me:
1. Would I experience a bottleneck with 4 x Samsung 960 Pros if I use this card in an x8 slot rather than an x16 slot? Will it make any noticeable difference?
2. How does this card compare to the DIMM.2 risers on ASUS boards (Rampage VI Apex & Extreme)? The riser card provides 2 PCIe x4 connections directly to the CPU. Does the Hyper M.2 x16 card have additional overhead that would cause more latency than the riser cards?
As far as I know (though without actual empirical experience with 4 x Samsung 960 Pros), to exploit the raw bandwidth of an x16 slot, the BIOS/UEFI must support what is called PCIe lane “bifurcation”.
In the ASUS UEFI, it shows up as x4/x4/x4/x4:
https://www.youtube.com/watch?v=9CoAyjzJWfw
In the ASRock UEFI, it shows up as 4×4:
http://supremelaw.org/systems/asrock/X399/
This allows the CPU to access a single x16 slot as four independent x4 PCIe slots.
As such, even if an x8 slot were able to be bifurcated, it would end up as 2×4, or x4/x4, and the other 2 NVMe SSDs would probably get ignored.
There are some versions of these add-in cards that have an on-board PLX chip, which may be able to address all 4 SSDs even if only x8 PCIe lanes are assigned to an x16 slot by the BIOS/UEFI.
(Also, by shifting the I/O processing to the CPU, this architecture should eliminate the need for dedicated RAID IOP’s on the add-in card.)
Also, a full x16 edge connector may not fit into an x8 mechanical slot.
Ideally, therefore, these “quad M.2” AICs are designed to install in a full x16 mechanical slot that is assigned the full x16 PCIe lanes with bifurcation support in the BIOS/UEFI subsystem.
You should ask this same question of Allyn, because he will surely have more insights to share with us here.
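For question 1 specifically, a rough bandwidth check already suggests an answer for sequential transfers. The sketch below assumes roughly 985 MB/s of usable bandwidth per PCIe 3.0 lane; the 13.2 GB/s figure is the quad 960 PRO result measured in the review above.

```python
# Rough check of whether an x8 slot would bottleneck four 960 PROs on sequential reads.
PCIE3_LANE_GBPS = 0.985             # ~985 MB/s usable per PCIe 3.0 lane (assumption)
QUAD_960PRO_GBPS = 13.2             # quad 960 PRO sequential read measured in the review

for lanes in (8, 16):
    ceiling = lanes * PCIE3_LANE_GBPS
    verdict = "bottleneck" if QUAD_960PRO_GBPS > ceiling else "headroom"
    print(f"x{lanes:<2} slot: ~{ceiling:.1f} GB/s ceiling -> {verdict}")
```

So even setting bifurcation aside, an x8 slot would cap a four-drive 960 PRO array at roughly 7.9 GB/s on large sequential transfers, though random I/O at typical queue depths would be far less affected.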
If anyone is interested, ASRock replied to our query with simple instructions for doing a fresh install of Windows 10 to an ASRock Ultra Quad M.2 card installed in an AMD X399 motherboard. We uploaded that .pdf file to the Internet here:
http://supremelaw.org/systems/asrock/X399/