AVX-512 is an instruction set extension that widens the CPU's vector registers from 256-bit to 512-bit. It comes with a core specification, AVX-512 Foundation, and several extensions that can be added where they make sense. For instance, AVX-512 Exponential and Reciprocal Instructions (ERI) accelerate transcendental functions, which occur in geometry and are useful for GPU-style architectures. As such, ERI appears in Knights Landing but not anywhere else.
Image Credit: Bits and Chips
Today's rumor is that Skylake, the successor to Broadwell, will not include any AVX-512 support in its consumer parts. According to the lineup, Xeons based on Skylake will support AVX-512 Foundation, Conflict Detection Instructions, Vector Length Extensions, Byte and Word Instructions, and Doubleword and Quadword Instructions. Fused Multiply and Add for 52-bit Integers and Vector Byte Manipulation Instructions will not arrive until Cannonlake shrinks everything down to 10nm.
The main advantage of larger registers is speed. When you can fit 512 bits of data in a single register and operate on all of it at once, you can perform several linked calculations together. AVX-512 can operate on sixteen 32-bit values at the same time, which is obviously sixteen times the compute throughput of doing just one at a time… if all sixteen undergo the same operation. This is especially useful for games, media, and other vector-based workloads (like science).
This also makes me question whether the entire Cannonlake product stack will support AVX-512. While vectorization is a cheap way to get performance for suitable workloads, it does take up a large number of transistors (wider data paths, extra instructions, etc.). Hopefully Intel will be able to afford the cost with the next die shrink.
“Fused Multiply and Add for 52-bit Integers and Vector Byte Manipulation Instructions will not arrive until Cannonlake shrinks everything down to 10nm.”
52-bit integers? Is that a thing or a typo?
That's correct, but I don't know what they're referring to specifically, either.
See page 642 (744th electronic page) of Intel's Instruction Set Reference for an example.
The mantissa of a double-precision (64-bit) floating point value is 52 bits. So it’s trivial to handle integers with floating-point hardware as long as the values don’t exceed that size.
Yes, I know that the precision of a 64-bit float is 53 bits (52 explicitly stored); I have never actually heard of this being used directly for integer operations though, which is why I was wondering if it was a typo. It isn’t an issue if you are actually working with 32-bit values, but using 64-bit integers could be dangerous depending on how overflow is handled.
Maybe showing my age here, but I remember when x86 supported 80 bit floats which had a 64 bit mantissa and they offered 64 bit integer support that way. 8087, 287, 387….
I could see that as being very useful for getting 64-bit integer support on a 16-bit processor. This idea is exactly what MMX is, though: it uses the 64-bit mantissa field of the x87 registers rather than a new register set, although the registers can be accessed as a flat register file rather than a stack. The new register set came with SSE. With the new hardware, the native format is 64-bit, so you can't use the whole hardware pipeline for integers unless they are limited to 52 bits; converting anything larger than 52 bits would lose precision. It still seems a bit strange to use a non-standard size. It seems like it would be better to extend the hardware to support 64-bit mantissas, with proper modes to keep the floating-point arithmetic limited to IEEE standards. I don't know what the hardware cost of that would be, though.
For the consumer market, I don't think this is necessary. If you have a GPU, even a low-end GPU, you presumably already have more FLOPS than AVX-512 provides. These wide vector units take a lot of die space. They may also cause other conflicts: do you optimize the cache and data paths to provide the bandwidth necessary for these units, or optimize them for latency for the integer core?
So this is what happened to the work on the Itanium/IA-64 architecture? Its underlying core idea of a VLIW is reimplemented as extensions to the x86 architecture! Does AVX-512 also use an EPIC architecture, or have they gone a different route? https://en.wikipedia.org/wiki/Itanium
VLIW is not equivalent to SIMD.
So, skipping Skylake. Check.