It is difficult to know what is actually new information in this Intel blog post, but it is interesting nonetheless. Its topic is AVX-512, the extension to x86 designed for Xeon and Xeon Phi processors and co-processors. Basically, last year Intel announced "Foundation", the minimum support level for AVX-512, as well as the optional Conflict Detection, Exponential and Reciprocal, and Prefetch instruction subsets. That earlier blog post was very much focused on Xeon Phi, but it acknowledged that the instructions would make their way to standard, CPU-like Xeons at around the same time.
This year's blog post brings in a bit more information, especially for common Xeons. While all AVX-512-supporting processors (and co-processors) will support "AVX-512 Foundation", the instruction set extensions are a bit more scattered.
| Instruction Subset | Xeon Processors | Xeon Phi Processors | Xeon Phi Coprocessors (AIBs) |
|---|---|---|---|
| Foundation Instructions | Yes | Yes | Yes |
| Conflict Detection Instructions | Yes | Yes | Yes |
| Exponential and Reciprocal Instructions | No | Yes | Yes |
| Prefetch Instructions | No | Yes | Yes |
| Byte and Word Instructions | Yes | No | No |
| Doubleword and Quadword Instructions | Yes | No | No |
| Vector Length Extensions | Yes | No | No |
Source: Intel AVX-512 Blog Post (and my understanding thereof).
So why do we care? Simply put: speed. Vectorization, the purpose of AVX-512, has benefits similar to multiple cores. It is not as flexible as having multiple, unique, independent cores, but it is easier to implement (and works just fine alongside multiple cores, too). For an example: imagine that you have to multiply two colors together. The direct way to do it is to multiply red with red, green with green, blue with blue, and alpha with alpha. AMD's 3DNow! and, later, Intel's SSE included instructions to multiply two four-component vectors together. This reduces four similar instructions into a single one operating on wider registers.
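As a rough illustration, here is what that color multiply looks like with SSE intrinsics in C. The function and variable names are my own, not from Intel's post, and a decent compiler can emit the same instruction from a plain four-iteration loop:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Minimal sketch: multiply two RGBA colors with one SSE multiply.
 * Layout and names are illustrative. */
void multiply_colors(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);     /* load {r, g, b, a} into one register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_mul_ps(va, vb);  /* one instruction: four multiplies */
    _mm_storeu_ps(out, vr);
}
```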
Smart compilers (and programmers, although that is becoming less common as compilers are pretty good, especially when they are not fighting developers) are able to pack seemingly unrelated data together, too, if it undergoes similar instructions. AVX-512 allows sixteen 32-bit pieces of data to be worked on at the same time. If your pixel only has four single-precision RGBA values, but you are looping through two million pixels, you can do four pixels at a time (16 components).
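Here is a minimal sketch of that four-pixels-at-a-time idea using AVX-512 Foundation intrinsics. This is my example with illustrative names, and it assumes the component count is a multiple of 16; a real implementation would need a remainder loop:

```c
#include <immintrin.h>  /* AVX-512 intrinsics */
#include <stddef.h>

/* Multiply two buffers of RGBA pixels, 16 floats (4 pixels) per iteration.
 * Assumes n_floats is a multiple of 16. */
void multiply_pixels(const float *a, const float *b, float *out, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);  /* 16 components = 4 pixels */
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(out + i, _mm512_mul_ps(va, vb));
    }
}
```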
For the record, I basically just described "SIMD" (single instruction, multiple data) as a whole.
This theory is part of how GPUs became so powerful at certain tasks. They are capable of pushing a lot of data because they can exploit similarities. If your task is full of similar problems, they can just churn through tonnes of data. CPUs have been doing these tricks, too, just without compromising what they do well.
Ok Scott, you just proved you’re smarter than me… All I read in this article is “Blah, Blah, Blah… Blah, Blah, Blah.”
What does this thing do, and why should the average PcPer fan care? How would I use this?
Color me stupid, I guess…
It allows each processor core to do even more actual operations per "operation". It lets a lot of data be crammed into (basically) single steps.
Currently, it's only announced for Xeon and Xeon Phi. We'll see if it trickles back into consumer (I'd be surprised if it doesn't).
In layman’s terms, it gives every CPU core 32 GPU-like cores. GPU cores are tiny because they all execute the same program (instead of being fully independent), which makes them ideal for graphics. But there are many repetitive workloads other than graphics that could take advantage of GPU-like cores (e.g. physics, some types of AI, etc.). The actual GPU is often too distant to exchange results with efficiently, so AVX-512 brings this kind of core within the CPU cores themselves.
So your future CPU could be an 8-core, enhanced with 256 parallel cores from AVX-512. Thanks to the higher clock frequency, this makes them more powerful than today’s IGPs/APUs at parallel processing.
I don’t see that much use for this going forward in the consumer market. If you have something that is parallel enough to take advantage of these instructions, then it would probably run faster on a GPU anyway. Having the instructions directly in the ISA may perform better compared to having OpenCL or another translation layer, although GPUs generally have the memory bandwidth to support such computation while CPUs may not.
If we get APUs with the GPU tightly integrated, they may be able to just emulate these instructions on GPU units without actually using units in the CPU. The floating-point vector units take up a lot more die space than the integer units, and executing such code uses up a lot of bandwidth and cache somewhat unnecessarily. This is why AMD implemented multi-threading using multiple integer cores with shared FP units. Integer execution cores are tiny; a lot of the space on a CPU is cache, followed by instruction decode (at least for x86), and then probably FP units. If all of the vector instructions could be handled by GPU hardware, then a CPU core could probably just have a low-latency scalar FPU for the occasional FP instruction mixed in with integer code.
I don't look at it that way. If consumer processors get it, more software will compile for it. Thus, more software will be compiled for Xeon Phi.
AVX-512 will destroy GPGPU. Not the other way around. GPU manufacturers have been trying to increase the use of the GPU by generic applications, and some even claimed the CPU would become irrelevant, but after 10 years they’ve still achieved very little. The only successful ‘GPGPU’ applications are actually graphics applications, which really just makes them ‘GPU’ applications.
The problem is that there is a large overhead in communicating between the CPU and GPU. And it’s not getting any better. The computing power of the CPU and GPU increases much faster than the speed (both bandwidth and latency) at which they can exchange data. It’s a losing battle. APUs have improved the overhead, but it’s not nearly enough.
The only solution is to bring the GPU’s wide SIMD units within the CPU cores. And that’s exactly what AVX-512 is. It eliminates the overhead of switching between scalar processing and parallel processing.
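To illustrate that point (this is my sketch, not something from the comment): because the vector units live inside the CPU core, per-element conditions become AVX-512 mask registers, and the scalar fallback is just more code in the same function, with no data shipped to a separate device. Names are illustrative, and this assumes an AVX-512F-capable compiler and CPU:

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: double only the elements above a threshold, using a mask
 * register instead of a per-element branch, with a scalar tail. */
void double_above(float *data, size_t n, float threshold)
{
    __m512 vt  = _mm512_set1_ps(threshold);
    __m512 two = _mm512_set1_ps(2.0f);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512    v = _mm512_loadu_ps(data + i);
        __mmask16 k = _mm512_cmp_ps_mask(v, vt, _CMP_GT_OS); /* per-lane test */
        v = _mm512_mask_mul_ps(v, k, v, two);  /* multiply only masked lanes */
        _mm512_storeu_ps(data + i, v);
    }
    for (; i < n; i++)  /* scalar tail: same function, no device round trip */
        if (data[i] > threshold) data[i] *= 2.0f;
}
```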
GPUs having more bandwidth is actually a necessity and a burden instead of an advantage. They need it because their only means of hiding instruction latency is by switching between lots of threads. These threads consume lots of on-chip storage for temporary results, leaving no room for a big cache to efficiently store working data and lower memory access latency. So they waste bandwidth by storing things in RAM and reading/writing it multiple times. As GPUs increase in computing power they are desperate for more bandwidth and so they run the memory interfaces at 6+ GHz, which is smoking hot! They can’t do this any longer. They have to become more CPU-like to run fewer threads and store more relevant data per thread in local caches and reuse data efficiently.
AVX-512 is just part of a much bigger convergence between the CPU and GPU. It can be extended up to 1024-bit, which will rival GPUs at their own game. Eventually they’ll unify into a single architecture which combines the best of both worlds and enables a lot of new ‘accelerated’ applications to emerge.
Can software written for AMD’s 3DNow! work with AVX-512?
No. 3DNow! is comparable to SSE1 — but even then, software is not compatible. As a matter of fact, AMD removed 3DNow! from its processors.
Of course Intel will charge $2500, or more, for the Larrabee descendant, and AMD will still be the $/FLOP winner. How much bandwidth will this Xeon Phi have? Nvidia’s mezzanine-module GPU with Power8 will be a number cruncher as well, and we are probably talking about a 1024-bit bus for Nvidia’s mezzanine-module-based Power8 systems. AMD is also working on a stacked-RAM (4-layer, 1024-bit bus) APU. And with those 32 SIMD lanes per core, how wide is the bus leading to memory for the Xeon Phi, and can it even move the results out of the Phi’s cores fast enough?
According to this article, the next-generation Xeon Phi will have 8/16 GB of MCDRAM, which offers 500 GB/s of bandwidth, and up to 384 GB of six-channel DDR4 at 2400 MHz. But that’s not all: Xeon Phi has large internal caches which offer several TB/s of bandwidth. So there’s no shortage of bandwidth or capacity, and this is a power-efficient, hierarchical setup.
Xeon Phi isn’t aimed at consumers, but will be perfect for supercomputer needs.
AVX-512 in consumer CPUs also won’t be starved for bandwidth, thanks to the large L3 cache, DDR4 support, and an optional L4 DRAM cache. AMD’s APUs might still be the theoretical FLOP/$ winner, but that’s not a relevant metric. AVX-512 has tons of advantages in programmability and in the ease of extracting actual, practical performance, mostly thanks to being a homogeneous part of the CPU cores. Heterogeneous computing is a dead end due to being unattractive to developers and suffering from the Bandwidth Wall and Amdahl’s Law.
Developers have IDEs, frameworks, and APIs to hold their hands, and by “developers” you mean non-systems programmers!
Don’t worry, script kiddies, just call the API! There’s OpenMP and OpenCL, and Nvidia has CUDA. Power8 with Nvidia’s on-mezzanine-module GPUs will directly compete with Xeon Phi. High-performance computing can benefit from commodity pricing, just like the mobile market benefits from commodity pricing. AMD had better get to work on some high-performance APUs for the HPC market, hopefully with stacked memory on a mezzanine module like Nvidia’s. Intel is not the only company using many cores; there is an ARM-based product with many ARM cores and a CAPI interconnect fabric.

HSA is not dead, and HSA has been around in HPC for quite a while. Developers, the cubicle-dwelling kind, need not fear, for the systems developers, IDEs, and APIs have them covered. x86 lost its software advantage years ago, and the software ecosystems around Power/Power8 and ARM are up to the task of abstracting the hard parts for the “developers” who cannot find their way around HSA environments. The HSA Foundation is not going away and has new members joining, and IBM has its OpenPower and other technology-sharing associations, so Intel will find it harder and harder to control any market in a not-too-far-off time. AMD is not the originator of HSA, just an adherent to HSA’s scientific principles: unified memory and seamlessly sharing compute workloads between GPUs, CPUs, and other compute units. And having a 1024-bit bus connecting all of the mezzanine module’s components is not what I call bandwidth-constrained. Once the Nvidias, AMDs, Apples, and Samsungs begin to license Power8, along with ARM, things will not be so rosy in high-margin land.