Intel has just launched the add-in board version of the Xeon Phi, which it aims at supercomputing audiences. The company also announced that this product will be available as a socketed processor, embedded in, as PC World states, “a limited number of workstations” by the first half of next year. The interesting part about these processors is that they combine a GPU-like architecture with the x86 instruction set.
Image Credit: Intel (Developer Zone)
In the case of next year's socketed Knights Landing CPUs, you can even boot your OS with it (and no other processor installed). It will probably be a little like running a 72-core Atom-based netbook.
To make it a little clearer, Knights Landing is a 72-core processor with 512-bit vector registers. You might wonder how that can compete against a modern GPU, which has thousands of cores, but those are not really cores in the CPU sense. GPUs crunch massive amounts of calculations by essentially tying several lanes together, and doing other tricks to minimize die area per effective instruction. NVIDIA groups 32 threads into a “warp” that executes one instruction in lockstep. As long as they don't diverge, you get 32 independent computations for very little die area. AMD packs 64 together.
Knights Landing does the same. The 512-bit registers can hold 16 single-precision (32-bit) values and operate on them simultaneously.
16 times 72 is 1152. All of a sudden, we're in shader-count territory. This is one of the reasons why they can achieve such high performance with “only” 72 cores, compared to the “thousands” that are present on GPUs. They're actually on a similar scale, just counted differently.
Update: (November 18th @ 1:51 pm EST) I just realized that, while I kept saying "one of the reasons", I never elaborated on the other points. Knights Landing also has four threads per core. So that "72 core" is actually "288 thread", with 512-bit registers that can perform sixteen 32-bit SIMD instructions simultaneously. While hyperthreading is not known to be 100% efficient, you could consider Knights Landing to be a GPU with 4608 shader units. Again, it's not the best way to count it, but it could sort-of work.
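The counting in the paragraphs above can be sketched as simple arithmetic. The per-core figures (72 cores, 4 threads per core, 512-bit registers) come from the article itself; the “shader-equivalent” framing is, as noted, loose:

```python
# Rough "shader-equivalent" arithmetic for Knights Landing,
# following the article's (admittedly loose) counting.

CORES = 72                 # Knights Landing core count
THREADS_PER_CORE = 4       # 4-way SMT
REGISTER_BITS = 512        # AVX-512 vector width
FP32_BITS = 32             # single-precision float

lanes_per_register = REGISTER_BITS // FP32_BITS   # 16 FP32 lanes
simd_lanes = CORES * lanes_per_register           # 1152 -- "shader-count territory"
threads = CORES * THREADS_PER_CORE                # 288 hardware threads
shader_equiv = threads * lanes_per_register       # 4608 -- the update's figure

print(lanes_per_register, simd_lanes, threads, shader_equiv)
# 16 1152 288 4608
```

None of these is the “right” count; they just show that the CPU and GPU tallies land on a similar scale once you multiply everything out.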
So in terms of raw performance, Knights Landing can crunch about 8 TeraFLOPs of single-precision performance or around 3 TeraFLOPs of double-precision, 64-bit performance. This is around 30% faster than the Titan X in single precision, and around twice the performance of Titan Black in double precision. NVIDIA basically removed the FP64 compute units from Maxwell / Titan X, so Knights Landing is about 16x faster, but that's not really a fair comparison. NVIDIA recommends Kepler for double-precision workloads.
So interestingly, Knights Landing would be a top-tier graphics card (in terms of shading performance) if it were compatible with typical graphics APIs. Of course, it's not, and it will be priced way higher than, for instance, the AMD Radeon Fury X. Knights Landing isn't available on Intel ARK yet, but previous models are in the $2000 – $4000 range.
i can haz?
Hypothetically, could one install the AIB Knights Landing into a board using a socketed, bootable Knights Landing? Is Intel finally going to debut some form of SLI/CF? Or will that be handled as though they were just multiple CPUs? Would one be seen as a CPU while the AIBs are seen as GPUs? Windows 7 64-bit can support 256 cores, but only two CPUs. Not sure about Windows 10…
Help me Obi Michaud, my world is getting rekt.
Don't know the exact details. For example, how does a socketed Xeon Phi handle PCIe lanes? I expect that the socketed CPUs will appear as 72-core, 4 threads per core, CPUs with 512-bit registers, while the PCIe AIBs will show up as a discrete coprocessor.
I also believe that Intel claimed, a while ago, that you could mix-and-match socketed Xeon Phis with socketed Xeons. Haven't heard anything since, so that could be scrapped, but it would also raise interesting questions about how Windows would load-balance. Is it smart enough to know which cores to load with a few heavy tasks versus many light ones?
Windows might not, but few people will be trying to run Windows on one of these. Linux, on the other hand, does support that and is way more likely to be running on one of these.
It’s not made of 512-bit processors, as that would imply that the general-purpose registers are 512 bits wide. They are 64-bit processors (72 Silvermont Atom cores with SMT) with 512-bit AVX units. That’s 1152 32-bit FP SIMD lanes, which is low to barely mid-range GPU shader counts on consumer GPU cards, plus 72 x86 cores with SMT — and how much power does this draw? For the Phi’s price, how much GPU compute can be had with a GPU? And the GPU also has dedicated tessellation and other graphics units to go along with its vector/FP units.
It does show how gimped Nvidia GPUs are for compute resources compared to AMD’s GPUs, but I’ll be waiting to see how this does against AMD’s workstation/HPC Zen APUs on an interposer with a Greenland GPU accelerator, and AMD could easily add more DP FP resources to any dedicated HPC SKUs. Look for both AMD and Nvidia to add more FP resources on their HPC GPU offerings in response, while still offering a better price.
Really, how is the Intel SKU going to do without dedicated tessellation and other graphics units compared to GPUs? And there are also multicore ARM-based SKUs on the market from Cavium and others. How does the Phi benchmark against two 12-core Power8s? The Power8+’s are on the way, with Power9s to follow, alongside Nvidia Pascal (on Power8/8+) and Volta GPU accelerators with Power9s.
Are you seriously asking how this will benchmark against non-existent hardware?
How do you know that? There may be engineering samples being run through testing right now — things arrive in the labs long before they hit the shelves or appear in HPC/server systems!
So…
Intel just smoked Nvidia & AMD in Compute in one shot?!
Wow! Impressive!
No, not really — look at the FP performance of consumer AMD SKUs, and the pro Nvidia or AMD SKUs, before you jump on that bandwagon! Oh, and reach deep down into your pockets and the bank’s loan department before getting the Phi. The Phi is made of Atom cores, so maybe Intel will contra-revenue them. Not likely, though!
Price for a product like this is secondary.
With Intel making it easier to program for than either proprietary CUDA or AMD’s new kit, the cost savings in salary for coding will far outweigh the additional purchase cost.
Intel is being very smart in how they are moving into this market. Surprising actually.
Look, the HPC/supercomputer industry has used GPU accelerators for years, and the software middleware to make the programming process easy has been available just as long. So CUDA or OpenCL, and now Vulkan and other HSA software tools, have allowed and will continue to allow for ease of use.
The HPC/supercomputer industry is developing mostly open-source software tools that are shared and developed across the industry. Even with Knights Landing, Intel cannot beat the price/performance of GPU accelerators, and that is before the power-usage metric comes into play. For exascale computing, GPUs will still have more FP units running at lower clock rates, producing more floating-point operations per second with less power used. It still takes a lot of die space to implement the x86 ISA, and there will be ARM-based solutions with more cores than the Phi, but GPU-based solutions still have more FP units, and those units do not have to be clocked at power-hungry speeds. The Phi’s estimated 300 watts is a lot.
“Update: (November 18th @ 1:51 pm EST) I just realized that, while I kept saying “one of the reasons”, I never elaborated on the other points. Knights Landing also has four threads per core. So that “72 core” is actually “288 thread”, with 512-bit registers that can perform sixteen 32-bit SIMD instructions simultaneously. While hyperthreading is not known to be 100% efficient, you could consider Knights Landing to be a GPU with 4608 shader units. Again, it’s not the best way to count it, but it could sort-of work.”
72 times 16 times 2 (512-bit) vector units per core = 2034 (32-bit slices), so where is the 4608 number coming from? The 4 processor threads per core are sharing the core’s 2 AVX units; they do not have their own AVX units! So each of the Phi’s cores has 2 (512-bit) SIMD units, for a total of 32 single-precision FP calculations per core per clock, times the 72 total cores.
edit: 2034
to: 2304
Is that first sentence lacking a preposition near the end? “at” maybe?
Bye bye Nvidia in the HPC market…
Because it’s not just core count, but efficiency.
For a matrix streaming benchmark Nvidia might show OK numbers,
but in the real world, Nvidia is going to get decimated, even in single precision.
And it’s not just the chip, but also the entire platform.
AMD won’t suffer too much, since they don’t own much of the HPC market, but Nvidia… 2016 won’t be a good year.
‘4096’ “shaders” — where/how does it access the polygons? Nvidia cards have VRAM, and the coming cards have access to RAM storage.
Is this add-in going to access RAM storage to enable ‘shading’?
It seems the add-in will need some big changes to fundamental motherboard constructs for actual shader use. JMO.
I actually said 4608 (although 2304, 144, and several other numbers work too, depending on how you count). And yes, it doesn't have the fixed-function ASICs that a GPU does, but "shaders" refers to unified shaders, as in "compute shaders".
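All of those counts come from multiplying different subsets of the same per-core figures mentioned in this thread (72 cores, 2 vector units per core, 16 FP32 lanes per unit, 4 threads per core):

```python
# The different "how do you count it" figures from this thread.
CORES, VPUS, LANES, THREADS = 72, 2, 16, 4

vector_units = CORES * VPUS              # 144  (physical 512-bit SIMD units)
fp32_lanes   = CORES * VPUS * LANES      # 2304 (the commenter's corrected figure)
thread_lanes = CORES * THREADS * LANES   # 4608 (the article's thread-based figure)
print(vector_units, fp32_lanes, thread_lanes)
# 144 2304 4608
```

The 2304 figure counts physical lanes; the 4608 figure counts threads times lanes, which double-counts the shared vector units — hence the disagreement.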
Shader instructions have to be implemented in software on the Xeon Phi, because the Phi is just the x86 general-purpose ISA and does not have the dedicated shader and other instructions that GPUs implement in hardware. So no dedicated in-hardware tessellation, ROP, or other GPU units. GPUs can do graphics-oriented instructions as well as FP/INT number crunching, with specialized instructions and hardware implemented in the GPU’s graphics units. The dedicated in-hardware graphics units, in addition to the in-hardware FP/INT units, make the GPU more versatile. So the Phi may be able to do number crunching, but only for computations that aren’t presented visually; for any simulation whose 2D/3D data is shown, the Phi will still need to call on a GPU to present that data visually, while the GPU can do both the number crunching and the visual presentation.
I will note that Intel will have to use its FPGAs in addition to the Xeon Phi’s x86 cores if it wants a little more efficiency for graphics presentation without the help of a GPU. However, FPGAs are still not as efficient as ASICs, though they are more versatile because they can be reprogrammed. AMD will be including FPGAs with their HPC APUs on an interposer with Greenland/Arctic Islands graphics. So AMD will have full CPU, GPU, and FPGA ability, with the FPGA added to the HBM stacks, sandwiched between the HBM’s bottom logic/PHY chip and the HBM memory dies above. AMD’s new Arctic Islands GPU microarchitecture will add even more CPU-like abilities on top of its graphics functionality, so expect that once 14nm GPUs arrive, the Phi will be at even more of a disadvantage, as GPU SP/other unit counts move into the 10,000+ range on HPC GPU SKUs, and interposers allow more than one monolithic GPU die on the package.
Unless the architecture also replicates the register file 4 times, it seems overoptimistic to take the presence of 4 threads per core to imply a 4× increase in “instruction execution capability”.
However, very interesting.