Update (June 22nd @ 12:36 AM): Errrr. Right. I accidentally referred to the CPU in terms of TFLOPs. That's incorrect; it's not a floating-point processor. It should be trillions of operations per second (teraops). Whoops! Also, it has a die area of 64 mm², compared to the 520 mm² of something like GF110.
So this is an interesting news post. Graduate students at UC Davis have designed and produced a thousand-core CPU at IBM's facilities. The processor is manufactured on IBM's 32nm process, which is quite old (roughly half-way between NVIDIA's Fermi and Kepler, if viewed from a GPU perspective). Its die area is not listed, but we've reached out to their press contact for more information. The chip can be clocked up to 1.78 GHz, yielding 1.78 teraops of theoretical performance.
These numbers tell us quite a bit.
The first thing that stands out to me is that the processor is clocked at 1.78 GHz, has 1000 cores, and is rated at 1.78 teraops. This is interesting because modern GPUs (note that this is not a GPU; more on that later) are rated at twice the clock rate times the number of cores. The factor of two comes from fused multiply-add (FMA), a*b + c, which can easily be implemented as a single instruction and is widely used in real-world calculations. Two mathematical operations in a single instruction yield a theoretical maximum of 2 times clock times core count. Since this processor does not count the factor of two, it seems like its instruction set is massively reduced compared to commercial processors. If they cut out even FMA, what else did they remove from the instruction set? This would at least partially explain why the CPU has such a high theoretical throughput per transistor compared to, say, NVIDIA's GF110, which has a slightly lower TFLOP rating with about five times the transistor count (and that's ignoring all of the complexity-saving tricks that GPUs play, which this chip does not). Update (June 22nd @ 12:36 AM): Again, none of this makes sense, because it's not a floating-point processor.
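To make that arithmetic concrete, here's a quick back-of-the-envelope sketch in Python using the article's figures; the function and the "rated with FMA" variant are just illustrative, not anything from the paper:

```python
# Back-of-the-envelope peak-throughput math from the paragraph above.
# The numbers are the article's figures; the factor of two is the FMA convention GPUs use.

def peak_ops(clock_hz, cores, ops_per_core_per_cycle=1):
    """Theoretical peak = clock * cores * operations issued per core per cycle."""
    return clock_hz * cores * ops_per_core_per_cycle

kilocore_style = peak_ops(1.78e9, 1000, ops_per_core_per_cycle=1)  # no FMA counted
gpu_style      = peak_ops(1.78e9, 1000, ops_per_core_per_cycle=2)  # FMA counted as 2 ops

print(f"Rated without FMA: {kilocore_style / 1e12:.2f} teraops")  # ~1.78
print(f"Rated with FMA:    {gpu_style / 1e12:.2f} teraops")       # ~3.56
```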
"Big Fermi" uses 3 billion transistors to achieve 1.5 TFLOPs when operating on 32 pieces of data simultaneously (see below). This processor does 1.78 teraops with 0.621 billion transistors.
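Putting those two data points side by side (and remembering this is an apples-to-oranges comparison, floating-point throughput against integer throughput):

```python
# Rough throughput-per-transistor comparison implied by the figures above.

gf110_per_transistor    = 1.5e12 / 3.0e9     # "Big Fermi": ~500 ops/s per transistor
kilocore_per_transistor = 1.78e12 / 0.621e9  # this chip:   ~2,866 ops/s per transistor

print(f"{kilocore_per_transistor / gf110_per_transistor:.1f}x per transistor")  # ~5.7x
```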
On the other hand, this chip is different from GPUs in that it doesn't use their complexity-saving tricks. GPUs save die space by tying multiple threads together and forcing them to behave in lockstep. On NVIDIA hardware, 32 threads are bound into a “warp”. On AMD, 64 make up a “wavefront”. On Intel's Xeon Phi, AVX-512 packs sixteen 32-bit operations into a single vector instruction and executes them at once. GPUs use this architecture because, if you have a really big workload, chances are you have very related tasks: neighbouring pixels on a screen will be operating on the same material with slightly offset geometry, multiple vertices of the same object will be deformed by the same process, and so forth.
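To make "lockstep" concrete, here's a toy Python model (not any vendor's actual hardware behaviour) of a 32-wide warp: one instruction is issued, and every lane applies it to its own data element:

```python
# Toy model of SIMD lockstep: one instruction is issued, and all 32 lanes of
# the "warp" apply it to their own data element in the same cycle.
WARP_SIZE = 32

def issue(instruction, lanes):
    """Apply a single instruction to every lane at once (lockstep)."""
    return [instruction(x) for x in lanes]

pixels = list(range(WARP_SIZE))              # 32 related work items, e.g. neighbouring pixels
shaded = issue(lambda p: p * 2 + 1, pixels)  # one instruction, 32 results
print(shaded)
```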
This processor, on the other hand, has a thousand cores that are independent. Again, this is wasteful for tasks that map easily to single-instruction-multiple-data (SIMD) architectures, but the reverse is also true: it is not wasteful on highly parallel tasks that SIMD is wasteful on. SIMD makes an assumption about your data and tries to optimize how it maps to the real world; it's either a valid assumption, or it's not. If it isn't? A chip like this would have multi-fold performance benefits, FLOP for FLOP.
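And here's the flip side in sketch form: under lockstep, a data-dependent branch forces the warp to execute both paths with inactive lanes masked off, while independent cores each execute only the path their own data takes. The per-path costs below are made-up numbers, purely for illustration:

```python
# Toy cost model of a divergent branch, counting lane-slots (cycles * lanes).
# Under lockstep the warp issues BOTH paths and masks off inactive lanes;
# independent cores each execute only the path their own data takes.
WARP_SIZE = 32
COST_PATH_A = 10   # hypothetical instruction count for the "if" path
COST_PATH_B = 10   # hypothetical instruction count for the "else" path

def lockstep_slots(lanes_taking_a):
    # If even one lane takes each path, every lane pays for both paths.
    paths = (COST_PATH_A if lanes_taking_a > 0 else 0) + \
            (COST_PATH_B if lanes_taking_a < WARP_SIZE else 0)
    return WARP_SIZE * paths                   # slots consumed, useful or masked

def independent_slots(lanes_taking_a):
    return lanes_taking_a * COST_PATH_A + (WARP_SIZE - lanes_taking_a) * COST_PATH_B

print(lockstep_slots(16), independent_slots(16))   # 640 vs 320: twice the work issued
```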
Wasn’t this basically what LARRABEE was?
Nope. Larrabee was planned to use an AVX-512-like instruction set, allowing SIMD die-area benefits.
What happened to Adapteva’s Epiphany multicore coprocessor designs, with up to 4,096 processors on a single chip connected through a high-bandwidth on-chip network?
Are they still around?
Can it be used to create a basic 4 function calculator (add, subtract, divide, multiply)?
According to this, it has a 13.1 W power envelope: http://vcl.ece.ucdavis.edu/misc/many-core.html
The paper presented at the VLSI Symposium is available:
http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf
Total chip area is 7.94 x 7.82 mm
So I kinda see WHY this would be useful for certain theoretical tasks. There is more and more need for high-speed, multiple SIMPLE calculations in scientific study, mostly in cataloguing and comparative calculation; the Cambridge “1 million genome” project springs to mind. YET they primarily use GPU technology for their needs. Perhaps they are building for a future where they need more complicated shit done faster…
What is the architecture of this CPU anyways? POWER, x86, x86-64, some kinda ARM, or something homebrewed?
One correction – 1.78 trillion instructions per second. I think this is an integer-only chip. (The paper talks about software implementation of floating point.)
Er… right. It's not a floating-point processor. D'oh.
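For anyone curious what "floating point in software" looks like on an integer-only core, here's a deliberately simplified Python sketch of the idea (mantissa/exponent pairs manipulated with integer arithmetic only); it is not the paper's actual routine, and it skips rounding, normalization, and IEEE-754 entirely:

```python
# Simplified illustration of software floating point on integer-only hardware:
# a number is an (mantissa, exponent) pair of integers, meaning mantissa * 2**exponent.
# Not IEEE-754, no rounding or special cases; purely a sketch of the concept.

def soft_mul(a, b):
    """(ma, ea) * (mb, eb) = (ma*mb, ea+eb), using only integer ops."""
    (ma, ea), (mb, eb) = a, b
    return (ma * mb, ea + eb)

def soft_add(a, b):
    """Align exponents by shifting the larger-exponent mantissa, then add."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)   # ensure ea >= eb
    return (ma * (1 << (ea - eb)) + mb, eb)

def to_float(x):
    m, e = x
    return m * 2.0 ** e

# 1.5 * 2.25 + 0.5  ->  3.875
a, b, c = (3, -1), (9, -2), (1, -1)
print(to_float(soft_add(soft_mul(a, b), c)))   # 3.875
```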
“32nm is quite old”? PUH-EFFING-LEASE.
i7 2600K-god
The first 32nm prototype parts were produced in 2004. The first consumer production 32nm parts came in early 2010 with Intel’s 32nm first-gen Core i-series Clarkdale/Westmere chips (see i5 660/670 and i3 530/540).
32nm is, at best, 6+ years old, and at most over 12 years old. In the world of process node technology, 32nm really is quite old.
They are using an IBM lab, so it's probably some unused or little-used equipment at IBM's facilities. With 32nm being an old process, it's also a less expensive process node for students and their teachers to work with.
They are developing an architecture and not a process node, so 32nm is good for students to work with, and IBM's 32nm SOI process is very mature, with little chance of the fabrication issues that come with leading-edge process nodes.
So for the grad/doctoral students and their professors, IBM's 32nm process is good for this project. Once the work is done and the 1,000-core processor is certified, the software work can begin to test the architecture across many different types of usage models. Look at where the funding is coming from and you can see where the potential uses for this system are.
Oh I’m not arguing that it’s not the right process or anything like that at all. I’m only arguing against Master Chen’s implication that the statement “32nm is quite old” is laughable. It’s not. As far as process node technology goes, it is pretty old.
i7 2600K is globally admitted as being one of the most successful and best top-tier (non-Extreme line) CPUs Intel ever made. And it’s been deemed so for all the rightful reasons. All Sandies above i3s are still highly sought-after and absolutely relevant on the worldwide PC enthusiast market, up to this very day. But you can keep on living in delusional denial of the harsh reality. No one would take THAT away from you, that’s for sure.
Wow. Your penetrating insight into the topic is truly enlightening. Thank you for educating me on something that I not only already knew, but never claimed otherwise.
Sandy Bridge was a fantastic architecture, and still is. You’re right about that, and I never said it wasn’t.
None of that has to do with the fact that 32nm is a pretty old process node now. Accepting the fact that it’s pretty old now, and saying it out loud, does not in any way change or deny the fact that it was a very good node, and that Intel did great things with it.
The 32nm process is still 6+ years old.
Now. Please rethink your snide, condescending, and very very misplaced attitude.
Wow you’re ignorant. Must be great to live in such heavy denial of the reality, lol.
“heavy denial of the reality”?
Really?
The 32nm process node is more than 6 years old. Reality.
In terms of these process nodes, 6 years old is pretty old. Reality.
I am happily admitting this reality. Who in this conversation is denying it?
This whole post doesn’t make much sense with the GPU comparisons being irrelevant. I am not sure what the best option is in this case though. I assume that there are some applications which would benefit from such a device, although that set may be extremely niche. Examples may be difficult to find.
If you look at a modern general-purpose processor, the size of an individual core is actually tiny compared to the caches required to keep it running without long stalls. When you buy a CPU these days, you are really buying a memory chip. In many cases, the on-chip cache hierarchy, along with other “non-core” stuff like cache prefetchers, will actually be more important than which core the memory system is attached to.
Most applications do not achieve very high IPC even on modern cores, which is why going with wider designs would be pointless; you are immediately bottlenecked by the memory system. Without any talk about caches, this device may have limited real-world use. We could see some such devices with a massive number of cores in the commercial market. The new ARM core is tiny, so many of them could fit on a die, but the caches needed to make them useful for most processing will limit the number of cores.
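To illustrate the memory-bottleneck point, here is a hedged, back-of-the-envelope roofline-style sketch; the bandwidth and bytes-per-operation figures are invented placeholders, not anything published about this chip:

```python
# Toy roofline estimate of the memory bottleneck described above.
# ALL of these numbers except the peak rating are invented placeholders.

peak_ops_per_s = 1.78e12   # the chip's theoretical peak, from the article
mem_bw_bytes_s = 50e9      # hypothetical off-chip bandwidth, bytes/s
bytes_per_op   = 4         # hypothetical: one 32-bit operand fetched per operation

# With zero cache reuse, achievable throughput is capped by bandwidth, not by the cores:
memory_bound = mem_bw_bytes_s / bytes_per_op          # 1.25e10 ops/s
attainable   = min(peak_ops_per_s, memory_bound)

print(f"peak {peak_ops_per_s:.2e} ops/s, attainable without data reuse {attainable:.2e} ops/s")
```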