UCDavis Manufactures a 1000-Core CPU

Posted by Scott Michaud | Jun 22, 2016 | Processors | 18

Source: UCDavis

Update (June 22nd @ 12:36 AM): Errrr. Right. Accidentally referred to the CPU in terms of TFLOPs. That's incorrect; it's not a floating-point processor. Should be trillions of operations per second (teraops). Whoops! Also, it has a die area of 64 sq. mm, compared to 520 sq. mm for something like GF110.

So this is an interesting news post. Graduate students at UCDavis have designed and produced a thousand-core CPU at IBM's facilities. The processor is manufactured on IBM's 32nm process, which is quite old (about halfway between NVIDIA's Fermi and Kepler, if viewed from a GPU perspective). Its die area is not listed, but we've reached out to their press contact for more information. The chip can be clocked up to 1.78 GHz, yielding 1.78 teraops of theoretical performance.

These numbers tell us quite a bit.

The first thing that stands out to me is that the processor is clocked at 1.78 GHz, has 1000 cores, and is rated at 1.78 teraops. This is interesting because modern GPUs (note that this is not a GPU; more on that later) are rated at twice the clock rate times the number of cores. The factor of two comes from fused multiply-add (FMA), a*b + c, which can be easily implemented as a single instruction and is widely used in real-world calculations. Two mathematical operations in a single instruction yield a theoretical max of 2 times clock times core count. Since this processor does not count the factor of two, it seems like its instruction set is massively reduced compared to commercial processors. If they even cut out FMA, what else did they remove from the instruction set? This would at least partially explain why the CPU has such a high theoretical throughput per transistor compared to, say, NVIDIA's GF110, which has a slightly lower TFLOP rating with about five times the transistor count. And that's ignoring all of the complexity-saving tricks that GPUs play and this chip does not. Update (June 22nd @ 12:36 AM): Again, none of this makes sense, because it's not a floating-point processor.
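For what it's worth, the rating arithmetic is easy to reproduce. The snippet below is only a rough sketch: the function names are made up for illustration, the transistor counts and ratings are the ones quoted in this post, and (per the update) the GF110 number is floating-point while this chip's is not, so the ratio is a loose comparison at best.

```python
def peak_ops_per_second(cores, clock_hz, ops_per_instruction=1):
    """Theoretical peak, assuming every core retires one instruction per cycle."""
    return cores * clock_hz * ops_per_instruction


def ops_per_transistor(ops_per_second, transistors):
    """Peak throughput divided by transistor count."""
    return ops_per_second / transistors


# UCDavis chip: 1000 cores at 1.78 GHz, one operation per instruction.
ucd_peak = peak_ops_per_second(cores=1000, clock_hz=1.78e9)

# The same chip if it were rated GPU-style, counting an FMA as two operations.
ucd_peak_fma_style = peak_ops_per_second(1000, 1.78e9, ops_per_instruction=2)

# Transistor counts and the GF110 rating as quoted in this post.
ucd_density = ops_per_transistor(ucd_peak, 0.621e9)
gf110_density = ops_per_transistor(1.5e12, 3e9)

print(f"UCDavis chip peak: {ucd_peak / 1e12:.2f} teraops")
print(f"Rated GPU-style:   {ucd_peak_fma_style / 1e12:.2f} teraops")
print(f"UCDavis chip: {ucd_density:.0f} ops/s per transistor")
print(f"GF110:        {gf110_density:.0f} ops/s per transistor")
print(f"Ratio:        {ucd_density / gf110_density:.1f}x")
```

The `ops_per_instruction` parameter is the "factor of two" from the paragraph above: leave it at 1 and the rating equals cores times clock, set it to 2 and you get the FMA-style number.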

"Big Fermi" uses 3 billion transistors to achieve 1.5 TFLOPs when operating on 32 pieces of data simultaneously (see below). This processor does 1.78 teraops with 0.621 billion transistors.

On the other hand, this chip is different from GPUs in that it doesn't use their complexity-saving tricks. GPUs save die space by tying multiple threads together and forcing them to behave in lockstep. On NVIDIA hardware, 32 threads are bound into a “warp”. On AMD, 64 make up a “wavefront”. On Intel's Xeon Phi, AVX-512 packs sixteen 32-bit values together into a vector and operates on them at once. GPUs use this architecture because, if you have a really big workload, chances are you have very related tasks: neighbouring pixels on a screen will be operating on the same material with slightly offset geometry, multiple vertices of the same object will be deformed by the same process, and so forth.

This processor, on the other hand, has a thousand cores that are independent. Again, this is wasteful for tasks that map easily to single-instruction-multiple-data (SIMD) architectures, but the reverse is also true: it is not wasteful for highly parallel tasks that SIMD handles poorly. SIMD makes an assumption about your data and tries to optimize how it maps to the real world. Either that's a valid assumption, or it's not. If it isn't? A chip like this would have multi-fold performance benefits, operation for operation.
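To put a rough shape on that trade-off, here is a toy cost model. It is a deliberately simplified sketch, not a description of how any real GPU or this chip schedules work: the function names and cycle counts are invented for illustration, and it ignores memory, scheduling, and everything else that matters in practice. The only idea it captures is that a lockstep group has to step through every branch that any of its lanes takes, while independent cores each run just their own branch.

```python
def lockstep_cycles(branch_of_lane, cycles_of_branch):
    """Cycles for one lockstep group (e.g. a 32-wide warp).

    Every distinct branch taken by any lane is executed in turn,
    with the lanes that did not take it sitting idle.
    """
    return sum(cycles_of_branch[b] for b in set(branch_of_lane))


def independent_cycles(branch_of_lane, cycles_of_branch):
    """Cycles when each lane runs on its own core: the slowest lane decides."""
    return max(cycles_of_branch[b] for b in branch_of_lane)


# Hypothetical workload: four code paths, 10 cycles each.
cost = {"a": 10, "b": 10, "c": 10, "d": 10}

uniform = ["a"] * 32                          # all lanes agree (SIMD-friendly)
divergent = ["abcd"[i % 4] for i in range(32)]  # lanes split four ways

for name, lanes in (("uniform", uniform), ("divergent", divergent)):
    print(name, lockstep_cycles(lanes, cost), independent_cycles(lanes, cost))
```

When the lanes agree, both approaches take the same time and the lockstep design wins on die area. When they diverge, the lockstep group pays for every path in sequence, which is the case where independent cores come out ahead.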

About The Author

Scott Michaud

Scott joined PC Perspective in May 2011. Prior to PC Perspective, Scott worked on personal projects and completed degrees in Physics and Education from Queen's University. While he does not write for other hardware sites, Scott works full-time as a software developer for Eliot Research & Consulting. He is also a geek, go figure.

18 Comments

Anonymous on June 22, 2016 at 2:05 am

Wasn’t this basically what LARRABEE was?
Reply
- Scott Michaud on June 22, 2016 at 2:08 am
  
  Nope. Larrabee was planned to use an AVX-512-like instruction set, allowing SIMD die-area benefits.
  Reply
  - Anonymous on June 22, 2016 at 2:41 am
    
    What happened to Adapteva’s Epiphany multicore coprocessor designs, with up to 4,096 processors on a single chip connected through a high-bandwidth on-chip network?
    
    Are they still around?
    Reply
razor512 on June 22, 2016 at 2:37 am

Can it be used to create a basic 4 function calculator (add, subtract, divide, multiply)?
Reply
biblicabeebli on June 22, 2016 at 3:30 am

according to this it has a 13.1W power envelope: http://vcl.ece.ucdavis.edu/misc/many-core.html
Reply
skennedy on June 22, 2016 at 3:40 am

The paper presented at VLSI Symposium is available:
http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf
Total chip area is 7.94 x 7.82 mm
Reply
collie on June 22, 2016 at 3:52 am

So I kinda see WHY this would be useful for certain theoretical tasks. There is more and more need for higher speed multiple SIMPLE calculations in scientific study, mostly in cataloging and comparative calculation, Cambridge “1 million genome” project springs to mind, YET they primarily use gpu technology for their needs. Perhaps they are building for a future where they need more complicated shit done faster…

What is the architecture of this cpu anyways? Power, X86, X64x86, some kinda ARM, or something homebrewed?
Reply
Pixy Misa on June 22, 2016 at 3:57 am

One correction – 1.78 trillion instructions per second. I think this is an integer-only chip. (The paper talks about software implementation of floating point.)
Reply
- Scott Michaud on June 22, 2016 at 4:46 am
  
  Er… right. It's not a floating-point processor. D'oh.
  Reply
Master Chen on June 22, 2016 at 11:46 am

“32nm is quite old”? PUH-EFFING-LEASE.

i7 2600K-god
Reply
- Anonymous on June 22, 2016 at 3:12 pm
  
  The first 32nm prototype parts were produced in 2004. The first consumer production 32nm parts came in early 2010 with Intel’s 32nm first-gen Core i-series Clarkdale/Westmere chips (see i5 660/670 and i3 530/540).
  
  32nm is, at best, 6+ years old, and at most over 12 years old. In the world of process node technology, 32nm really is quite old.
  Reply
  - Anonymous on June 22, 2016 at 4:24 pm
    
    They are using an IBM lab, so it’s probably some unused/little-used equipment at IBM’s facilities. With 32nm being an old process, it’s also a less expensive process node to work with for students and their teachers.
    
    They are developing an architecture and not a process node, so 32nm is good for students to work with and IBM’s 32nm SOI process is very mature with little chances of some of the fabrication issues that come with the leading edge process nodes.
    
    So for the grad/doctoral students and their professors IBM’s 32nm process is good for this project. Once the work is done and the 1000 core network of CPU cores processor is certified then the software work can begin to test the architecture across many different types of usage models. Look at where the funding is coming from and you can see where the potential uses for this system are.
    Reply
    - Anonymous on June 22, 2016 at 7:08 pm
      
      Oh I’m not arguing that it’s not the right process or anything like that at all. I’m only arguing against Master Chen’s implication that the statement, “32nm is quite old” is laughable. It’s not. As far as process node technology goes, it is pretty old.
      Reply
      - Master Chen on June 23, 2016 at 1:01 am
        
        i7 2600K is globally admitted as being one of the most successful and best top-tier (non-Extreme line) CPUs Intel ever made. And it’s been deemed so for all the rightful reasons. All Sandies above i3s are still highly sought-for and absolutely relevant on the worldwide PC enthusiast market, up to this very day. But you can keep on living in delusional denial of the harsh reality. No one would take THAT away from you, that’s for sure.
        Reply
        
        Anonymous on June 23, 2016 at 1:51 am
        
        Wow. Your penetrating insight into the topic is truly enlightening. Thank you for educating me on something that not only did I already know, but that I never claimed otherwise.
        
        Sandy Bridge was a fantastic architecture, and still is. You’re right about that, and I never said it wasn’t.
        
        None of that has to do with the fact that 32nm is a pretty old process node now. Accepting the fact that it’s pretty old now, and saying it out loud, does not in any way change or deny the fact that it was a very good node, and that Intel did great things with it.
        
        The 32nm process is still 6+ years old.
        
        Now. Please rethink your snide, condescending, and very very misplaced attitude.
        Reply
  - Master Chen on June 23, 2016 at 12:57 am
    
    Wow you’re ignorant. Must be great to live in such heavy denial of the reality, lol.
    Reply
    - Anonymous on June 23, 2016 at 1:54 am
      
      “heavy denial of the reality”?
      
      Really?
      
      The 32nm process node is more than 6 years old. Reality.
      In terms of these process nodes, 6 years old is pretty old. Reality.
      
      I am happily admitting this reality. Who in this conversation is denying it?
      Reply
Anonymous on June 22, 2016 at 7:08 pm

This whole post doesn’t make much sense with the GPU comparisons being irrelevant. I am not sure what the best option is in this case though. I assume that there are some applications which would benefit from such a device, although that set may be extremely niche. Examples may be difficult to find.

If you look at a modern general-purpose processor, the size of an individual core is actually tiny compared to the caches required to keep it running without long stalls. When you buy a CPU these days, you are actually buying a memory chip. In many cases, the cache hierarchy on chip, along with other “non-core” stuff like cache prefetchers and such, will actually be more important than what core the memory system is attached to. Most applications do not achieve very high IPC even on modern cores. This is why going with wider designs would be pointless; you are immediately bottlenecked by the memory system. Without any talk about caches, this device may have limited real-world use. We could see some such devices with a massive number of cores in the commercial market. The new ARM core is tiny, so many of them could fit on a die, but the caches to make them useful for most processing will limit the number of cores.
Reply
