Intel recently provided a few insights into its upcoming Nervana Neural Network Processor (NNP) on its blog. Built in partnership with deep learning startup Nervana Systems, which Intel acquired last year for over $400 million, the AI-focused chip (previously codenamed Lake Crest) uses a new architecture designed from the ground up to accelerate neural network training and AI modeling.
The full details of the Intel NNP are still unknown, but it is a custom ASIC with a tensor-based architecture placed on a multi-chip module (MCM) along with 32GB of HBM2 memory. The Nervana NNP supports optimized, power-efficient Flexpoint math, and interconnectivity is a major focus of this scalable platform. Each AI accelerator features 12 processing clusters (with an as-yet-unannounced number of "cores" or processing elements) paired with 12 proprietary inter-chip links that are 20 times faster than PCI-E, four HBM2 memory controllers, a management-controller CPU, as well as standard SPI, I2C, GPIO, PCI-E x16, and DMA I/O. The processor is designed to be highly configurable and to meet both model and data parallelism goals.
The processing elements are all software controlled and can communicate with each other using high-speed bi-directional links at up to a terabit per second. Each processing element has more than 2MB of local memory, and the Nervana NNP has 30MB of local memory in total. Memory accesses and data sharing are managed with QoS software which controls adjustable bandwidth over multiple virtual channels with multiple priorities per channel. Processing elements can exchange data with each other and with the HBM2 stacks locally, as well as off die with processing elements and HBM2 on other NNP chips. The idea is to allow as much internal sharing as possible and to keep as much data as possible stored and transformed in local memory. That saves precious HBM2 bandwidth (1TB/s) for pre-fetching upcoming tensors, reduces the number of hops and the resulting latency by not having to go out to the HBM2 memory and back to transfer data between cores and/or processors, and saves power. This setup also helps Intel achieve an extremely parallel and scalable platform where multiple Nervana NNP Xeon co-processors on the same and remote boards effectively act as a massive singular compute unit!
Intel's Flexpoint is also at the heart of the Nervana NNP and allegedly allows Intel to achieve similar results to FP32 with twice the memory bandwidth while being more power efficient than FP16. Flexpoint is used for the scalar math required for deep learning and uses fixed point 16-bit multiply and addition operations with a shared 5-bit exponent. Unlike FP16, Flexpoint uses all 16-bits of address space for the mantissa and passes the exponent in the instruction. The NNP architecture also features zero cycle transpose operations and optimizations for matrix multiplication and convolutions to optimize silicon usage.
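To make the shared-exponent idea a bit more concrete, here is a minimal Python sketch of a Flexpoint-style format: every element of a tensor is stored as a 16-bit signed integer, and the whole tensor shares one exponent. The function names (`encode`, `decode`, `dot`), the signed 5-bit exponent range, and the way the exponent is picked from the tensor's current maximum value are all assumptions made for illustration; Intel has not published the hardware details, and this is not its implementation.

```python
import math

MANTISSA_BITS = 16   # every element is a 16-bit signed integer
EXPONENT_BITS = 5    # one exponent shared by the whole tensor (range assumed signed)

MAX_MANT = 2 ** (MANTISSA_BITS - 1) - 1    # 32767
MIN_MANT = -(2 ** (MANTISSA_BITS - 1))     # -32768
MAX_EXP = 2 ** (EXPONENT_BITS - 1) - 1     # 15
MIN_EXP = -(2 ** (EXPONENT_BITS - 1))      # -16


def encode(values):
    """Quantize floats to integer mantissas plus one shared exponent.

    Assumption: the exponent is chosen from the largest magnitude in
    this tensor so that it just fits in 16 bits.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest exponent e with max_abs / 2**e <= MAX_MANT, clamped to 5 bits.
    exponent = math.ceil(math.log2(max_abs / MAX_MANT))
    exponent = max(MIN_EXP, min(MAX_EXP, exponent))
    scale = 2.0 ** exponent
    mantissas = [max(MIN_MANT, min(MAX_MANT, round(v / scale))) for v in values]
    return mantissas, exponent


def decode(mantissas, exponent):
    """Recover approximate float values from mantissas and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


def dot(a, exp_a, b, exp_b):
    """Dot product using only integer multiply-adds on the mantissas.

    The two shared exponents are added once and applied to the final sum.
    """
    acc = sum(x * y for x, y in zip(a, b))
    return acc * 2.0 ** (exp_a + exp_b)


a, exp_a = encode([0.5, -1.25, 3.0])
b, exp_b = encode([2.0, 4.0, 1.0])
print(a, exp_a)
print(decode(a, exp_a))
print(dot(a, exp_a, b, exp_b))
```

The inner loop of `dot` touches only integers, which is where the fixed-point multiply and add savings would come from. The exponent costs 5 bits per tensor rather than per element, so its storage overhead shrinks toward nothing as tensors get larger.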
Software control allows users to dial in the performance for their specific workloads, and since many of the math operations and data movement are known or expected in advance, users can keep data as close to the compute units working on that data as possible while minimizing HBM2 memory accesses and data movements across the die to prevent congestion and optimize power usage.
Intel is currently working with Facebook and hopes to have its deep learning products out early next year. The company may have axed Knights Hill, but it is far from giving up on this extremely lucrative market as it continues to push towards exascale computing and AI. Intel is pushing for a 100x increase in neural network performance by 2020, which is a tall order. Still, Intel throwing its weight around in this ring should give GPU makers pause, as such an achievement could cut heavily into their GPGPU-powered entries into a market that is only just starting to heat up.
You won't be running Crysis or even Minecraft on this thing, but you might be using software on your phone for augmented reality or in your autonomous car that is running inference routines on a neural network that was trained on one of these chips soon enough! It's specialized and niche, but still very interesting.
- Intel Launches Stratix 10 FPGA With ARM CPU and HBM2
- Intel's Nervana chip targets Nvidia on artificial intelligence
- New AI products will Crest Computex
- Intel to Ship FPGA-Accelerated Xeons in Early 2016
- Intel Kills Knights Hill, Will Launch Xeon Phi Architecture for Exascale Computing @ ExtremeTech
- NVIDIA Discusses Multi-Die GPUs
“16-bits of address space for the mantissa”: addressing uses integers. Did you mean register/storage (in the tensor core/s) or cache/storage?
And this from the Intel link on its FlexPoint IP:
“It is worth noting that a Flexpoint tensor is essentially a fixed point, not floating point, tensor. Even though there is a shared exponent, its storage and communication can be amortized over the entire tensor, a negligible overhead for huge tensors. Most of the memory on device is used to store tensor elements with higher precision that scales with the dimensionality of tensors (typically huge for deep neural networks). The external storage (on host) of the shared exponents and statistics deque requires a small memory that is constant for each tensor.” (1)
1 See Intel’s Flexpoint link in the main article.
I think you are missing a word or two in the second paragraph. What exactly is the interconnect that is supposedly 20x faster than pci-e? That seems unlikely unless it is actually between clusters on the same die or interposer.
I’m guessing it is like Nvidia’s NVLink. NVLink allows each device to have up to 150 GB/s of aggregate interchip bandwidth (GPU to GPU or GPU to CPU) within a single server node. That arises from 6 bidirectional sublinks of 25 GB/s each, which can be ganged up between chips or used to connect multiple chips as desired. That is a little under 10 times the bandwidth allowed by a PCI Express 3.0 x16 connection. Perhaps Intel is creating a similar system that promises twice the bandwidth of the currently available version of NVLink.
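The arithmetic behind that comparison can be checked with a few lines of Python. This is a rough sketch, and it assumes the "20 times faster than PCI-E" claim is measured against one direction of a PCI Express 3.0 x16 link, which Intel has not confirmed. The NVLink figures are the ones quoted in the comment above, and the helper name `pcie3_gbytes_per_s` is made up for this example.

```python
def pcie3_gbytes_per_s(lanes):
    """Per-direction PCIe 3.0 bandwidth in GB/s.

    PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding,
    and 8 bits make a byte.
    """
    return 8.0 * lanes * (128 / 130) / 8


pcie_x16 = pcie3_gbytes_per_s(16)   # roughly 15.75 GB/s
nvlink_total = 6 * 25.0             # six sublinks at 25 GB/s each (from the comment)

nvlink_vs_pcie = nvlink_total / pcie_x16    # a little under 10x
twenty_x_pcie = 20 * pcie_x16               # what "20x PCI-E" would mean here
intel_vs_nvlink = twenty_x_pcie / nvlink_total

print(f"PCIe 3.0 x16:      {pcie_x16:.2f} GB/s")
print(f"NVLink (6 links):  {nvlink_total:.0f} GB/s ({nvlink_vs_pcie:.1f}x PCIe)")
print(f"20x PCIe 3.0 x16:  {twenty_x_pcie:.0f} GB/s ({intel_vs_nvlink:.1f}x NVLink)")
```

Under that assumption, 20x PCI-E lands at roughly twice the NVLink figure, which matches the guess above. If Intel's baseline is a different PCI-E generation or lane count, the numbers shift accordingly.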