Today at the Xilinx Developer Forum, Xilinx CEO Victor Peng announced a new product family named Versal. Originally revealed earlier in the year as Project Everest, Versal is the first family of devices in what Xilinx has coined the Adaptive Compute Acceleration Platform (ACAP) market.
ACAPs are a new product segment intended to solve some of the core difficulties Xilinx has observed with development on its current FPGA devices. FPGAs traditionally excel in the hands of developers oriented more toward the hardware world than the software world. However, these hardware developers are vastly outnumbered by software developers.
Built from the ground up with complete software programmability in mind, the ACAP concept aims to fix this through easy-to-use software tools, libraries, and runtimes, allowing hardware developers, software developers, and data scientists alike to leverage the power of application acceleration.
In general, ACAPs aim to offer similar performance levels of an ASIC, while still maintaining the highly programmable nature of an FPGA.
Versal, the first device under this ACAP designation, has been developed by Xilinx for what they see as "the era of heterogeneous compute." Versal tackles this prospect of heterogeneous compute through the use of Scalar Processing Engines, Adaptable Hardware Engines, and Intelligent Engines, along with the integration of advanced interfaces. Versal is built on cutting-edge 7nm FinFET technology from TSMC.
Continue reading our preview of Xilinx Versal ACAP!
Scalar Processing Engines
Like we've seen in other Xilinx products in the past, Versal integrates Arm Cortex-A series and Cortex-R series processors. In this particular arrangement, we have a dual-core Cortex-A72 and a dual-core Cortex-R5 design.
These Arm cores can be used for the traditional tasks where CPUs generally excel, like complex algorithms and general-purpose compute, thanks to the high level of software compatibility found on the Arm platform.
Adaptable Hardware Engines
The Adaptable Hardware Engines of Versal are where you'll find the type of programmable logic that Xilinx is known for. This logic can be tailored to a given set of parameters, making it a high-performance, low-latency solution for a specifically targeted application. This portion of Versal can be compared to a traditional FPGA offering, but with some major changes.
The adaptable hardware layer in Versal now allows for dynamic reconfiguration 8x faster than other Xilinx FPGA products, and any portion of the platform can be reconfigured on the fly while the rest of the system continues operation, allowing for greater flexibility while maintaining mission-critical uptime.
Customers can also create custom memory hierarchies within the adaptable hardware layer, allowing for further customization to a given application or workload.
Intelligent Engines
The Intelligent Engine portion of Versal includes both advanced DSP capability for low-latency, high-precision floating-point processing and dedicated fixed-function hardware for AI inference.
The Versal AI Engines consist of an array of SIMD vector processing cores running at 1 GHz, each connected to a portion of local memory.
Xilinx claims this modular design helps them scale inferencing ability from low-power edge devices (~5W) all the way up to the data center through discrete add-in cards (~150W).
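To give a feel for what those SIMD vector cores are computing, here is a minimal sketch (in plain Python, purely illustrative and not Xilinx code) of the multiply-accumulate (MAC) operation that dominates neural network inference. A vector engine performs all of the lane multiplies of one of these loops in parallel per cycle; the function and numbers below are hypothetical.

```python
# Illustrative sketch (not Xilinx code): the core operation an AI inference
# engine accelerates is the multiply-accumulate (MAC) over a weight vector
# and an activation vector. A SIMD engine computes all lane products in
# parallel; here the arithmetic is modeled serially in plain Python.

def vector_mac(weights, activations, accumulator=0.0):
    """Accumulate the dot product of one weight/activation vector pair."""
    for w, a in zip(weights, activations):
        accumulator += w * a
    return accumulator

# One output neuron of a tiny fully connected layer:
result = vector_mac([0.5, -1.0, 2.0], [1.0, 2.0, 3.0])
print(result)  # 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 = 4.5
```

An array of such cores, each with its own local memory, can then work on different neurons or different images concurrently, which is the scaling property Xilinx points to above.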
Advanced Protocol Engines
One of the biggest pushes of Versal is the adoption of many industry-leading interfaces to allow developers to communicate with high speed and low latency to whatever device is necessary.
Not only do we see 16 lanes of PCIe Gen4, AXI-DMI, and CCIX, but IO options on Versal include multi-rate 100Gb Ethernet, MIPI D-PHY for cameras and sensors, LVDS, and even down to 3.3V GPIO for legacy applications.
As you would expect, Xilinx is including its highest-end transceiver technology in Versal, up to 112G PAM4 for the highest performance levels. Similarly, Xilinx is bringing its high-end RF signal chain work to the table, with multi-gigasample/sec ADCs/DACs, integrated digital down/up conversion, and even SD-FEC for 5G applications.
Supported memory includes DDR4 at speeds up to 3200 MT/s, LPDDR4 up to 4266 MT/s, and HBM (on specific SKUs). These fast memory options will help Versal deal with large data sets while maintaining high speed and low latency.
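As a quick sanity check on what those transfer rates mean, peak bandwidth is just the transfer rate multiplied by the bytes moved per transfer. The sketch below assumes a standard 64-bit DDR4 channel; the actual Versal memory controller configurations are not specified in the announcement.

```python
# Back-of-the-envelope peak bandwidth for the quoted memory speeds.
# Assumes a standard 64-bit DDR4 channel; actual Versal controller
# configurations may differ.

def peak_bandwidth_gbps(megatransfers_per_sec, bus_width_bits=64):
    """Peak bandwidth in GB/s: transfer rate times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return megatransfers_per_sec * 1e6 * bytes_per_transfer / 1e9

print(peak_bandwidth_gbps(3200))  # DDR4-3200: 25.6 GB/s per 64-bit channel
```

By the same arithmetic, the 1 Tb/s aggregate figure quoted for the NoC below works out to roughly five 64-bit DDR4-3200 channels' worth of traffic.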
All of these different interface and memory options are accessible through Versal's new Network-on-Chip (NoC). Essentially, the NoC ties all of the different engines of Versal together, through a common memory-mapped interface.
The NoC will work out of the box, with no place-and-route work needed to get up and running with the different interfaces. Not only will the NoC be available immediately, it will offer an aggregate bandwidth of 1 Tb/s with guaranteed Quality of Service.
Xilinx is claiming that the implementation of the NoC in silicon provides an 8x power benefit over soft implementations of similar functionality.
Xilinx today is launching two different product families of ACAP, Versal Prime and Versal AI Core.
The main difference between the two product lines is the addition of the AI Engines on the AI Core product, whereas Versal Prime is a mid-range device targeted at a broad array of applications.
Example applications of Versal Prime that Xilinx discussed included communications test equipment, data center network and storage acceleration, broadcast switches, medical imaging and more.
On the Versal AI Core side, Xilinx showed off some benchmarks comparing its deep learning inference ability to the likes of those from NVIDIA.
Right off the bat, Xilinx is claiming a 43X speed-up over an 18-core, 36-thread Intel Xeon Platinum 8124, and a 2X throughput advantage over NVIDIA's Volta-powered Tesla V100 in image recognition on a GoogLeNet network in Caffe.
When additional latency quality-of-service constraints are applied, the Xilinx Versal pulls even further ahead. Versal is able to deliver 72X the image recognition performance of the Xeon and 2.5X the performance of the Tesla V100 when the acceptable latency window is constrained to within 7 ms.
One of the big takeaways here is Xilinx's advantage in inference within a given latency window. While the inference performance of both the Xeon CPU and NVIDIA GPU options slipped as the latency window shortened, the Versal AI cores managed the same level of inference capability within a 2 ms window as within a 7 ms window.
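The usual explanation for why latency caps hurt CPUs and GPUs is batching: batch-oriented accelerators amortize a fixed per-batch overhead over many images, so a tight latency window caps the usable batch size and with it the throughput. The toy model below (hypothetical numbers, not Xilinx's benchmark data or methodology) illustrates the effect.

```python
# Toy model (hypothetical numbers, not benchmark data): throughput of a
# batch-oriented accelerator under a latency cap. Each batch costs a fixed
# overhead plus a per-image cost; the cap limits the feasible batch size.

def best_throughput(latency_cap_ms, fixed_overhead_ms, per_image_ms, max_batch=128):
    """Images/sec of the largest batch whose end-to-end latency fits the cap."""
    best = 0.0
    for batch in range(1, max_batch + 1):
        latency_ms = fixed_overhead_ms + batch * per_image_ms
        if latency_ms <= latency_cap_ms:
            best = max(best, batch / (latency_ms / 1000.0))
    return best

# A device with high per-batch overhead loses a large share of its
# throughput when the latency window shrinks from 7 ms to 2 ms:
for cap_ms in (7.0, 2.0):
    print(cap_ms, best_throughput(cap_ms, fixed_overhead_ms=1.5, per_image_ms=0.05))
```

An architecture with low fixed overhead per batch, which is what Xilinx is claiming for the AI Engines, sees far less degradation as the window tightens.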
Since one of the core goals of Versal is software adaptability, Xilinx was quick to point out that they are developing a complete software stack for programming Versal.
In addition to being able to deploy software written in C, C++, and Python, Versal will also provide support for Xilinx's Vivado Design Suite for experienced hardware developers, as well as access to frameworks like TensorFlow, Caffe, and MXNet for deep learning inference workloads using the AI cores.
The first Versal product families, Versal Prime and Versal AI Core, are set to ship in volume in the second half of 2019, with additional Versal product families coming in the years to follow.
Given the relatively long lead time until Versal ships, Xilinx might end up facing some stiff competition in some of its targeted areas like AI inference, as more and more products, like GPUs, move to 7nm.
All in all, Versal seems to be an impressive next step for Xilinx. Given the importance of high-performance, low-latency data processing in rapidly developing market segments like advanced driver-assistance systems, 5G wireless rollouts, and image recognition, having easy-to-use but highly configurable hardware solutions like ACAPs could prove vital to moving the industry forward.