At an event in San Jose on Wednesday, Qualcomm and its partners officially announced that the Centriq 2400 server processor, based on the Arm architecture, is shipping to commercial clients. The launch is notable: this is the highest-profile and most partner-lauded Arm-based server CPU and platform to be released after years of buildup and excitement around several similar products. The Centriq is built specifically for enterprise cloud workloads, with an emphasis on high core count and high throughput, and will compete against Intel’s Xeon Scalable and AMD’s new EPYC platforms.
Paul Jacobs shows Qualcomm Centriq to press and analysts
Built on the same 10nm Samsung process technology that gave rise to the Snapdragon 835, the Centriq 2400 becomes the first server processor on that particular node. While Qualcomm and Samsung tout that as a significant selling point, on its own it doesn’t hold much value. Where it does come into play, and impact the product’s positioning, is in the resulting power efficiency it brings to the table. Qualcomm claims that the Centriq 2400 will “offer exceptional performance-per-watt and performance-per dollar” compared to competing server options.
The raw specifications and capabilities of the Centriq 2400 are impressive.
| | Centriq 2460 | Centriq 2452 | Centriq 2434 |
|---|---|---|---|
| Process Tech | 10nm (Samsung) | 10nm (Samsung) | 10nm (Samsung) |
| Base Clock | 2.2 GHz | 2.2 GHz | 2.3 GHz |
| Max Clock | 2.6 GHz | 2.6 GHz | 2.5 GHz |
| Memory Speeds | 2667 MHz | 2667 MHz | 2667 MHz |
| Cache | 24MB L2, split | 23MB L2, split | 20MB L2, split |
| PCIe | 32 lanes PCIe 3.0 | 32 lanes PCIe 3.0 | 32 lanes PCIe 3.0 |
Built from 18 billion transistors in a die area of just 398mm², the SoC holds 48 high-performance 64-bit cores running at frequencies as high as 2.6 GHz. (Interestingly, this appears to be about the same peak clock rate as all the Snapdragon processor cores we have seen on consumer products.) The cores are interconnected by a bi-directional ring bus reminiscent of the integration Intel used on its Core processor family until Skylake-SP was brought to market. The bus supports 250 GB/s of aggregate bandwidth, and Qualcomm claims this will alleviate any concern over congestion bottlenecks, even with the CPU cores under full load.
The caching system provides 512KB of L2 cache for every pair of CPU cores, essentially organizing them into dual-core blocks. 60MB of L3 cache backs core-to-core communications, and that cache is physically distributed around the die for faster access on average. A 6-channel DDR4 memory system, running at up to 2667 MHz, supports a total of 768GB of capacity.
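The counts quoted above imply a few simple derived numbers. A quick sketch, using only figures from this article (and an admittedly naive assumption of perfectly uniform ring-bus traffic), makes them concrete:

```python
# Back-of-envelope topology numbers for the Centriq 2460,
# using only figures quoted in the article above.

cores = 48
cores_per_duplex = 2
duplexes = cores // cores_per_duplex                # 24 dual-core blocks
print(f"{duplexes} duplexes")

# Ring-bus bandwidth if all cores contend equally (a simplifying
# assumption; real traffic patterns are far less uniform).
ring_bw_gbs = 250
print(f"{ring_bw_gbs / cores:.1f} GB/s per core")   # ~5.2 GB/s

# Memory: 768 GB max capacity spread over 6 DDR4 channels.
channels = 6
max_capacity_gb = 768
print(f"{max_capacity_gb // channels} GB per channel")  # 128 GB
```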
Connectivity is supplied by 32 lanes of PCIe 3.0, supporting up to 6 PCIe devices.
As you should expect, the Centriq 2400 supports the ARM TrustZone secure operating environment and hypervisors for virtualized environments. With this many cores on a single chip, virtualization seems likely to be one of the key use cases for this server CPU.
Maybe most impressive are the power requirements of the Centriq 2400: it offers this level of performance and connectivity with just 120 watts of power.
With a price of $1995 for the Centriq 2460, Qualcomm claims that it can offer “4X better performance per dollar and up to 45% better performance per watt versus Intel’s highest performance Skylake processor, the Intel Xeon Platinum 8180.” That’s no small claim. The 8180 is a 28-core/56-thread CPU with a peak frequency of 3.8 GHz, a TDP of 205 watts, and a cost of $10,000 (not a typo).
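Taking the list prices and TDPs quoted here at face value, the raw ratios behind those performance-per-dollar and per-watt claims can be sketched. This ignores actual throughput, which Qualcomm's SPEC comparisons fold in, so treat it as arithmetic, not a benchmark:

```python
# Price/power ratios behind Qualcomm's value claims, using the
# figures quoted above (Centriq 2460 vs. Xeon Platinum 8180).

centriq_price, centriq_tdp = 1995, 120   # USD, watts
xeon_price, xeon_tdp = 10000, 205        # USD, watts (approx. list)

price_ratio = xeon_price / centriq_price  # ~5.0x cheaper
power_ratio = xeon_tdp / centriq_tdp      # ~1.7x lower TDP

# Even if both chips delivered identical throughput, perf-per-dollar
# would favor the Centriq by ~5x; Qualcomm's "4x" claim therefore
# implies it expects roughly 80% of the 8180's throughput.
implied_relative_perf = 4.0 / price_ratio
print(f"price ratio {price_ratio:.1f}x, power ratio {power_ratio:.2f}x")
print(f"implied relative throughput ~{implied_relative_perf:.0%}")
```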
Qualcomm had performance metrics from industry standard SPECint measurements, in both raw single thread configurations as well as performance per dollar and per watt. I will have more on the performance story of Centriq later this week.
More important than simply showing hardware, Qualcomm had several partners on hand at the press event, as well as statements of support from important vendors like Alibaba, HPE, Google, Microsoft, and Samsung. Present to showcase applications running on the Arm-based server platform was an impressive list of key cloud service providers and ecosystem players: Alibaba, LinkedIn, Cloudflare, American Megatrends Inc., Arm, Cadence Design Systems, Canonical, Chelsio Communications, Excelero, Hewlett Packard Enterprise, Illumina, MariaDB, Mellanox, Microsoft Azure, MongoDB, Netronome, Packet, Red Hat, ScyllaDB, 6WIND, Samsung, Solarflare, Smartcore, SUSE, Uber, and Xilinx.
The Centriq 2400 series of SoCs isn’t perfect for all general-purpose workloads, and that is something we have understood from the outset of this venture by Arm and its partners to bring the architecture to enterprise markets. Qualcomm states that its parts are designed for “highly threaded cloud native applications that are developed as micro-services and deployed for scale-out.” The result is a set of workloads that covers a lot of ground:
- Web front end with HipHop Virtual Machine
- NoSQL databases including MongoDB, Varnish, ScyllaDB
- Cloud orchestration and automation including Kubernetes, Docker, metal-as-a-service
- Data analytics including Apache Spark
- Deep learning inference
- Network function virtualization
- Video and image processing acceleration
- Multi-core electronic design automation
- High throughput compute bioinformatics
- Neural class networks
- OpenStack Platform
- Scaleout Server SAN with NVMe
- Server-based network offload
I will be diving more into the architecture, system designs, and partner announcements later this week as I think the Qualcomm Centriq 2400 family will have a significant impact on the future of the enterprise server markets.
120 watts of ARM (even if it’s 48 cores) seems like a lot.
This product will have a very hard time against the EPYC 7501 at the technical/performance level.
Also, by Qualcomm’s own admission, their fastest 48-core model is about 5 times slower than the 28-core Xeon Platinum 8180.
Almost twice the cores, but 5x slower. And both have nearly equal base clocks (2.2 vs 2.5).
That means ARM is still ~10x slower clock for clock than x86.
Good thing Qualcomm has money….
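For what it’s worth, this clock-for-clock estimate can be checked with straightforward arithmetic. Taking the claimed “5x slower” at face value (it is the commenter’s assumption, likely mixing single-thread and throughput results, not a measured figure):

```python
# Checking the "~10x slower clock for clock" estimate above.
# Assumption (the commenter's, not a measurement): the 48-core
# Centriq 2460 delivers 1/5 the performance of the 28-core
# Xeon Platinum 8180.

centriq_cores, centriq_base_ghz = 48, 2.2
xeon_cores, xeon_base_ghz = 28, 2.5
total_perf_ratio = 5.0  # Xeon / Centriq (claimed)

# Per-core deficit: the Xeon does 5x the work with fewer cores.
per_core = total_perf_ratio * centriq_cores / xeon_cores   # ~8.6x
# Normalize by base clock for a rough IPC-style comparison.
per_clock = per_core / (xeon_base_ghz / centriq_base_ghz)  # ~7.5x

print(f"per-core ~{per_core:.1f}x, clock-for-clock ~{per_clock:.1f}x")
```

On these numbers the gap works out closer to ~7.5x than 10x, and the underlying “5x” figure itself is contradicted by the Cloudflare benchmarks linked in a later comment.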
To squeeze that high a number of cores onto a single die (such that it would not be too large to fabricate), some compromises had to be made, especially in the CPU core design. Thus it’s not surprising that per-core performance would not be as great. Examples from Cavium’s ThunderX and Applied Micro’s X-Gene demonstrate this clearly.
Latest update: Qualcomm Centriq benchmarks galore here https://blog.cloudflare.com/arm-takes-wing/ and those are not “estimates” on marketing slides. As expected, per-core performance is still behind Intel’s x86 cores, but it often makes up that deficiency by having lots of cores.
Are the Centriq 2460, Centriq 2452, and Centriq 2434 really that different, core for core, from the Arm Holdings reference design cores? And those “raw specifications” need to be more complete: what are these SKUs’ Falkor cores’ decoder width/number of instruction decoders, the instruction issue rate (as in how many decoded micro-ops can be issued per clock), how many Int/ALUs and FPU/FP units (FLOPS rate) on that per-clock metric, what about the micro-op/reorder (for OOO) buffer size and instruction retire rate, and other VERY important Falkor core execution resources, core block diagrams, etc.?
Here is the Apple A7 Cyclone info:
Issue Width: 6 micro-ops
Reorder Buffer Size: 192 micro-ops
Branch Mispredict Penalty: 16 cycles (14-19)
Load Latency: 4 cycles
Indirect Branch Units: 1
L1 Cache: 64KB I$ + 64KB D$
Let’s be realistic: Anand Lal Shimpi (before he left AnandTech) had to coax these figures out using software methods because Apple was not that forthcoming with its A-series core specifications. But why is the ARM market so black-box about its “custom” core designs when both AMD and Intel provide so much more complete information on their respective x86 ISA core designs?
So Qualcomm is hiding something, as are most of the industry’s “custom” ARM core makers, and the lack of information is very telling! And there are folks in the server review industry with even more skill than Anand Lal Shimpi who can use the same methods and are not under any NDAs. This is one of the very reasons the custom ARM server core makers will remain behind the x86 server core makers, and both are behind IBM in making the proper amount of specifications and white papers available.
Readers can note that the Apple A7 Cyclone is a fully custom Apple core micro-architecture, and that the A7 is twice as wide a superscalar design as the ARM Holdings reference designs. And Nvidia’s Denver (V1) cores are a little bit wider superscalar than Apple’s A7 Cyclone core.
These marketing specs are sparse even by the usual standards of marketing obfuscation. What gives with you custom ARMv8-A micro-architecture folks? IBM/OpenPower’s POWER8/POWER9 RISC designs are nearly twice as wide superscalar as even Apple’s A7 Cyclone design! And IBM/OpenPower’s POWER8s/POWER9s even support SMT8 (8 processor threads per core: the POWER8 and one POWER9 SMT8 variant) and SMT4 (4 processor threads per core: the other POWER9 SMT4/24-core variant), and anyone and their dog can go online and get tons of 300+ page white papers on the POWER8 and POWER9 processors!
Oh, I wish AMD would at least announce its K12 custom design just so folks could see how much more info is provided. And IBM/OpenPower has them all beat for proper and cogently constructed documentation and manuals. Those old mainframe guys had the best technical writers ever; nothing was dumbed down, and the glossaries and dictionaries of IBM technical terminology ran way over 50 pages on the IBM 4341 Sierra group 12 that RAN OS/VS1 and VM/370. The IBM SOG and other related manuals took up some footage on the manual-room shelves, but each and every manual was edited and authored by folks in a proper way.
Old man Smell/out!
I think most people missed this one: Socionext’s SynQuacer http://socionextus.com/products/data-center-solutions/synquacer-multicore-arm-processor/ with 24 ARM Cortex-A53 cores.
Shouldn’t: “The caching system provides 512KB of L2 cache for every pair of CPU cores, essentially organizing them into dual-core blocks”
be: “The caching system provides 1MB of L2 cache for every pair of CPU cores, essentially organizing them into dual-core blocks”?
And isn’t it possible that they’re using 5 or 6 groups of 8 cores? The 2452 seems like the odd man out.
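The correction proposed in this comment is easy to sanity-check against the article’s own spec table, which lists 24MB of total L2 for the 48-core 2460:

```python
# Sanity check for the comment's L2 correction: the spec table lists
# 24MB of total L2 for the 48-core Centriq 2460. If L2 is shared per
# pair of cores, how much must each pair have?
total_l2_mb = 24
pairs = 48 // 2                        # 24 dual-core "duplex" blocks
per_pair_mb = total_l2_mb / pairs
print(f"{per_pair_mb} MB per pair")    # 1.0 MB, i.e. 1MB, not 512KB

# Working backwards, the 2452's 23MB and 2434's 20MB then imply
# 46 and 40 cores. 40 and 48 are multiples of 8; 46 is not --
# which is exactly why the 2452 looks like the odd man out.
for l2_mb in (23, 20):
    print(int(l2_mb / per_pair_mb) * 2, "cores")
```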