Ampere recently announced the availability of its first ARM-based server processor, dubbed eMAG. The new chips use 16 or 32 custom CPU cores built upon the X-Gene 3 design (originally developed by Applied Micro) and are compatible with the 64-bit ARMv8-A instruction set. Ampere, in partnership with Lenovo (and several smaller, unspecified ODMs), has started shipping eMAG to its customers and partners. Current eMAG processors are built on TSMC's 16nm FinFET+ process, and Ampere plans to move future eMAG processors to TSMC's 7nm node while adding support for multi-socket servers as soon as next year.
Ampere's eMAG processors are designed for the datacenter with big-data computing workloads in mind that benefit from large amounts of memory and many cores, including big data analytics, web serving, and in-memory databases. The new ARM server CPU entrant is designed to compete with the likes of Intel's Xeon and AMD's EPYC x86-64 processors as well as other ARM-based offerings from Cavium and Qualcomm. Early reports suggest that eMAG is no slouch in performance, but where it really excels is in price-to-performance, performance per core per dollar, and total cost of ownership metrics.
Today's eMAG processors feature either 16 or 32 custom ARM cores clocked at 3.0 GHz base and up to 3.3 GHz turbo, with 32KB instruction and 32KB data L1 caches per core, a 256KB L2 cache shared between each pair of cores, and a globally shared 32MB L3 cache. There are eight DDR4 memory controllers (supporting up to 1TB of DDR4-2667 across 16 DIMMs, for up to 170.7 GB/s of memory bandwidth) as well as 42 lanes of PCIe 3.0 I/O. The CPU cores, caches, and controllers are connected by a switch that is part of a coherent fabric. Additional I/O includes four SATA 3 ports, two USB 2.0 ports, and 10GbE. The eMAG processors have a 125W TDP.
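As a quick sanity check, the quoted 170.7 GB/s figure falls straight out of the channel count and data rate. A back-of-the-envelope sketch (assuming the standard 64-bit, i.e. 8-byte, DDR4 data bus per channel):

```python
# Peak theoretical DDR4 bandwidth = channels x transfer rate x bus width.
channels = 8            # eMAG's eight DDR4 memory controllers
mt_per_s = 2667         # DDR4-2667: mega-transfers per second
bytes_per_transfer = 8  # 64-bit data bus per channel (assumed standard width)

bandwidth_mb_s = channels * mt_per_s * bytes_per_transfer
print(f"{bandwidth_mb_s / 1000:.1f} GB/s")  # -> 170.7 GB/s
```

This is the theoretical peak; sustained bandwidth in real workloads will land well below it.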
Perhaps most interesting is the pricing, which Ampere has set at a rather aggressive $550 for the 16-core chip and $850 for the 32-core processor. The Ampere chips are especially interesting following Qualcomm's apparent loss of interest in this space as it dialed back its Centriq efforts earlier this year. With a new ARM entrant lowering the datacenter barrier to entry for workloads that need lots of acceptable-performance cores paired with lots of memory, and with AMD's renewed datacenter push on all fronts, Intel is going to have its work cut out for it when it comes to maintaining its datacenter dominance. At the very least, it may shake up server CPU pricing. Further, perhaps beyond its intended use, these ARM-based offerings may also introduce new server platforms that are accessible to enthusiast virtual-lab builders and small HPC developers (small shops, universities, etc.) who can use lower-cost systems like these for testing and for research into developing highly parallelized code that will eventually run on higher-end servers in the "hyperscale" datacenter.
I am curious to see if the eMAG will live up to its performance claims and expectations of competing with the big players in this space. According to ExtremeTech, Ampere claims the 32-core eMAG is able to match the Intel Xeon Gold 6130 (16 core / 32 thread, 2.1-3.7 GHz, 22MB L3, and 125W TDP) in SPEC CINT2006 benchmarks. The company further claimed earlier this year that eMAG would offer up to 90% performance per dollar versus Xeon Silver and 40% higher performance per dollar compared to Xeon Gold processors from Intel.
What are your thoughts on eMAG and ARM in the server space?
Cavium's (now Marvell's) ThunderX2 is based on the Broadcom Vulcan core design, with up to four processor threads per core (SMT4). So the ThunderX and ThunderX2 designs are not that closely related, because ThunderX2 gets its DNA from the Broadcom Vulcan design that was sold to Cavium after Avago Technologies acquired Broadcom.
So yes, there is a lot of custom ARM server IP getting acquired via takeovers and mergers, and some of it sold off along the way: Broadcom (Avago doing business as Broadcom) sold its Vulcan design to Cavium, which was in turn acquired by Marvell.
So once all the dust settles, the ThunderX2 (Vulcan-based) remains the only custom ARM core supporting SMT. Let's have someone do a proper deep dive into all the custom ARM CPU core designs based on that custom core IP that keeps being sold back and forth and rebranded under new marketing names.
I'm getting tired of only hearing that it's designed to compete price/performance-wise with this or that Intel Xeon, while the several other ARM server competitors never get compared with price/performance figures of their own.
Let's also get each custom ARM core design properly documented for things like type of processor cache (exclusive, inclusive, or other), cache set associativity, reorder buffer size, number of execution ports/pipelines per core, branch misprediction penalty, decoder width, instruction issue width, SMT or non-SMT, and all the other core metrics, as well as the uncore (as Intel calls it) parts that allow the cores to interface with whatever peripherals are attached over their respective protocols and PHYs.
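Some of those cache metrics can at least be read off a running Linux box today via the kernel's standard sysfs cache interface; a minimal sketch (the `/sys/devices/system/cpu/.../cache` paths are standard on Linux but can be absent in some VMs or containers, hence the guards; ROB size, issue width, and the like still require vendor documentation):

```python
# Read per-CPU cache topology (level, type, size, associativity) from sysfs.
from pathlib import Path

def cache_info(cpu=0):
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    caches = []
    # Each cache (L1i, L1d, L2, L3, ...) appears as an indexN directory.
    for idx in (sorted(base.glob("index*")) if base.exists() else []):
        entry = {}
        for field in ("level", "type", "size", "ways_of_associativity"):
            f = idx / field
            if f.exists():
                entry[field] = f.read_text().strip()
        caches.append(entry)
    return caches

for cache in cache_info():
    print(cache)
```

On an eMAG you would expect this to report the 32KB L1 instruction and data caches, the shared 256KB L2, and the 32MB L3 described above.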
Qualcomm's puny little narrow-order superscalar, ARM Holdings reference-core-based designs were never in the game to begin with compared to the Cavium ThunderX2, or even Samsung's Mongoose M3 core design. And any of the ARM RISC core designs are still playing catch-up to POWER7/POWER8, which are also RISC processors executing the Power ISA, and IBM's/OpenPOWER's POWER9 designs are further away still.
They are all going to have to compete on price/performance and feature sets with AMD's EPYC/SP3, because that represents the lowest-cost x86 offering that most of the current server market will look at first if they are looking to move away from Intel's more costly x86 parts. With EPYC, AMD lets any server concern avoid much in the way of code refactoring or recertification for a non-x86 ISA while still making the switch away from Intel's pricey Xeon kit. I see that Ampere's eMAG does offer 8 memory channels, but 42 PCIe lanes is more Intel-like compared to AMD's EPYC/SP3 platform's 128 PCIe lanes.
Price-wise, and most likely price/performance-wise, any ARM server competitor has to undercut Cavium's ThunderX2 numbers, as that's the high-water mark for custom ARM server designs at the moment, unless Samsung spins up a Mongoose M4 variant with SMT capabilities. Samsung's current Mongoose M3 is a rather wider-order superscalar beast than most of the custom ARM market has fielded, with the M3 core even matching Apple's A-series designs for front-end decoder count while offering a wider issue width and larger back end than even Apple can. Then there are Fujitsu's custom ARM cores as well, which make use of Arm Holdings' new SVE (Scalable Vector Extension) that was developed in partnership with Fujitsu for Japan's Post-K exascale computing project.
Basing one's price/performance figures on Intel Xeon pricing is not going to get anyone's attention, what with server customers being very knowledgeable about their own operations and their specific operation's TCO figures and price/performance needs. Server customers continuously tune their operating equations to factor in their specific needs, and those clients will hire the best consultants if the expertise isn't already on staff.
“The company further claimed earlier this year that eMAG would offer up to 90% performance per dollar versus Xeon Silver and 40% higher performance per dollar compared to Xeon Gold processors from Intel.”
Yes, now show your performance-per-dollar figures for all the other competing offerings that are not priced out of the stratosphere like most of Intel's Xeon offerings are! And do so based on the specific server customer's actual workload and platform feature-set needs. I'd even like to see rough figures based on total-motherboard-cost per PCIe lane offered, as well as total-motherboard-cost per memory channel offered, for single- and dual-socket configurations. Server 2P (dual-socket) configurations are the most often used, so that's what will be looked at the most.
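Using only the list prices and I/O counts quoted in the article, those per-lane and per-channel metrics are easy to rough out. A sketch, with the big caveat that CPU list price is used here as a stand-in for total motherboard/platform cost (an assumption; real board pricing would shift the numbers):

```python
# Rough cost-per-PCIe-lane and cost-per-memory-channel, 1P configuration.
# CPU list price is used as a proxy for platform cost (assumption, not data).
platforms = {
    "Ampere eMAG 32c": {"price": 850, "pcie_lanes": 42, "mem_channels": 8},
    "Ampere eMAG 16c": {"price": 550, "pcie_lanes": 42, "mem_channels": 8},
}

for name, p in platforms.items():
    per_lane = p["price"] / p["pcie_lanes"]
    per_chan = p["price"] / p["mem_channels"]
    print(f"{name}: ${per_lane:.2f}/PCIe lane, ${per_chan:.2f}/memory channel")
```

Extending the dict with competing parts (ThunderX2, EPYC with its 128 lanes, assorted Xeons) at their real street prices is exactly the comparison being asked for above.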
BS
Not as much as what Ampere is putting out there trying to market its tat!
It's a whole lot of promise for no real reward. x86-based software would need to be rewritten, and that is a huge expense. So these will tend to go toward new entrants that can target their software in the development stage. Established players aren't going to go for this type of solution.
The expensive part is optimizing any x86 application code for the ARM ISA, as any high-level language can just be recompiled targeting the ARM ISA, and most servers are running Linux-kernel-based distros. It's more costly to optimize the high-level application code for a new ISA, and that's the part that requires some high-level code refactoring, depending on the server workload. And any software used in production server workloads has to be certified for that workload, and that's the most costly part of the process of changing to a different ISA.
See the amount of work that M$ has had to put into its Windows-on-ARM OS project. The Linux-kernel-based OS distros have become far more optimized for the ARM ISA over the decades than M$'s OS product; the Linux kernel is what's been used on the majority of ARM-based systems in use today in mobile, and any ARM-based system will be better optimized for Linux than for other OSes. Any Linux kernel optimizations are already there for the reference ARM cores, but if any of these server providers have their own application ecosystems that they developed over time for x86, then that x86-optimized code will need more refactoring to become optimized for a specific ARM core's hardware.
All the custom ARM core makers have different underlying hardware implementations engineered to execute the ARM ISA. So each custom ARM core maker will need to provide the compiler makers with a compiler optimization manual tailored to that custom core's eccentricities. This is true even for AMD's and Intel's x86 hardware, which differ at the underlying hardware level even though both execute the 32/64-bit x86 ISA. That is why AMD's Zen x86 microarchitecture has no problems with Meltdown and fewer issues with Spectre: AMD's and Intel's cache subsystems and branch prediction logic are different, as are their respective cache sizes and memory and other underlying CPU/chipset hardware subsystems.
Any server makers running Arm Holdings reference-design cores instead of fully customized ARM cores will have an easier time, because Arm Holdings has already done most of that work as far as optimizing for its reference core designs. AMD still has some Cortex-A57 reference-core-based Opteron A1100 server clients that it is contractually obligated to support.
It's very easy just to compile high-level code with no CPU-maker-specific optimization flags set in the compiler. That code will certainly run, but it will run in a non-optimized fashion that will not be as efficient for large server-farm usage and will require more server hardware to get the same amount of work done.
The big server customers took their sweet time evaluating AMD's EPYC, and part of EPYC's certification process was taken up by establishing the optimization process for first-generation Zen, but the total process should go a little quicker for Zen 2. The x86 ISA is still going to be more popular for the few M$ shops running Windows on bare metal, but a lot of the cloud services providers are running VM-based hypervisors or containerized OS instances, and the major VM/container products are optimized for both AMD's and Intel's x86 hardware and for the ARM-based hardware that has been on the market the longest. The big cloud services providers usually have more reasons (and money) to make the effort of optimizing at the hand-coding level for any maker's custom CPU core designs, with smaller operations just recompiling with whatever compiler optimization flags are needed.
IBM's/OpenPOWER's Power ISA runs fairly well optimized on the Linux kernel, but IBM spent billions over a period of years getting POWER8/POWER9 optimized for Linux, in addition to IBM's in-house OS/software ecosystem products.
The server market will probably remain mostly x86-based in the near term, but not for much longer after that, as the major custom ARM core providers will spend what is necessary to get their custom ARM SKUs running better optimized on all the VM/container software offerings. Google's Zaius, which Google co-engineered with Rackspace Hosting, is POWER9-based. So the big players like Google and others will make it easy for the smaller players to switch away from x86 to ARM, Power, and other ISAs.
AMD had better be holding on to all the K12 custom ARM core blueprints/Verilog that Jim Keller and the K12 team produced at the same time that AMD had Keller's Zen team designing the x86 Zen microarchitecture. AMD is still in a better position in the AI/compute markets to offer its Radeon GPU compute/AI acceleration products in a package pricing deal with its Zen/EPYC server/AI offerings. But AMD needs to save its K12 IP for any timeframe where ARM becomes more competitive on the low-power-efficiency curve in the high-density servers that will be used more often in the future. The ARM market has not taken off as quickly as expected, but even slow and steady growth has to be watched by AMD and others, lest they end up like Intel did in the mobile market.
So Google is an established player, and it and the other large players adopting non-x86 CPU products will eventually make it easy for the smaller players to switch to a different ISA. This is especially so with Google and other big providers making use of the Linux kernel and mostly open-source software for their massive server-farm OS/software infrastructure.