The basic Zen architecture found in the Ryzen processor, which we have discussed and debated many times on PC Perspective, remains unchanged for EPYC. According to the AMD briefing I took part in yesterday, the core was designed with server and data center deployment in mind. Plenty of outsiders have dismissed the AMD EPYC processor as simply a “glued together” Ryzen CPU, painting the platform as a desktop part repurposed for the server environment. The pedigree of the architecture really matters less than how it performs in the necessary workloads and environments, but AMD did give us more detail on the die-to-die and socket-to-socket communications to counter that narrative.
What AMD calls Infinity Fabric is actually a collection of interfaces whose use ranges from intra-die to inter-die to inter-socket. While AMD has not yet detailed how these interfaces are similar or different, we now know the performance specifications for each, which helps us judge the capability they offer.
AMD EPYC, and the upcoming Threadripper consumer HEDT parts, are multi-die packages. EPYC will have four dies on each CPU package, regardless of the number of enabled cores. AMD is disabling cores in a symmetrical pattern, so a 32-core part will have four dies with all 8 cores enabled on each. A 24-core processor will have 1 core of each CCX, and thus 2 cores per die, disabled. A 16-core part will have 2 cores of each CCX disabled, 4 per die. And the 8-core part will have 3 of the 4 cores per CCX disabled, leaving only 1 per CCX and 2 per die.
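The symmetric-disable math above can be sketched in a few lines, assuming the layout AMD describes: 4 dies per package, 2 CCXes per die, 4 cores per CCX, with the same number of cores disabled in every CCX.

```python
# Sketch of the EPYC core-count arithmetic described above (an
# illustration of the stated symmetric layout, not an AMD tool):
# 4 dies per package, 2 CCXes per die, 4 cores per CCX.
DIES = 4
CCX_PER_DIE = 2
CORES_PER_CCX = 4

def epyc_config(total_cores):
    """Return (enabled cores per CCX, enabled cores per die)."""
    total_ccx = DIES * CCX_PER_DIE  # 8 CCXes per package
    assert total_cores % total_ccx == 0, "symmetric layouts only"
    per_ccx = total_cores // total_ccx
    return per_ccx, per_ccx * CCX_PER_DIE

for cores in (32, 24, 16, 8):
    per_ccx, per_die = epyc_config(cores)
    print(f"{cores}-core part: {per_ccx} cores per CCX, {per_die} per die")
```

Running this reproduces the configurations in the text: the 24-core part keeps 3 cores per CCX (6 per die), the 16-core part 2 per CCX (4 per die), and the 8-core part just 1 per CCX (2 per die).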
I am still waiting for input from AMD on this, but it does bring up concerns about the L3 / thread-to-thread latencies of the architecture that we have discussed in the past. In the worst-case scenario, the 8-core design, there would be only one core per CCX, which would require all inter-core communication to happen through the L3, maximizing latency. Couple that with the unknown quantity that is die-to-die (or even socket-to-socket) latency and you have an interesting comparison point between platforms to dive into. I am working to get more information on this, as well as hardware to test and compare.
What AMD has shared to this point are some impressive bandwidth numbers that help alleviate concerns about the multi-die package implementation.
AMD has built low power, low latency links between the dies that offer as much as 42 GB/s of bi-directional bandwidth. Every die is connected to every other die, enabling single-hop data travel between any two. The links can drop to extremely low power states when cross traffic is minimal, keeping the TDPs of the processors down.
When we look at the socket to socket bandwidth and connection diagram, each die is connected to the matching peer die on the other socket with a 38 GB/s bi-directional link. That gives EPYC at most two hops of latency to traverse between any two cores on the system, with a total aggregate bandwidth availability of 152 GB/s.
The IO and connectivity portion of the processor supports eight x16 PCIe links, getting us to the magic 128 lane number. Each die supports two of them, though one per die is repurposed for the socket-to-socket interface in a 2P system, leaving each CPU with 64 lanes of PCIe and the system with a total of 128. Each link supports 32 GB/s of bandwidth, for 256 GB/s per socket – a substantial amount of potential throughput. Those connections can be divided into as many as 8 PCIe devices per x16 link, totaling 64 possible x1 PCIe connections. How about that for a coin mining server?
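The lane accounting above is simple enough to verify directly. This is a back-of-the-envelope sketch of the stated figures, not a specification: eight x16 links per package at roughly 32 GB/s of bi-directional bandwidth each.

```python
# Back-of-the-envelope PCIe lane math for the figures above.
LINKS_PER_PACKAGE = 8
LANES_PER_LINK = 16
GBPS_PER_LINK = 32  # GB/s, bi-directional, per x16 link (stated figure)

lanes_1p = LINKS_PER_PACKAGE * LANES_PER_LINK             # 128 lanes in 1P
bandwidth_per_socket = LINKS_PER_PACKAGE * GBPS_PER_LINK  # 256 GB/s

# In a 2P system, one x16 link per die (4 per socket) is repurposed
# for the socket-to-socket fabric.
lanes_per_cpu_2p = (LINKS_PER_PACKAGE - 4) * LANES_PER_LINK  # 64 lanes
lanes_2p_total = 2 * lanes_per_cpu_2p                        # 128 lanes

print(lanes_1p, bandwidth_per_socket, lanes_2p_total)
```

Note the neat symmetry this produces: a 1P system and a 2P system both expose 128 usable PCIe lanes, since the second socket gives back exactly the lanes the fabric consumed.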
When you put it all together, it might look a bit messy, but the bandwidth and connectivity is there to make EPYC a powerful server processor and platform. The only question that remains unanswered for me is the “low latency” part of this slide that hasn’t been quantified…and that we haven’t tested yet.
AMD also claims additional benefits from its multi-die design. For one, the combined die area of these four chips is larger than a single reticle allows with today’s lithography technology; essentially, AMD claims this product could not have been built as a monolithic die in its current configuration. The approach also helps AMD increase yields, letting it attack the high end of the server / workstation market with more products at the top of the stack. The flexibility to configure dies is clearly an advantage over a single, monolithic design.
A Performance Example
In the build-up to today's release, AMD brought media to a room full of demos of EPYC at work, one of which stood out to me in particular. Using a single-socket system, with access to the full 128 lanes of PCI Express from the EPYC 7601 processor, AMD was able to break world records for IOPS using the FIO storage benchmark. Paired with the EPYC processor were 24 NVMe SSDs from Samsung (PM1725A in this case), each running with a full x4 of PCIe 3.0 bandwidth. At 3.2TB each, the total capacity of this software-defined array was 76.8TB!
These numbers are incredibly impressive – 9.1M IOPS read, 7.1M IOPS write (both running at full random 4K), and 53.3 GB/s of storage bandwidth when run at 128K random! Even more impressive for the EPYC platform is that the server still has 32 lanes of PCI Express remaining for networking hardware, compute resources like accelerators, or more storage controllers. This is a clear example of how the massive amount of IO connectivity AMD has brought can change the data center TCO landscape.
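The lane and capacity totals in the demo check out with some quick arithmetic; here is a sanity-check sketch using the figures quoted above (24 drives at x4 each, 3.2 TB apiece, 128 lanes total on the package).

```python
# Quick sanity check on the storage demo numbers quoted above
# (stated figures, not measurements of my own).
DRIVES = 24
LANES_PER_DRIVE = 4     # full x4 PCIe 3.0 per SSD
CAPACITY_TB = 3.2       # per Samsung PM1725A drive
TOTAL_LANES = 128       # EPYC 7601, single socket

lanes_used = DRIVES * LANES_PER_DRIVE   # 96 lanes for storage
lanes_free = TOTAL_LANES - lanes_used   # 32 lanes left over
total_capacity = DRIVES * CAPACITY_TB   # 76.8 TB

print(f"{lanes_used} lanes used, {lanes_free} free, {total_capacity:.1f} TB")
```

The leftover 32 lanes are exactly what the text says remain for networking, accelerators, or further storage controllers.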
Closing Thoughts, for now
AMD is in a significantly different space today than it was only 4-5 months ago with CPUs. It has gone from a lingering memory in the minds of gamers and DIY builders to a prominent player in the field. It has revived interest among enthusiasts and OEMs like Dell for high-end gaming PCs and mainstream desktop builds. And today it prepares to make the same shift for the server and enterprise markets, launching the EPYC data center platform that returns competition to a market that has had a single major player for more than half a decade.
Based on the data I have seen, the products as they have been described to me, and the ecosystem in its current state, it’s hard to imagine AMD not being able to make significant headway in this field. The definition of “significant” will vary depending on whom you ask. Those wishing for a return to the Opteron peak will target 20%+ market share as the necessary milestone. Intel might view the loss of even a couple of percentage points of its highly profitable Xeon market as significant. To me, AMD management should probably be looking at a double-digit goal by 2020. That would shift outside views and opinions of AMD, help stabilize a financially strained corporation, and open the gates for more customers to feel comfortable with the product line.
But let’s be clear: though it should be an easily attainable goal to gain market share where you have almost none, there are roadblocks. AMD needs to prove the product can perform, and in more than just SPECint benchmarks. Intel does considerable work every year to optimize its hardware stack to meet the needs of the major platform players (think the Super 7). AMD needs to make a performance and cost argument to these groups that will turn heads.
AMD must also avoid the kind of platform pitfalls that plagued the Ryzen consumer launch. Data center customers have zero tolerance for that; mission-critical systems and data need to be running at 100%.
Can they do it? Absolutely. Will they? Hopefully we’ll see more in the coming days and weeks to prove they are on the right track.