Arm Introduces Cortex-A77, Mali-G77, and New ML Unit
Arm Tech Day 2019
It can be generally said that most humans have a bad tendency to slow down or to delay innovation when they are at the pinnacle of their success and not challenged by a rival. We have seen this with silicon companies through the many product cycles that we have experienced over the past 50 years. Perhaps part of this is due to shareholders wanting better returns, so R&D is cut and product cycles lengthened to provide the greatest return on the initial investment. Add to this the slowing down of Moore’s Law which has again lengthened product cycles. It also denies semiconductor designers consistent improvements in speed and power derived from process node jumps.
When these observations are all put together, we understand why Intel has slowed their innovation to a crawl (how many generations of Skylake product are we on now?). Interestingly enough, this behavior has not been seen in another company that has a virtual monopoly on its market and also faces the same manufacturing issues as everyone else. Arm has seemingly gone the other way now that they have conquered the mobile market. Their rate of innovation and of releasing new and exciting products has increased dramatically over the past 10 years.
I have observed that, in the past, Arm would often announce products that promised a small uptick in performance and that would only see the light of day two to three years after the initial announcement. Each year after about 2010 they improved the breadth of their offerings by acquiring more design teams, investing more money in different areas, and working hard to bring innovation to the mobile market. They expanded into mobile graphics, which had once been dominated by Adreno and PowerVR, and are now matching and exceeding the performance of every other group out there.
For the last several years Arm has announced new products on a yearly cadence and has typically delivered parts within a year of announcing them. Arm of course does not supply chips to the market themselves, but they have ironed out the deliverables to their partners and licensees in such a way that time to market is not nearly as laborious as it once was with these new designs. Furthermore, Arm has exhibited consistent performance increases year over year that allow them to compete in terms of single-core IPC with higher powered desktop chips from Intel and AMD. I am not claiming that the latest Arm offerings can beat these chips in every possible scenario, but Arm is still making fast chips that run at sub-5W TDPs at 3 GHz.
Earlier this month Arm invited journalists and analysts to London, England to be briefed on what Arm is planning for next year. There are four major parts that will see the inside of cell phones next year. The Cortex-A77 is a heavily refreshed A76 variant that still looks to improve performance by 20% process independent. The Mali-G77 is a brand new graphics architecture that adds another level of performance without inflating die size or transistor count dramatically. The Mali-D77 is a new display controller based heavily on the previous D76, but with several significant VR features added. Finally we have the not yet named Machine Learning Unit which provides ML performance at the edge.
This was a pretty significant Editor’s Day at Arm due to not only these new products, but also the company’s leading edge position on several major industry changes and advances. Behind this all we see Arm also investing heavily on the software side (CPU, GPU, and ML). We have also seen several top gaming titles ported over from PC and onto mobile. Very little optimization was done to titles such as “Fortnite” to get them running smoothly on mobile parts. This perhaps more than anything gives us an idea of how far mobile platforms have come and how much real world performance Arm offers at these TDPs.
The Cortex-A77
Arm has a pretty interesting cadence in that it introduces a new CPU from one of two design groups (Austin and Sophia), and then we see several refreshes over the next few years. Then we get another major architecture/redesign from the other group and then that gets refreshed a few times. This has been a pretty common scenario over the last several years in that they introduce a major new product, then they go through and optimize the design for a refresh. The Cortex-A77 is the refresh part from Austin’s A76 that was introduced last year. The A76 supplanted the A75, which was the last refresh from the Sophia group.
When we consider refreshes, we don’t often expect to see major changes throughout the design. A few optimizations here, some small additions there, and they add up to a performance increase of a couple of percent when produced on the same process node. This is not the case with A77. Some major changes were implemented in the chip that look to give that 20% improvement in performance when running at the same speed as the A76 and on the same process. Arm aims for around 3 GHz for mobile and slightly higher for tablets. How Arm was able to get 20% out of a “refresh” is pretty amazing. The changes made could almost make this an entirely new chip if there wasn’t so much reuse throughout the design.
The big pushes could be categorized as “wider” and “lower latency”. A brand new 1.5K-entry MOP (short for macro-op) cache was introduced. This contains already-decoded instructions, so when an instruction is fetched again, the chip pulls it from this cache instead of decoding it a second time. This is a huge power and performance saver for the front end. It also helps to more effectively feed the execution units. Arm figures that around 85% of the instructions in a regular workload can be served from this MOP cache. Arm has also doubled the bandwidth of the branch prediction unit from 32 B/cycle to 64 B/cycle and improved its prediction accuracy. The branch target buffers (BTB) have grown as well: the main BTB is 33% larger at 8K entries, and the L1-BTB is 4x larger. Together these changes avoid more branch mispredicts and better feed the rest of the CPU.
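The idea behind a cache of decoded instructions can be sketched with a toy model. This is only an illustration: the LRU replacement policy, the address-keyed lookup, and every name below are assumptions made for the example, not details Arm disclosed. Only the 1.5K (1536) entry capacity comes from the briefing.

```python
from collections import OrderedDict


class DecodedOpCache:
    """Toy LRU cache of already-decoded instructions, keyed by address.

    Hypothetical model for illustration; not Arm's actual design.
    """

    def __init__(self, capacity=1536):  # 1.5K entries, per the briefing
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, address, decode):
        """Return the decoded op, calling decode() only on a miss."""
        if address in self.entries:
            self.entries.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return self.entries[address]
        self.misses += 1
        decoded = decode(address)
        self.entries[address] = decoded
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return decoded

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run_loop(cache, loop_body_size, iterations):
    """Replay a loop of loop_body_size instructions and report the hit rate."""
    def decode(addr):
        return ("decoded", addr)  # stand-in for the real decode work

    for _ in range(iterations):
        for addr in range(loop_body_size):
            cache.fetch(addr, decode)
    return cache.hit_rate()
```

In this model a 100-instruction loop run 10 times misses only on the first pass, so `run_loop(DecodedOpCache(), 100, 10)` returns 0.9. A loop body larger than the cache thrashes under LRU and gets no benefit, which is why the capacity matters.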
Through some clever port sharing, the A77 is now a six issue CPU, which is 50% wider than the A76. It can issue 6 instructions/cycle in most cases. The pipeline has also been shortened from 11 cycles in A76 down to 10 cycles in A77. Two more units were added to the execution engine, an additional ALU and an additional branch unit, to help execute instructions dispatched from the wider decode/dispatch unit. The caches are optimized to improve bandwidth to the core. An improved data prefetcher looks at access patterns, makes predictions, and prefetches to the L1, L2, or L3 caches. This helps to hide critical load latency to main memory. Efficient prefetching both increases performance and lowers power consumption. These new engines work to discover patterns and improve accuracy, and this is all done in hardware with limited software engagement.
This is all still compatible with the previously released DSU (DynamIQ Shared Unit). It continues to work with the Cortex-A55 for big.LITTLE implementations. On current 7nm processes, Arm expects these solutions to run around 3 GHz with partners potentially moving that number up. Arm also expects that next generation 7nm/6nm processes could boost that number further.
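Arm did not describe how the prefetcher detects patterns, so the following is only a minimal sketch of one classic approach, a stride detector, to show what "looking at access patterns" can mean. The class and method names are hypothetical.

```python
class StridePrefetcher:
    """Toy stride detector: predicts the next address once a stride repeats.

    Illustrative only; the A77's real prefetch engines are not public.
    """

    def __init__(self):
        self.last_address = None
        self.last_stride = None

    def observe(self, address):
        """Record one access; return a predicted next address or None."""
        prediction = None
        if self.last_address is not None:
            stride = address - self.last_address
            # Predict only when the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prediction = address + stride
            self.last_stride = stride
        self.last_address = address
        return prediction
```

Feeding it the addresses 0, 64, 128, 192 yields no prediction for the first two accesses and then 192 and 256, while an irregular sequence never triggers a prefetch. A real design would track many streams at once and decide which cache level to fill.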
Mali-G77
I believe Arm still has a way to go when it comes to marketing their product names. The Cortex-A77 was a refresh (albeit a major one) of a previous chip. The Mali-G77 is a brand new architecture that is codenamed Valhall rather than a refresh of the previous Bifrost family of GPUs. The difference here is comparable in scale to NVIDIA’s jump from Pascal to Turing. Some feel that Arm is not doing the Valhall core any favors by not naming it G80 or something of the sort. Valhall is a major step forward for Arm and its Mali graphics.
From a functional standpoint, G77 is similar to G76. Once we dig in, we see that the two architectures are very, very different. Arm has been watching graphics workloads change over the past decade in the mobile and tablet space. This change caused them to entirely rethink how the shader cores are set up and what workloads they will most efficiently handle. Mobile games are much more complex now than ever, and we have a machine learning component that was not well handled previously by the GPU.
The G77 seems like a fairly flexible unit that can scale from 7 shader cores to 16. The L2 cache can be configured with 2 to 4 slices and from 512 KB to 4 MB. The fastest combination would be 4 slices and 4 MB, but that of course comes at a price in terms of power and die space.
The G76 featured three 8-wide execution engines per core. The G77 changes that to a single engine per core made up of 2 clusters of 16-wide units. It features a new quad texture mapper in each core, which is double that of the G76. Arm simplified the scalar ISA so it is easier to compile for and more fully utilizes the new design. The G77 features 16-wide warps that can contain up to 1024 threads. They have multiple ALUs running in parallel, which has allowed them to get down to a 4 cycle datapath. The G76, by comparison, had ALUs set up in a more serial manner and an 8 cycle datapath. Arm has also rebalanced the design so it features more FMA operations and 8 bit dot-product. Other lesser used operations are still supported, but in the case of FRCP.f32 throughput is down ⅓ from the G76.
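The configuration ranges above can be written down as a small validity check. This is a hedged sketch: the class and field names are hypothetical, and it assumes the 512 KB to 4 MB figure is the total L2 size rather than a per-slice size, which the briefing did not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class G77Config:
    """Hypothetical description of one Mali-G77 configuration."""

    shader_cores: int
    l2_slices: int
    l2_total_kb: int  # assumption: the quoted range is total L2 capacity

    def __post_init__(self):
        # Ranges as quoted in the article.
        if not 7 <= self.shader_cores <= 16:
            raise ValueError("shader_cores must be between 7 and 16")
        if not 2 <= self.l2_slices <= 4:
            raise ValueError("l2_slices must be between 2 and 4")
        if not 512 <= self.l2_total_kb <= 4096:
            raise ValueError("l2_total_kb must be between 512 and 4096")


# The largest configuration described in the article.
biggest = G77Config(shader_cores=16, l2_slices=4, l2_total_kb=4096)
```

Anything outside the quoted ranges, such as a 6 core part, raises a `ValueError` in this sketch.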
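A quick back-of-the-envelope calculation shows what the change in engine layout means for execution lanes per core. It assumes the figures above read as three 8-wide engines for the G76 and two 16-wide clusters for the G77, and it counts peak lanes only, not measured performance. The function name is made up for the example.

```python
def lanes_per_core(units, width):
    """Peak execution lanes in one shader core: units times lanes per unit."""
    return units * width


g76_lanes = lanes_per_core(units=3, width=8)    # three 8-wide engines
g77_lanes = lanes_per_core(units=2, width=16)   # two 16-wide clusters

lane_increase = (g77_lanes - g76_lanes) / g76_lanes

# Datapath depth quoted in the article, in cycles.
g76_datapath_cycles = 8
g77_datapath_cycles = 4
datapath_ratio = g77_datapath_cycles / g76_datapath_cycles

print(f"G76: {g76_lanes} lanes/core, G77: {g77_lanes} lanes/core")
print(f"Lane increase: {lane_increase:.0%}, datapath depth ratio: {datapath_ratio}")
```

Under that reading each core goes from 24 to 32 lanes, a one-third increase, while the datapath depth halves.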
Arm did not talk about DirectX support, but they were very keen on mentioning that their design more fully embraces Vulkan. It has new features such as AFBC 1.3, FP16 render targets, layered rendering, and hardware allocated vertex shader outputs.
The G77 is also a better architecture for machine learning as compared to the G76. Not only does it feature more FMAs/cycle and 8 bit dot product, it also features a new load-store cache, faster atomic operations, a datapath with half as many cycles, and a 16 KB, 4-way set-associative, fully coherent L1 cache. All of these things come together to make it much more efficient and performance oriented when it comes to ML workloads.
Overall the new design follows along the line of “keep it the same size and power, but give it more performance”. When we are dealing with billions of transistors in an SoC design, there is near infinite leeway to change the design and make it better up to a theoretical point. Since the industry has been used to a regular cadence of process node advances, there was not nearly as much pressure to fully optimize any one design. Now that process node changes are further apart than before, design is key. Arm has taken a look at the marketplace, the software, and where they could improve. For a slightly larger die size, Arm has given the G77 around a 33% performance boost over the G76, all the while not consuming any more power.
Machine Learning
Last year Arm announced their Project Trillium, which would integrate a machine learning unit into a mobile application. We were given a little bit more information on this part as well as news that they have several license partners that will announce products at a later date. We still do not know very much about the architecture of the ML part, but we have some basic diagrams. They are aiming for about 5 TOPS per watt. The module can be scaled from 1 to 8 units. In theory (and if TDP allows) they can achieve over 30 TOPS in a single application (TDP not disclosed).
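Since the TDP was not disclosed, the two quoted figures can at least be combined into a rough estimate. The sketch below assumes the 5 TOPS per watt target holds unchanged as units are added, which Arm did not state, and it treats 30 TOPS as a floor because the claim was "over 30". The function names are made up for the example.

```python
def estimated_power_watts(tops, tops_per_watt=5.0):
    """Power implied by a throughput figure at a fixed efficiency target."""
    return tops / tops_per_watt


def implied_tops_per_unit(total_tops, units):
    """Throughput each unit must supply if the work is split evenly."""
    return total_tops / units


# Figures from the article: 30 TOPS across the maximum of 8 units.
power_at_30_tops = estimated_power_watts(30)
per_unit_at_max_scale = implied_tops_per_unit(30, 8)

print(f"~{power_at_30_tops} W at 30 TOPS, ~{per_unit_at_max_scale} TOPS per unit")
```

Under those assumptions, 30 TOPS would draw about 6 W, and each of the 8 units would need to deliver at least 3.75 TOPS. That power figure is well above a typical phone budget, which may be why the claim came with the "if TDP allows" caveat.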
We will be covering this technology in a separate article as there are interesting implications and applications for edge based ML.
Closing Up
Arm is a major player in the marketplace, no matter what their competitors may say. They control the mobile market unlike any other, and they stay in the good graces of their partners by offering fair deals and consistent improvements in their core IP. The yearly improvements that we have seen for the past five years or so have helped to push the mobile marketplace into an area that few would have expected. Mainstream 3D games being ported to mobile with relatively little difficulty was a dream until recently.
The next generation of Arm products that will be showing up in 2020 will be their best performing, and most feature packed, yet. Their partners will have a tremendous amount of performance and efficiency to lean upon to deliver products that will outshine all that have come before. A77 and G77 will lead to more powerful devices, longer battery life, and a greater availability of VR and ML applications that will be debuting between now and then.
Arm continues to be aggressive and they continue to advance the company in a profitable and sustainable way. They are investing in areas that show a lot of promise, and they are doing it wisely without breaking the bank. We have already seen Arm make a strong impression in the notebook space, and the next year will see them advance that position with greater performance and software compatibility.
Next year’s Editor’s Day should be just as fascinating, if not more.