Cortex-A75 and Mali-G72
The Cortex-A75 is a slightly larger upgrade in terms of performance while retaining the power characteristics of the A73. Much more work has been done to this chip than to the A55, and the results show a pretty impressive increase in overall performance without inflating transistor counts or power consumption. Perhaps what is most interesting here is that the A75 is slightly smaller than the A73 when both are fabricated on the same process node. The A75 is also about 2.5x larger than the A55.
Following the A55, the A75 gets a per-core L2 cache running at full core speed. Previously the A73 had a shared L2 cache. The DynamIQ portion of course retains the shared L3 cache. By associating an individual L2 cache with each core, ARM was able to lower L2 access latency again by about half. L2 caches can be from 256 KB to 512 KB in size. The L2 and L3 cache changes are likely the largest factor in the A75's performance improvement over the older A73. ARM also adds cache stashing, atomic operations, and cache clean to persistence.
The L1 d-cache is 64 KB and four way. It is exclusive of the L2 cache to maximize usable cache space. Inclusive caches can have performance advantages due to snooping and cache accesses in multi-CPU applications, but ARM considers the extra space to be far more important to per core performance. L1 and L2 caches both feature higher throughput and lower latency as compared to A73 units.
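To put rough numbers on the space argument, here is a minimal sketch comparing how much unique data an L1 d-cache plus L2 pair can hold under each policy. It assumes an idealized model that ignores the L1 instruction cache, tags, and the shared L3; the function name is made up for illustration and this is not a description of ARM's actual implementation.

```python
def effective_capacity_kb(l1_kb, l2_kb, exclusive):
    """Unique data (in KB) that one core's L1 d-cache + L2 pair can hold.

    Idealized assumption: an exclusive pair never duplicates a line, while an
    inclusive L2 holds a copy of every line that is also in L1.
    """
    if exclusive:
        # A line lives in L1 or L2, never both, so the capacities add up.
        return l1_kb + l2_kb
    # Every L1 line is duplicated in L2, so L2 bounds the unique data.
    return l2_kb


# L1 d-cache size and the L2 range are the figures quoted in the article.
for l2_kb in (256, 512):
    excl = effective_capacity_kb(64, l2_kb, exclusive=True)
    incl = effective_capacity_kb(64, l2_kb, exclusive=False)
    print(f"L2 {l2_kb} KB: exclusive {excl} KB vs. inclusive {incl} KB")
```

Under this simple model the exclusive arrangement buys back the full 64 KB that an inclusive L2 would spend on duplicates, which matters most at the 256 KB end of the range.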
The front end of the A75 also received quite a bit of attention. The older A73 featured a 2-wide decode unit. When designing that part years back, ARM focused on achieving better power efficiency by rebalancing the front end to more closely match the execution pipelines. Now it seems the bottleneck has shifted with the addition of the individual L2 caches and the large, shared L3 cache. ARM went back to a 3-wide decode stage that features single cycle decode with instruction fusing and micro ops.
Branch prediction has also received much attention, but it is based on the A73 unit and does not share the “neural net” description of what is found in A55. This unit features a 0 cycle branch predict that sustains instruction bandwidth to the cores without introducing bubbles. Any improvement in branch prediction will always be reflected in more efficient use of the execution pipelines so that work can be finished quicker and low power states achieved sooner.
The NEON units have received much the same work as those in the A55. They now support FP16, with double the throughput of single precision workloads. They also feature an Int 8 dot product, a favored format for neural net learning, which does not need high precision.
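For readers unfamiliar with the operation, the following is a small sketch of what an Int 8 dot product with a wider accumulator computes: 8-bit values are multiplied in groups of four and summed into a 32-bit total. The function names are hypothetical, and this models the arithmetic only; it is not meant to reproduce the exact semantics of any particular NEON instruction.

```python
def to_int8(x):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    return ((x + 128) % 256) - 128


def to_int32(x):
    """Wrap an integer into the signed 32-bit range."""
    return ((x + 2**31) % 2**32) - 2**31


def dot4_accumulate(acc, a, b):
    """Add the dot product of two 4-element int8 groups to a 32-bit accumulator."""
    assert len(a) == len(b) == 4
    for x, y in zip(a, b):
        acc += to_int8(x) * to_int8(y)
    return to_int32(acc)


def dot_int8(a, b):
    """Dot product of two int8 vectors, processed four elements at a time."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = dot4_accumulate(acc, a[i:i + 4], b[i:i + 4])
    return acc


weights = [1, -2, 3, 4, 5, 6, -7, 8]
inputs = [8, 7, 6, 5, 4, 3, 2, 1]
print(dot_int8(weights, inputs))
```

The point of the format is that the inputs stay small (one byte each) while the running sum has enough headroom not to overflow, which is why low precision is acceptable for these workloads.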
ARM is also adding features in anticipation of partners inevitably using the A75 for server workloads. RAS functionality is included with error correction throughout. There is an enhanced mesh network that provides much higher scalability than before while improving bandwidth and communication between CPU clusters. This is of course very important for those partners who wish to integrate far more CPU clusters than what we would find in mobile and low power applications.
These changes have all come together to give the A75 about a 20% uplift in most applications while retaining the same power consumption as an A73 part made on the same process node. ARM continues to improve their position in the market by increasing the capabilities of their products as well as consistently improving performance from generation to generation regardless of process nodes used.
Graphics have been grabbing center stage in the PC world for some time, and they are equally important in the mobile world as we are seeing far more immersive graphical applications as well as the challenges of AR and VR. ARM has taken the previous G71 and revised it to a great degree to make the G72. The Mali-G72 is the second “Bifrost” architecture design that ARM has released. Pretty much every part of G72 has been gone over with a fine-toothed comb and enhanced over the previous G71. While it is still based on the relatively new Bifrost architecture, it has been enhanced in terms of both performance and efficiency. This is very much the theme of this refresh cycle for ARM for 2017.
Seemingly no single piece of G72 has been left untouched. It is still a tile-based deferred renderer, as this is the most power and memory efficient architecture for mobile applications. Immediate mode renderers have some good advantages, but they come at the cost of increased power consumption and memory bandwidth. For a PC with a thermal envelope in the hundreds of watts, immediate mode makes more sense. For mobile, deferred renderers rule the roost.
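A back-of-the-envelope sketch shows where the bandwidth difference comes from. It assumes a deliberately crude model: an immediate mode renderer writes every shaded fragment to external memory, while a tiled renderer resolves overdraw in on-chip tile memory and writes each pixel out once. Real GPUs of both kinds use caches and compression, and tilers pay extra geometry traffic, none of which is modeled here; the resolution and overdraw factor are illustrative assumptions, not figures from ARM.

```python
def immediate_mode_bytes(width, height, bytes_per_pixel, overdraw):
    """Color bytes written per frame if every overlapping fragment hits memory."""
    return width * height * bytes_per_pixel * overdraw


def tiled_bytes(width, height, bytes_per_pixel):
    """Color bytes written per frame if each finished pixel is written once."""
    return width * height * bytes_per_pixel


# Assumed scene: 1080p, 32-bit color, each pixel covered three times on average.
imr = immediate_mode_bytes(1920, 1080, 4, overdraw=3)
tbr = tiled_bytes(1920, 1080, 4)
print(f"immediate: {imr / 1e6:.1f} MB/frame, tiled: {tbr / 1e6:.1f} MB/frame")
```

In this toy model the external traffic scales with overdraw for the immediate mode case and stays flat for the tiled case, and external memory traffic is expensive in power terms on a mobile part.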
The G72 is actually smaller than the G71 when using the same process node. I have mentioned before that there is more than one way to skin a cat, and ARM has taken this time to severely optimize the Bifrost architecture to make it faster, smaller, and more efficient. Who among you does not appreciate greater performance with no penalties with die size or power consumption?
Here is a list of changes that ARM has made to G72. It has a larger tile buffer to increase efficiency of quads and tiles. This takes up slightly more space, but in the end saves power due to bandwidth efficiencies. Arithmetic units have been rebalanced, with instructions moved to where they make more sense for most workloads. ARM also identified identical instructions in these workloads and cut them down. They have aggressively optimized complex instructions like reciprocal and reciprocal square root that are common in modern applications and made them twice as fast.
The beauty of a GPU is that many small improvements and optimizations throughout a pipeline are magnified due to these units being scaled up to 32 shader cores. ARM further reworked the instruction cache and changed how eviction works, thus more quickly making more room in this cache. There are bigger caches throughout the tiler, which decrease overall bandwidth needs. Extra L1 cache again takes pressure off of main memory accesses, which improves performance and cuts down on power consumption. The larger L1 cache will consume slightly more power than a smaller cache, but the extra main memory requests it avoids would cost time and additional power. They have also increased the write buffer in the load/store unit and increased L2 cache up to 4 MB.
4x MSAA is essentially free on this TBDR. This becomes much more important in VR applications where the lensed headsets will accentuate aliasing. ARM supports Vulkan with G72, as well as DX 11.2. They are not quite up to DX12 levels yet, but that will eventually be in the works. This is not a big deal as mobile is still not a major target for DX12 development.
VR gets a boost from a hardware based multi-view implementation, which essentially lets one draw call render to both eyes. There is enough commonality between the two views to reduce the amount of work and make VR applications much smoother. This must be implemented in the native engines and apps, but the functionality is there. They have also included asynchronous time warp to again improve the experience in VR. They have added the ability to stop a stream of tasks to insert a new task that is more time dependent. Some have feared that different workloads being run through Mali would inflict a penalty due to context switching, but ARM has enabled each thread to have its own context so the traditional context switching penalty is a thing of the past.
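The multi-view saving is easiest to see by counting submissions. The sketch below is a toy model with hypothetical function names, not a real graphics API: without multi-view the application issues one draw call per object per eye, and with it each object is submitted once and tagged with both views. It only counts CPU-side draw calls; the GPU still has to produce pixels for both eyes.

```python
def submit_without_multiview(objects, views=("left", "right")):
    """One draw call per object per view: the scene is walked once per eye."""
    calls = []
    for view in views:
        for obj in objects:
            calls.append((obj, (view,)))
    return calls


def submit_with_multiview(objects, views=("left", "right")):
    """One draw call per object, broadcast to every view."""
    return [(obj, tuple(views)) for obj in objects]


scene = ["terrain", "character", "skybox"]
print(len(submit_without_multiview(scene)), "draw calls without multi-view")
print(len(submit_with_multiview(scene)), "draw calls with multi-view")
```

Halving the number of submissions mainly relieves the CPU and driver, which is where the shared work between the two nearly identical views can be reclaimed.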
G72 natively supports higher bitrate rendering internally, and there are also blending improvements. It has the ability to do HDR, to store those values, and then blend and write out. While it renders internally at a higher bitrate, it typically does tone mapping at the end of the pipeline.
Per clock performance of G72 is essentially the same as G71, but the power efficiency of the part has improved by about 25%. This will allow partners to increase the clockspeed of G72 over G71 parts to give around a 20% performance boost in most applications. The designs can also have up to 32 shader cores per implementation.
The G72 is a much more mature and efficient part than what we see with G71. This is to be expected, but the refresh of the part has led to nearly a generational jump in overall performance and efficiency. ARM has done a lot to improve the functionality and performance of the part to address the added workloads afforded by AR and VR applications. G72 should be appearing by late 2017, but more likely in an early 2018 time frame.
DynamIQ really is the centerpiece of this release from ARM. It is a more efficient and powerful way to enable better communication and flexibility with computing resources. With this architecture ARM has allowed overall SOC performance to scale to about 20% greater than what we saw from previous generations at the same power envelopes and process nodes.
The changes to A55 and A75 further enhance the advantages that are brought to the stage by DynamIQ. When we throw in the advanced capabilities of the G72 we have a very impressive basis for a next generation SOC to power mobile platforms from 500 mW to 15 watts. Partners will license these parts and very likely get them out in an early 2018 time frame. This is again a far cry from the 3 year lead time from licensing to production that we saw years back from ARM and its partners. The design tools that ARM has implemented and allowed their partners to access have shrunk these design times to unheard-of levels.
The SOCs of 2018 based on these technology pieces will be some of the most efficient, yet powerful mobile parts to be introduced to date. The leaps that ARM has made with their Mali architecture are exceptionally impressive considering where they were even 5 years ago against the competition of Qualcomm and Imagination Technologies. ARM continues to move forward and has introduced another set of competent and performant designs to a marketplace that continues to expand.