Cortex-A73 Continued and Mali-G71
The designers in Sophia were somehow able to achieve high speeds while cutting the pipeline down to 11 stages, compared to 14 for the A72. ARM really went the other way when it came to “taking it down to 11”. The shorter execution pipeline allows for lower overall latency and higher throughput. The integer execution units have again been beefed up without impacting overall power consumption. The L1 and L2 caches and the memory controllers are highly power optimized, as these are regions with relatively high power consumption in most operations. To improve IPC, a lot of work has been done throughout the front end, including the addition of an out-of-order branch capability. All of these things together help to remove or squash any “bubbles” in the pipeline. Fewer stalls and wasted clock cycles help to improve IPC while allowing these cores to throttle down and go into power saving modes more quickly.
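To make the link between pipeline depth and “bubbles” concrete, here is a rough textbook-style sketch in Python. It is not ARM's model of the A73. The function name, the base CPI, the branch fraction, and the mispredict rate are all hypothetical, and treating the full pipeline depth as the flush penalty is a deliberate simplification.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, flush_penalty_cycles: int) -> float:
    """Average cycles per instruction once branch-mispredict flushes are included.

    Each mispredicted branch is assumed to waste `flush_penalty_cycles` cycles.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 5% of those mispredict.
# Assumption: the flush penalty equals the pipeline depth quoted in the text.
cpi_14_stage = effective_cpi(1.0, 0.20, 0.05, 14)
cpi_11_stage = effective_cpi(1.0, 0.20, 0.05, 11)

print(f"14-stage pipeline: {cpi_14_stage:.2f} cycles per instruction")
print(f"11-stage pipeline: {cpi_11_stage:.2f} cycles per instruction")
```

Under these made-up numbers the shorter pipeline loses fewer cycles per mispredict, which is the effect the paragraph above describes. The sketch says nothing about achievable clock speed, which is the usual cost of cutting stages.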
The NEON unit also had a lot of work done on it to decrease its size and increase its performance. Memory and L2 caches have also seen a lot of work done to keep as much data as close to the cores as possible with aggressive prefetch, dual L2 cache streaming, and enhanced arbitration for interleaving access. Up to 8 MB of L2 cache can be configured, but most implementations will include 1 to 2 MB units.
When all of these things are taken together, the A73 is a smaller and faster core than the A72, regardless of process node. If both are produced on TSMC’s 16nm FF+ process, the A73 will be smaller, perform faster, and consume less power. ARM has certainly taken a “do more with less” approach, and it appears to have paid off with the Cortex-A73.
The first thing we notice about the latest GPU technology from ARM is that they have simplified the nomenclature of their graphics parts to put them more in line with the Cortex series. The previous GPU based on the Midgard architecture is the Mali-T880. The G71 is based on the new “Bifrost” architecture that promises more features and better performance than the previous generation of Mali GPUs.
ARM has had a lot of success with their Mali GPUs, and in fact over 750 million Mali-based GPUs shipped in 2015. ARM is on track to surpass that in 2016. ARM’s partners have integrated this technology into parts that span from TVs to autos to mobile devices. This has been a significant source of growth for ARM and their IP.
The Mali GPUs are still tile-based units (deferred renderers) that build off of previous generations. The Bifrost architecture is aimed at being 1.5x faster than current (2016) parts. It can scale up to 32 cores in premium devices. Again, the rise of VR and AR is pushing designers to produce faster parts that do not increase power consumption (or hopefully decrease it while maintaining performance). The new Vulkan API also provides a fresh push to redesign parts to fully implement that feature set. Unlike previous OpenGL implementations, Vulkan has many mobile features built into the API. This was previously accomplished with OpenGL ES, which was more mobile optimized than the full OpenGL specification.
Mali-G71 also supports heterogeneous computing. The problem with non-heterogeneous compute is that workloads done by the CPU have to be copied into memory, and then copied over to memory that is apportioned to the GPU. Once the GPU works on that data and writes it to its memory, it is then copied back to the CPU memory portion. OpenCL 2.0 introduces Shared Virtual Memory, but ARM has improved upon this concept with fine-grained buffers with full coherency. This can virtually eliminate these memory copy operations, which saves time, bandwidth, and power. Vulkan is fully multi-threaded as well.
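A quick way to see why the copies matter is to count the bytes moved under each model. The Python sketch below is purely illustrative: the function names, buffer size, and round-trip count are hypothetical, and it is not based on any ARM or OpenCL API.

```python
def bytes_copied_with_staging(buffer_bytes: int, round_trips: int) -> int:
    """Copy model from the text: each round trip copies the buffer from the
    CPU's memory region to the GPU's region, then copies the result back."""
    copies_per_round_trip = 2  # CPU -> GPU, then GPU -> CPU
    return copies_per_round_trip * buffer_bytes * round_trips


def bytes_copied_with_shared_memory(buffer_bytes: int, round_trips: int) -> int:
    """Coherent shared-memory model: CPU and GPU work on the same buffer,
    so no staging copies are needed in this idealized case."""
    return 0


# Hypothetical workload: a 4 MiB buffer handed back and forth 10 times.
buffer_bytes = 4 * 1024 * 1024
round_trips = 10

staged = bytes_copied_with_staging(buffer_bytes, round_trips)
shared = bytes_copied_with_shared_memory(buffer_bytes, round_trips)

print(f"staging copies: {staged / (1024 * 1024):.0f} MiB moved")
print(f"shared memory:  {shared / (1024 * 1024):.0f} MiB moved")
```

In this toy case the staging model moves 80 MiB just to shuttle a 4 MiB buffer around, and none of that traffic does useful work. Real coherent systems still pay for cache maintenance and coherency traffic, so zero is an idealization.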
The problem with modern workloads is that they typically stress the entire SOC. 3D graphics, CPU, and video decode/encode are often all working at once. The thermal budget of a SOC is only so large and it has to intelligently apportion out power to individual parts to achieve the best performance without blowing the budget.
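As a toy illustration of apportioning a fixed budget, the Python sketch below scales every block's requested power down proportionally when the total exceeds the budget. Real SoCs rely on far more sophisticated governors, thermal sensors, and voltage/frequency scaling. The function name, the block names, and the milliwatt figures here are all hypothetical.

```python
def apportion_power(demands_mw: dict[str, float], budget_mw: float) -> dict[str, float]:
    """Return a per-block power allocation that never exceeds the budget.

    If total demand fits, every block gets what it asked for. Otherwise each
    block is scaled by the same factor (a simple proportional policy).
    """
    total_demand = sum(demands_mw.values())
    if total_demand <= budget_mw:
        return dict(demands_mw)
    scale = budget_mw / total_demand
    return {block: demand * scale for block, demand in demands_mw.items()}


# Hypothetical demands while gaming and recording video at the same time.
demands = {"cpu": 600.0, "gpu": 900.0, "video": 500.0}
allocation = apportion_power(demands, budget_mw=1000.0)

for block, milliwatts in allocation.items():
    print(f"{block}: {milliwatts:.0f} mW")
```

A proportional policy is the simplest possible choice. A real design would more likely prioritize whichever block is the bottleneck for the current workload.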
When a design group has billions of transistors to work with, the opportunity to more efficiently and effectively utilize them increases dramatically as compared to when designs utilized “only” millions of transistors. Compared to Midgard at the same process tech and conditions, the Bifrost architecture is about 20% better in power efficiency, 40% better in performance density, and about 20% better on memory bandwidth. This is when comparing the two parts with the same number of cores. When we scale up to the full 32 cores in G71, ARM expects performance to be on the level of 2015 discrete laptop GPUs. As a comparison, this entire SOC will be sub-1 watt while a laptop discrete chip will be anywhere from 15 watts to 50 watts.
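The percentages above are easier to reason about when turned into relative costs. The short Python calculation below is a sketch under one stated assumption: that “X% better” means (1 + X) times as much work per unit of energy or area. ARM's own definition may differ, and the helper name is made up for illustration.

```python
def relative_cost(improvement: float) -> float:
    """Cost (energy or area) per unit of work relative to the baseline,
    assuming 'X% better' means (1 + X) times the work per unit of cost."""
    return 1.0 / (1.0 + improvement)


# Figures quoted in the text for Bifrost vs. Midgard, same process and core count.
energy_for_same_work = relative_cost(0.20)  # 20% better power efficiency
area_for_same_perf = relative_cost(0.40)    # 40% better performance density

# Power gap quoted in the text: a sub-1 W SoC vs. a 15 W to 50 W laptop GPU.
soc_watts_upper_bound = 1.0
min_power_ratio = 15.0 / soc_watts_upper_bound
max_power_ratio = 50.0 / soc_watts_upper_bound

print(f"energy for the same work:      {energy_for_same_work:.3f}x of Midgard")
print(f"area for the same performance: {area_for_same_perf:.3f}x of Midgard")
print(f"laptop GPU power vs. SoC:      at least {min_power_ratio:.0f}x to {max_power_ratio:.0f}x")
```

Read this way, Bifrost would need roughly 83% of the energy and 71% of the die area to match Midgard on the same workload, while the laptop parts in the comparison draw at least 15 to 50 times the power of the whole SoC.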
Bifrost features fully unified shader cores, a scalar ISA, clause execution, and full coherency, and it supports Vulkan and OpenCL. It is backwards compatible with OpenGL ES 3.x and below. It delivers more performance per square millimeter and per line of real-world shader code.