Cortex-A73 Continued and Mali-G71
The designers in Sophia were able to achieve high clock speeds while cutting the pipeline down to 11 stages, compared to 14 for the A72. ARM really went the other way when it comes to “taking it down to 11”. The shorter execution pipeline allows for lower overall latency and higher throughput. The integer execution units have again been beefed up without impacting overall power consumption. The L1 and L2 caches and the memory controllers are highly power optimized, as these are regions with relatively high power consumption in most operations. To improve IPC, a lot of work has been done throughout the front end, including the addition of an out-of-order branch capability. All of these things together help to remove or squash any “bubbles” in the pipeline. Fewer stalls and wasted clock cycles help to improve IPC while allowing these cores to throttle down and go into power saving modes more quickly.
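To put a rough shape on why a shorter pipeline helps with bubbles, here is a small Python sketch of the textbook "effective CPI" model. It is only an illustration: the workload numbers are made up, and it assumes the cost of a mispredicted branch is roughly equal to the pipeline depth, which is a simplification and not a figure ARM has published for either core.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once branch-mispredict bubbles are added in."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles

# Illustrative workload numbers (assumptions, not measurements of any real core).
BASE_CPI = 0.5          # cycles per instruction with a perfectly full pipeline
BRANCH_FRACTION = 0.2   # share of instructions that are branches
MISPREDICT_RATE = 0.05  # share of branches that are mispredicted

# Assumption: a mispredict costs about one pipeline's worth of cycles.
cpi_14_stage = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 14)
cpi_11_stage = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 11)

# IPC is simply the inverse of CPI.
ipc_14_stage = 1.0 / cpi_14_stage
ipc_11_stage = 1.0 / cpi_11_stage

print(f"14-stage model: CPI {cpi_14_stage:.3f}, IPC {ipc_14_stage:.3f}")
print(f"11-stage model: CPI {cpi_11_stage:.3f}, IPC {ipc_11_stage:.3f}")
```

With everything else held equal, the shorter pipeline loses fewer cycles per mispredict, so the modeled IPC goes up. Real cores differ in far more than pipeline depth, so treat this as intuition only.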
The NEON unit also had a lot of work done on it to decrease its size and increase its performance. Memory and L2 caches have also seen a lot of work done to keep as much data as close to the cores as possible with aggressive prefetch, dual L2 cache streaming, and enhanced arbitration for interleaving access. Up to 8 MB of L2 cache can be configured, but most implementations will include 1 to 2 MB units.
When all of these things are taken together, the A73 is a smaller and faster core than the A72, regardless of process node. If both are produced on TSMC’s 16nm FF+ process, the A73 will be smaller, perform faster, and consume less power. ARM has certainly taken a “do more with less” approach, and it appears to have paid off with the Cortex-A73.
Mali-G71
The first thing we notice about the latest GPU technology from ARM is that they have simplified the nomenclature of their graphics parts to put them more in line with the Cortex series. The previous GPU based on the Midgard architecture is the Mali-T880. The G71 is based on the new “Bifrost” architecture that promises more features and better performance than the previous generation of Mali GPUs.
ARM has had a lot of success with their Mali GPUs and in fact there were over 750 million Mali based GPUs shipped in 2015. ARM is on track to surpass that in 2016. ARM’s partners have integrated this technology in parts that span from TVs, to autos, to other mobile devices. This has been a significant source of growth for ARM and their IP.
The Mali GPUs are still tile based units (deferred renderers) that build on previous generations. The Bifrost architecture aims to be 1.5x faster than current (2016) parts, and it can scale up to 32 cores in premium devices. Again, the rise of VR and AR is pushing designers to produce faster parts that do not increase power consumption (or hopefully decrease it while maintaining performance). The new Vulkan API also gives designers a fresh reason to rework parts so that they fully implement that feature set. Unlike previous OpenGL implementations, Vulkan has many mobile features built into the API. This was previously accomplished with OpenGL-ES, which was more mobile optimized than the full OpenGL specification.
Mali-G71 also supports heterogeneous computing. The problem with non-heterogeneous compute is that workloads done by the CPU have to be copied into memory, and then copied over to memory that is apportioned to the GPU. Once the GPU works on that data and writes it to its memory, it is then copied back to the CPU memory portion. OpenCL 2.0 introduces Shared Virtual Memory, but ARM has improved upon this concept with fine grained buffers with full coherency. This can virtually eliminate those copy operations, which saves time, bandwidth, and power. Vulkan is fully multi-threaded as well.
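The difference between the copy-heavy path and a shared, coherent buffer can be sketched with a toy model in Python. This is not OpenCL or ARM driver code: `CopyCounter`, `gpu_kernel`, `run_with_copies`, and `run_shared` are hypothetical names made up for this illustration, and the "GPU work" is just a loop that bumps each byte.

```python
class CopyCounter:
    """Tracks how many bytes get copied between the CPU and GPU memory regions."""

    def __init__(self):
        self.bytes_copied = 0

    def copy(self, data):
        self.bytes_copied += len(data)
        return bytearray(data)  # a new, separate allocation


def gpu_kernel(buf):
    """Stand-in for GPU work: increment every byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


def run_with_copies(cpu_data, counter):
    """Non-heterogeneous path: copy to the GPU region, compute, copy back."""
    gpu_data = counter.copy(cpu_data)  # CPU region -> GPU region
    gpu_kernel(gpu_data)
    return counter.copy(gpu_data)      # GPU region -> CPU region


def run_shared(shared_data):
    """Coherent shared buffer: CPU and GPU touch the same allocation."""
    gpu_kernel(shared_data)
    return shared_data


counter = CopyCounter()
copied_result = run_with_copies(bytearray(range(16)), counter)
shared_result = run_shared(bytearray(range(16)))

print("bytes copied on the copy path:", counter.bytes_copied)
print("same answer on both paths:", copied_result == shared_result)
```

Both paths produce the same result, but the copy path moves the buffer twice for every round trip while the shared path moves nothing. In a real SoC those copies cost time, bandwidth, and power, which is the saving the article describes.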
The problem with modern workloads is that they typically stress the entire SOC. 3D graphics, CPU, and video decode/encode are often all working at once. The thermal budget of a SOC is only so large and it has to intelligently apportion out power to individual parts to achieve the best performance without blowing the budget.
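As a very simplified picture of what "apportioning power" means, the Python sketch below scales every block's request down by the same factor whenever the total exceeds the budget. The function name, the block names, and the milliwatt figures are all hypothetical, and real SoC power management is far more sophisticated than a proportional split.

```python
def apportion_power(requests_mw, budget_mw):
    """Grant each block its request, scaled down evenly if the total exceeds the budget."""
    total = sum(requests_mw.values())
    if total <= budget_mw:
        return dict(requests_mw)  # everything fits, no throttling needed
    scale = budget_mw / total
    return {block: request * scale for block, request in requests_mw.items()}


# Hypothetical requests in milliwatts while gaming and recording video at once.
requests = {"cpu": 1500, "gpu": 2000, "video": 500}
grants = apportion_power(requests, budget_mw=3000)

for block, granted in grants.items():
    print(f"{block}: requested {requests[block]} mW, granted {granted:.0f} mW")
```

A smarter governor would weigh which block is actually the bottleneck for the current workload rather than cutting everyone equally, which is the "intelligent" part of the problem.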
When a design group has billions of transistors to work with, the opportunity to more efficiently and effectively utilize them increases dramatically as compared to when designs utilized “only” millions of transistors. Compared to Midgard at the same process tech and conditions, the Bifrost architecture is about 20% better in power efficiency, about 40% better in performance density, and about 20% better in bandwidth. This comparison assumes both parts feature the same number of cores. When we scale up to the full 32 cores in G71, ARM expects performance to be at the same level as a 2015 discrete laptop GPU. As a comparison, this entire SOC will be sub-1 watt while a laptop discrete chip will be anywhere from 15 watts to 50 watts.
Bifrost features fully unified shader cores, a scalar ISA, clause execution, and full coherency, and it supports Vulkan and OpenCL. It is backwards compatible with OpenGL-ES 3.x and below. It also delivers more performance per mm sq. and per line of real world shader code.
Is it the sub milliWatt market Intel has left? I thought power consumption of Mali type chips would be more in the Watt range, unless it’s stand-by power that is being measured?
Otherwise a nice article. Hopefully phones might start lasting all day for longer soon, and not need charging more and more often as your battery degrades over the months, till eventually you are lucky if you get a morning out of them!
It should be "sub 1000 milliwatts"… I gotta find that sentence and change it!
You are missing the slides that show the GPU’s variable latency clause handling and its ability to split a clause and work on another, unrelated clause while the latency is hidden, which keeps the quad execution resources at better overall utilization at the clause level of scheduling.
This is described in AnandTech’s deep dive into the Mali-G71/Bifrost micro-arch.(1)
Man, AnandTech’s two articles, one on the new A73/Artemis CPU core’s micro-arch(2) and one on the Mali-G71/Bifrost GPU micro-arch, are up there on the same level as Microprocessor Report (paywalled) articles this time around! I hope that AnandTech can keep that author around for when AMD’s K12 needs to be reviewed! That AnandTech author definitely has a chip arch/design background, and both of those articles are damn good for a publication outside of a paywall.
It’s also interesting to see talk about Vulkan and SPIR-V as an alternative (temporary or not?) to HSAIL for an HSA solution in the AnandTech Mali-G71/Bifrost article.(1)
“From a software standpoint, it’s interesting to note that ARM has gone with an OpenCL 2.0-centric approach, intending to make the functionality accessible through that and related (SPIR-V utilizing) APIs such as Vulkan. G71 however does not support the Heterogeneous System Architecture’s HSAIL standard, this despite ARM being a member of the HSA Foundation. ARM did not have too much to say on the matter, but has stated that they never “totally bought into” HSAIL. OpenCL 2.0, by comparison, is a more generic implementation at the API level, leaving ARM to sort out the low level details as they see fit.
At this point heterogeneous compute is still a long term play for ARM. The potential performance improvements are, in the right scenarios, very significant. And using the GPU instead of the CPU is again a sound move when there’s lots of suitable parallel work to throw at it, especially in SoCs where power efficiency is so critical. But it will take time to bring software developers on board, so while the hardware will soon be here, it will take some time for the software to catch up.”(1)
(1)
“ARM Unveils Next Generation Bifrost GPU Architecture & Mali-G71: The New High-End Mali”
http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71
(2)
“The ARM Cortex A73 – Artemis Unveiled”
http://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled
You should check out page 3 where I describe clause handling and… have a slide showing it!
Yes I looked, but there are some missing clause slides that describe the variable clause handling/scheduling on the GPU and are still not included. Read the AnandTech articles, they are very deep dives into the new A73/Mali-G71 micro-archs, and I mean really impressive for any publication that is not paywalled. I hope that they don’t lose that author to the paywalled publications. It’s bad enough that Anand was hired away by Apple, and this new author appears to be really good. He even made some of his own diagrams for his comparisons between ARM Holdings’ A72/related line of CPU micro-archs and ARM Holdings’ A73/related micro-archs (in his A73 article). So there is really a lot of work there by the author to dive deep into things, and I’m really impressed by his work!
Your article includes more info on the CCI-550, but AnandTech’s deep dives are really impressive to read, and that’s without Anand being around with his great work/contributions. I really hope AnandTech can keep that author around so he can do an article about AMD’s K12 when it is officially announced.
I could be reading this wrong, but I guess you are hinting that you sorta like that author over at Anand's?
I had 32 slides on 4 pages. We had about 10 press decks to choose from. Sorry I couldn't post every single one. I think there were in total about 150 slides. Really had to squeeze down what to use to give the best overall understanding without just making it one massive slide presentation.
No problems with the total information that you provided, as there is so much new information and new technology coming online. So I’ll read all of the articles across many online sources, and you have covered things that other articles have not included, so I’m reading yours and all the others that are out there. It’s just that the extra few slides on the variable latency clause scheduling on the new Mali GPU are very interesting in comparison to the hardware/thread scheduling methods other mobile/desktop GPU makers use on their GPU SKUs. ARM Holdings has been very busy since they released the A72 and earlier Mali GPUs, and that is some very innovative design/engineering for the A73 and Mali-G71.
Do you have any links to the ARM press deck offerings? Do any/all of the press decks have PDF/other links that are allowed to be shared, or even white-papers from ARM Holdings? That’s a lot of technical information that ARM has provided, and with COMPUTEX going on, it must be near impossible to keep up with all the new GPU/CPU/other technology information and hardware that has been premiered over the past month, to go along with the flood of information coming from COMPUTEX, especially from AMD and Nvidia.
It’s going to take months of reading for sure just trying to stay on top of things with so much change happening in such a short amount of time. 2016 is going to get even more interesting with Zen and other information releases coming over the remainder of the year. I guess that everyone at PCPer will be benchmarking/reviewing some new AM4 motherboards and some Bristol Ridge CPU (soon to be swapped out with Zen) SKUs with some Polaris GPUs over the next few weeks until the NDAs fully lift. Thanks for all of your hard work.
FYI – thought it would be worth adding that ARM was a recipient of our 2017 Business Sustainability Award: https://sealawards.com/sustainability-award-2017/