CCI-550 and Closing Thoughts
CCI-550
The final major release this cycle is the CCI-550 interconnect. Memory bandwidth is extremely important as we are facing 4K displays and VR/AR applications. Memory transactions can seriously slow down performance as well as consume a lot of power. The CCI-550 was designed to increase usable bandwidth, lower power consumption, and improve latency in time-sensitive workloads.
CCI-550 is fully coherent and features shared virtual memory. This is the basis for more efficient CPU and GPU interactions. Data does not have to be shuffled back and forth between CPU and GPU memory, which saves performance and power. It also features the less complex AMBA 4 ACE controllers rather than the AMBA 5 Coherent Hub Interface. This choice will not affect users, as the A73 cores are aimed at consumer products while AMBA 5 CHI is enterprise focused. CCI-550 can support up to 6 ACE units.
The heterogeneous compute functionality is a boon to developers. In previous solutions developers had to spend a lot of time getting workarounds to function for GPU compute. Not only did data need to be copied and moved from one memory section to another, but caches also had to be cleaned when changes were made. With full hardware coherency and fine-grained Shared Virtual Memory, no software cache maintenance is required.
ARM integrated an advanced snoop filter so that fewer cache accesses are needed to determine cache coherency. Coherency checks essentially consult the filter in the interconnect instead of pinging the individual CPUs. This cuts down on interconnect and CPU traffic dramatically, increasing performance and reducing power consumption. It also reduces main memory accesses, which again improves performance and lowers power. Furthermore, idle cores can stay clocked down because they do not have to be clocked up when cache accesses are performed for data checks. This snoop filter is mandatory for all CCI-550 implementations and has very positive effects on performance and power.
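As a rough illustration of why a snoop filter reduces traffic, here is a toy Python model. It is only a sketch under simplifying assumptions: the `SnoopFilter` class, the master names, and the addresses are hypothetical and do not reflect how CCI-550 is actually implemented. The filter keeps a directory of which masters may hold each cache line, so a read only snoops those masters instead of broadcasting to all of them.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory: maps a cache-line address to the masters that may hold it.

    Hypothetical model for illustration only, not ARM's design.
    """

    def __init__(self, masters):
        self.masters = set(masters)
        self.sharers = defaultdict(set)  # line address -> masters holding it
        self.snoops_sent = 0

    def record_fill(self, master, line):
        # A master has pulled this line into its cache.
        self.sharers[line].add(master)

    def record_evict(self, master, line):
        # A master has dropped this line from its cache.
        self.sharers[line].discard(master)

    def read(self, requester, line):
        """Return the set of masters that must be snooped for this read."""
        targets = self.sharers.get(line, set()) - {requester}
        self.snoops_sent += len(targets)
        self.record_fill(requester, line)
        return targets


def broadcast_targets(masters, requester):
    """Without a filter, every other master is snooped on every read."""
    return set(masters) - {requester}


masters = ["cpu0", "cpu1", "gpu"]
sf = SnoopFilter(masters)
sf.record_fill("cpu0", 0x1000)

hit = sf.read("gpu", 0x1000)    # only cpu0 may hold this line
miss = sf.read("gpu", 0x2000)   # nobody holds it, so no snoops at all
```

In this model a line that no cache holds generates zero snoops, whereas a broadcast scheme would still wake both CPUs for every read.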
There are also embedded QoS (Quality of Service) functions that can serve to optimize memory access depending on the application's time sensitivity. Typically the CPU is extremely latency sensitive and needs data as soon as possible. The GPU or DMA units are less sensitive to latency, but require a lot of bandwidth. Units like the display controller have a maximum latency constraint, so when the controller needs a buffer flip to display an image, it must get it within a set time and no later. Anything later than that time will cause problems on the display for the user. Enhanced QoS may not save any power, but it will help to increase overall SOC performance by optimizing memory access to the parts that need it the most in a timely manner.
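The trade-off described above can be sketched as a tiny arbiter in Python. This is a minimal sketch, assuming a made-up policy: the `Request` fields, the `pick_next` function, and the `urgent_slack` threshold are all hypothetical and are not taken from ARM's QoS implementation. Requests with a hard deadline (such as the display controller) jump the queue once they are close to missing it; otherwise the most latency-sensitive master goes first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    master: str                     # e.g. "cpu", "gpu", "display"
    priority: int                   # lower number = more latency sensitive
    deadline: Optional[int] = None  # cycle by which it must be served, if any


def pick_next(pending, now, urgent_slack=2):
    """Choose the next memory request to serve (hypothetical policy)."""
    # Requests about to miss a hard deadline are served first.
    urgent = [
        r for r in pending
        if r.deadline is not None and r.deadline - now <= urgent_slack
    ]
    if urgent:
        return min(urgent, key=lambda r: r.deadline)
    # Otherwise serve the most latency-sensitive master.
    return min(pending, key=lambda r: r.priority)


pending = [
    Request("cpu", priority=0),
    Request("gpu", priority=2),
    Request("display", priority=1, deadline=10),
]

early = pick_next(pending, now=0)  # plenty of slack: the CPU goes first
late = pick_next(pending, now=9)   # display is about to miss its deadline
```

With plenty of slack the CPU request wins, but as the display deadline approaches the display request is served ahead of everything else.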
Closing Thoughts
ARM has certainly evolved nicely throughout the years. I can remember five or six years ago they would announce designs and licenses and it would be two years before they would hit market. They would introduce them, excitement would build about the capabilities of these upcoming parts (think Cortex-A9 or A15) and then it would die down until eventually parts would become available to end users years later. Since that time ARM has improved its time to market by implementing its PoP IP program, which delivers designs to partners in a form that cuts down time to market (it delivers RTL design abstractions so less work has to be done by the partners’ engineers).
We are hitting a new phase with the announcements of the A73 and G71. Initial products will start coming off fab lines in late 2016 and we will see actual consumer level products in early 2017. We also must look at the overall market changes over the past few years that have changed the company and their product announcements and shipments.
Many thought that Intel would force itself into the market and apply all their manufacturing might to take over mobile. It did not work out that way. Initially it seemed as though mobile was only a secondary thought behind the designs. New process nodes were not quickly utilized for these parts and they were often only barely competitive with what ARM and its partners were able to provide. Poor feature sets, a lack of mobile x86 applications, and poorly featured graphics and API support dogged these early chips. Only once Intel finally started to get serious and offer some huge deals to tablet makers did we start seeing some uptake of Intel parts. Still, these products were low end units which did not offer anything other than basic functionality and mediocre performance as compared to Apple’s tablets. Intel also continued to utilize the n+1 generation of process technologies for these mobile parts.
There are likely several reasons for the failure of Intel to break into the mobile market. Offhand I would consider the very basis for it all to be the x86 instruction set. The x86 ISA was not designed from the outset to be a mobile platform. The very latest x86 parts from Intel work great from 5 watts up to 150 watts. Below 5 watts, things start to become problematic when comparing power efficiency and performance with a similar ARM SOC at that same power envelope. While Intel made some significant leaps with their mobile focused parts, they just were never able to surpass the competition. Intel has retreated from this market and is focusing on their high margin products.
This leaves ARM as the only real player in town when it comes to the mobile marketplace. This does not mean that ARM is ready to rest on their laurels when it comes to developing new technologies and licensing them out. Their licensing agreements also provide impetus for ARM to continue to innovate. Companies such as Qualcomm and AMD which have the ISA level licenses can innovate on core designs and graphics to differentiate their parts from those that only license specific cores. ARM’s other partners help to keep the pressure on ARM to continue to innovate with the licensable designs such as A73 and G71. This makes for a fairly healthy and competitive marketplace with plenty of room for innovation and differentiation. Throw in a choice of foundries and process nodes and we see a huge spectrum of products that can address even niche markets successfully.
The Cortex-A73 and Mali-G71 are both very fascinating products. They are both in some ways simpler than the products that preceded them, but they offer more features, more performance, and more power efficiency. Some years ago I wrote about the slowing down of process technology. To have generational performance improvements without relying on the 18 to 24 month cycle of process innovation, chip designers will have to learn how to do more with less. With current SOCs coming in with transistor counts in the billions, there is a lot of room for innovation and optimization without adding billions more transistors. It seems that ARM has done just this with these designs.
These products will see production on multiple process technologies, but the one that is of most interest is of course TSMC’s 10nm FF line. ARM is already expecting A73 silicon on 10nm any day now, but that particular test chip utilizes an older Mali GPU. Products arriving in 2017 will feature 10nm, A73, and G71 technologies all wrapped together. The high end premium phones look to have good CPU performance, but the GPU performance will deliver an experience beyond what could have been imagined years ago. The entire package of A73, G71, and CCI-550 is a system level approach at maximizing performance without exceeding the TDP envelope of modern mobile devices. Expect a significant jump in overall capabilities of mobile devices come 2017.
Is it the sub milliWatt market Intel has left? I thought power consumption of Mali type chips would be more in the Watt range, unless it’s stand-by power that is being measured?
Otherwise a nice article. Hopefully phones might start lasting all day for longer soon, and not need charging more and more often as your battery degrades over the months, till eventually you are lucky if you get a morning out of them!
It should be "sub 1000 milliwatts"… I gotta find that sentence and change it!
You are missing the slides that show the GPU’s variable latency clause handling: the ability to split a clause and do work on another, unrelated clause while the latency is hidden, which keeps the Quad execution resources operating at a better overall utilization at the clause level of scheduling.
This is described in AnandTech’s deep dive into the Mali-G71/BiFrost micro-arch.(1)
Man, AnandTech’s two articles, one on the new A73/Artemis CPU core’s Micro-Arch(2), and the one on the Mali-G71/BiFrost GPU Micro-Arch, are up there on the same level as Microprocessor Report (paywalled) articles this time around! I hope that AnandTech can keep that author around for when AMD’s K12 needs to be reviewed! That AnandTech author definitely has a chip Arch/Design background, and both of those articles are damn good for a publication that is not paywalled.
It’s also interesting to see talk about Vulkan and SPIR-V as an alternative (Temporary/Not?) for an HSA solution instead of HSAIL, in the AnandTech Mali-G71/BiFrost article.(1)
“From a software standpoint, it’s interesting to note that ARM has gone with an OpenCL 2.0-centric approach, intending to make the functionality accessible through that and related (SPIR-V utilizing) APIs such as Vulkan. G71 however does not support the Heterogeneous System Architecture’s HSAIL standard, this despite ARM being a member of the HSA Foundation. ARM did not have too much to say on the matter, but has stated that they never “totally bought into” HSAIL. OpenCL 2.0, by comparison, is a more generic implementation at the API level, leaving ARM to sort out the low level details as they see fit.
At this point heterogeneous compute is still a long term play for ARM. The potential performance improvements are, in the right scenarios, very significant. And using the GPU instead of the CPU is again a sound move when there’s lots of suitable parallel work to throw at it, especially in SoCs where power efficiency is so critical. But it will take time to bring software developers on board, so while the hardware will soon be here, it will take some time for the software to catch up.”(1)
(1)
“ARM Unveils Next Generation Bifrost GPU Architecture & Mali-G71: The New High-End Mali”
http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71
(2)
“The ARM Cortex A73 – Artemis Unveiled”
http://www.anandtech.com/show/10347/arm-cortex-a73-artemis-unveiled
You should check out page 3 where I describe clause handling and… have a slide showing it!
Yes I looked, but there are some missing clause slides that describe the Variable Clause handling/scheduling on the GPU that are still not included. Read the AnandTech articles; they are very deep dives into the new A73/Mali-G71 micro-archs, and I mean really impressive for any publication that is not paywalled. I hope that they don’t lose that author to the paywalled publications. It’s bad enough that Anand was hired away by Apple, and this new author appears to be really good; he even made some of his own diagrams for his comparisons between ARM Holdings’ A72/related line of CPU micro-archs and ARM Holdings’ A73/related Micro-Archs (in his A73 article). So there is really a lot of work there by the author to dive deep into things, and I’m really impressed by his work!
Your article includes more info on the CCI-550, but AnandTech’s deep dives are really impressive to read, and that’s without Anand being around with his great work/contributions. I really hope AnandTech can keep that author around so he can do an article about AMD’s K12 when it is officially announced.
I could be reading this wrong, but I guess you are hinting that you sorta like that author over at Anand's?
I had 32 slides on 4 pages. We had about 10 press decks to choose from. Sorry I couldn't post every single one. I think there were in total about 150 slides. Really had to squeeze down what to use to give the best overall understanding without just making it one massive slide presentation.
No problems with the total information that you provided, as there is so much new information and new technology coming online. So I’ll read all of the articles across many online sources, and you have covered things that other articles have not included, so I’m reading yours and all the others that are out there. It’s just those extra few slides on the variable latency Clause scheduling on the new Mali GPU that are very interesting in comparison to the other Mobile/desktop GPU makers’ hardware/thread scheduling methods on their GPU SKUs. ARM Holdings have been very busy since they released the A72 and earlier Mali GPUs, and that is some very innovative design/engineering for the A73 and Mali-G71.
Do you have any links to the ARM press/decks offerings? Do any/all of the press decks have PDF/Other links that are allowed to be shared, or even white-papers from ARM Holdings? That’s a lot of technical information that ARM has provided, and with COMPUTEX going on, it must be near impossible to keep up with all the new GPU/CPU/other technology information and hardware that has been premiered over the past month, to go along with the flood of information coming from COMPUTEX, especially from AMD and Nvidia.
It’s going to take months of reading for sure just trying to stay on top of things with so much change happening in such a small amount of time. 2016 is going to get even more interesting with Zen and other information releases coming over the remainder of the year. I guess that everyone at PCPer will be benchmarking/reviewing some new AM4 motherboards and some Bristol Ridge CPU (soon to be swapped out with Zen) SKUs with some Polaris GPUs over the next few weeks until the NDAs fully lift. Thanks for all of your hard work.
FYI – thought it would be worth adding that ARM was a recipient of our 2017 Business Sustainability Award: https://sealawards.com/sustainability-award-2017/