CCI-550 and Closing Thoughts
The final major release this cycle is the CCI-550 interconnect. Memory bandwidth is extremely important as we are facing 4K displays and VR/AR applications. Memory transactions can seriously slow down performance as well as consume a lot of power. The CCI-550 was designed to increase usable bandwidth, lower power consumption, and improve latency in time-sensitive workloads.
CCI-550 is fully coherent and features shared virtual memory. This is the basis for more efficient CPU and GPU interactions. Data does not have to be shuffled back and forth between CPU and GPU memory, which improves performance and saves power. It also features the less complex AMBA 4 ACE controllers rather than the AMBA 5 Coherent Hub Interface (CHI). This choice will not affect users, as the A73 cores are meant for consumer products while AMBA 5 CHI is focused on the enterprise. CCI-550 can support up to 6 ACE units.
The heterogeneous compute functionality is a boon to developers. In previous solutions, developers had to spend a lot of time getting workarounds to function for GPU compute. Not only did data need to be copied from one memory region to another, but caches also had to be cleaned whenever changes were made. With full hardware coherency and fine-grained shared virtual memory, no software cache maintenance is required.
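To make the difference concrete, here is a minimal toy model in Python. It is only a sketch: plain byte buffers stand in for CPU and GPU memory, and the function names (run_with_copies, run_shared, invert) are hypothetical, not part of any ARM or GPU API. It simply counts how many bytes are moved in each flow.

```python
def invert(view):
    """Stand-in for a GPU kernel: flips every bit of the buffer in place."""
    for i in range(len(view)):
        view[i] ^= 0xFF


def run_with_copies(host_buf, kernel):
    """Legacy flow (assumed model): separate buffers, explicit copies each way."""
    gpu_buf = bytearray(host_buf)   # copy CPU memory -> GPU memory
    kernel(memoryview(gpu_buf))     # the kernel only sees its private copy
    host_buf[:] = gpu_buf           # copy GPU memory -> CPU memory
    return 2 * len(host_buf)        # bytes moved across the two copies


def run_shared(host_buf, kernel):
    """Shared virtual memory flow (assumed model): both sides use one buffer."""
    kernel(memoryview(host_buf))    # the kernel works directly on the CPU's data
    return 0                        # nothing is copied


legacy = bytearray(b"\x00\x0f\xf0")
shared = bytearray(b"\x00\x0f\xf0")
legacy_bytes_moved = run_with_copies(legacy, invert)
shared_bytes_moved = run_shared(shared, invert)
print(legacy == shared, legacy_bytes_moved, shared_bytes_moved)
```

Both paths produce the same result, but only the legacy path pays for the two copies. On real hardware that path would also need cache cleaning around each copy, which this toy model does not attempt to represent.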
ARM integrated an advanced snoop filter so that fewer cache accesses are needed to determine cache coherency. Coherency checks essentially ping the snoop filter inside the interconnect instead of pinging the individual CPUs. This cuts down on interconnect and CPU traffic dramatically, increasing performance and cutting power consumption. It also reduces main memory accesses, which again improves performance and lowers power. Furthermore, idle cores can stay clocked down because they no longer have to be clocked up whenever cache accesses are performed for data checks. This snoop filter is mandatory for all CCI-550 implementations and has very positive effects on performance and power.
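The effect of a snoop filter can be sketched with a small simulation. This is an illustrative directory model under stated assumptions, not ARM's implementation: the SnoopFilter class, the count_snoops helper, and the five-entry access trace are all made up for the example. It compares how many snoops are sent when every other cache is probed on each request versus when a directory records which caches may hold a line.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory: remembers which caches may hold each cache line."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of cache ids

    def record_fill(self, cache_id, line):
        self.sharers[line].add(cache_id)

    def targets(self, requester, line):
        """Caches that actually need to be snooped for this request."""
        return self.sharers[line] - {requester}


def count_snoops(trace, num_caches, use_filter):
    """Count snoop messages for a trace of (cache_id, line) requests."""
    snoop_filter = SnoopFilter()
    snoops = 0
    for cache_id, line in trace:
        if use_filter:
            snoops += len(snoop_filter.targets(cache_id, line))
        else:
            snoops += num_caches - 1     # broadcast to every other cache
        snoop_filter.record_fill(cache_id, line)
    return snoops


# Hypothetical trace: four caches touching three different lines.
trace = [(0, 0x100), (1, 0x200), (2, 0x100), (3, 0x300), (0, 0x200)]
print(count_snoops(trace, 4, use_filter=False))  # 15
print(count_snoops(trace, 4, use_filter=True))   # 2
```

In this trace only two requests hit a line that another cache already holds, so the filtered case sends 2 snoops where the broadcast case sends 15. The caches that are never probed are the ones that can stay idle.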
There are also embedded QoS (Quality of Service) functions that can optimize memory access depending on the application's time sensitivity. Typically the CPU is extremely latency sensitive and needs data as soon as possible. The GPU and DMA units are less sensitive to latency, but require a lot of bandwidth. Units like the display controller have a maximum latency constraint: when the controller needs a buffer flip to display an image, the data must arrive within a set time and no later. Anything later than that will cause visible problems on the display. Enhanced QoS may not save any power, but it will help to increase overall SOC performance by giving memory access to the parts that need it the most in a timely manner.
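One way to picture this kind of arbitration is an earliest-deadline-first queue. The sketch below is an assumption made for illustration, not a description of how CCI-550 actually schedules traffic: the latency budgets in LATENCY_BUDGET are invented numbers, and service_order is a hypothetical helper.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical latency budgets in cycles (illustrative values only).
LATENCY_BUDGET = {"cpu": 10, "display": 40, "gpu": 200, "dma": 400}


@dataclass(order=True)
class Request:
    deadline: int                       # cycle by which it should be served
    seq: int                            # tie-breaker: arrival order
    master: str = field(compare=False)  # requesting unit, ignored when sorting


def service_order(arrivals):
    """Order pending (arrival_cycle, master) requests by earliest deadline."""
    heap = []
    for seq, (cycle, master) in enumerate(arrivals):
        deadline = cycle + LATENCY_BUDGET[master]
        heapq.heappush(heap, Request(deadline, seq, master))
    return [heapq.heappop(heap).master for _ in range(len(heap))]


arrivals = [(0, "gpu"), (0, "dma"), (5, "display"), (8, "cpu"), (9, "gpu")]
print(service_order(arrivals))
# ['cpu', 'display', 'gpu', 'gpu', 'dma']
```

Even though the GPU and DMA requests arrive first, the CPU and display controller are served ahead of them because their deadlines are tighter. The bandwidth-hungry but latency-tolerant units wait their turn.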
ARM has certainly evolved nicely throughout the years. I can remember that five or six years ago they would announce designs and licenses, and it would be two years before those designs hit the market. They would introduce them, excitement would build about the capabilities of these upcoming parts (think Cortex-A9 or A15), and then it would die down until parts eventually became available to end users years later. Since that time ARM has improved its time to market by implementing its PoP IP program, which delivers designs to partners in a form that cuts down on development time (it delivers RTL design abstractions so less work has to be done by the partners' engineers).
We are hitting a new phase with the announcements of the A73 and G71. Initial products will start coming off fab lines in late 2016 and we will see actual consumer level products in early 2017. We also must look at the overall market changes over the past few years that have changed the company and their product announcements and shipments.
Many thought that Intel would force itself into the market and apply all of its manufacturing might to take over mobile. It did not work out that way. Initially it seemed as though mobile was only a secondary thought behind the designs. New process nodes were not quickly utilized for these parts, and they were often only barely competitive with what ARM and its partners were able to provide. Poor feature sets, a lack of mobile x86 applications, and weak graphics and API support dogged these early chips. Only once Intel finally got serious and offered some huge deals to tablet makers did we start seeing some uptake of Intel parts. Still, these products were low end units which did not offer anything other than basic functionality and mediocre performance as compared to Apple’s tablets. Intel also continued to build these mobile parts on a process generation behind its leading-edge node.
There are likely several reasons for Intel's failure to break into the mobile market. Offhand, I would consider the root of it all to be the x86 instruction set. The x86 ISA was not designed from the outset to be a mobile platform. The very latest x86 parts from Intel work great from 5 watts up to 150 watts. Below 5 watts, power efficiency and performance start to become problematic when compared with a similar ARM SOC in the same power envelope. While Intel made some significant leaps with its mobile focused parts, they were never able to surpass the competition. Intel has retreated from this market and is focusing on its high margin products.
This leaves ARM as the only real player in town when it comes to the mobile marketplace. This does not mean that ARM is ready to rest on its laurels when it comes to developing new technologies and licensing them out. Its licensing agreements also provide impetus for ARM to continue to innovate. Companies such as Qualcomm and AMD, which hold ISA level licenses, can innovate on core designs and graphics to differentiate their parts from those of companies that only license specific cores. ARM’s other partners help to keep the pressure on ARM to continue to innovate with licensable designs such as A73 and G71. This makes for a fairly healthy and competitive marketplace with plenty of room for innovation and differentiation. Throw in a choice of foundries and process nodes and we see a huge spectrum of products that can address even niche markets successfully.
The Cortex-A73 and Mali-G71 are both fascinating products. They are both in some ways simpler than the products that preceded them, but they offer more features, more performance, and more power efficiency. Some years ago I wrote about the slowing down of process technology. To deliver generational performance improvements without relying on the 18 to 24 month cycle of process innovation, chip designers will have to learn how to do more with less. With current SOCs coming in with transistor counts in the billions, there is a lot of room for innovation and optimization without adding billions more transistors. It seems that ARM has done just this with these designs.
These products will see production on multiple process technologies, but the one that is of most interest is of course TSMC’s 10nm FF line. ARM is already expecting A73 silicon on 10nm any day now, but that particular test chip utilizes an older Mali GPU. Products arriving in 2017 will feature 10nm, A73, and G71 technologies all wrapped together. The high end premium phones look to have good CPU performance, but it is the GPU performance that will deliver an experience beyond what could have been imagined years ago. The entire package of A73, G71, and CCI-550 is a system level approach to maximizing performance without exceeding the TDP envelope of modern mobile devices. Expect a significant jump in the overall capabilities of mobile devices come 2017.