ARM Refreshes All the Things
ARM has redefined the SOC market in 2018.
This past April ARM invited us to Cambridge, England to discuss their plans for the coming year. Quite a bit has changed for the company since our last ARM Tech Day in 2016. They were acquired by SoftBank, but continue to operate essentially as their own company. They now have access to more funding, are less risk averse, and have a greater ability to expand in the ever-growing mobile and IoT markets.
The ARM of today is quite different from the company we knew 10 years ago, when we saw their technology used in the first iPhone. Back then ARM had good technology but a relatively small head count. They kept pace with the industry, but were not nearly as aggressive as other chip companies in some areas. Over the past 10 years they have grown not only in numbers, but in the breadth of technologies they continue to expand upon. The company became more PR savvy and communicated more effectively with the press and, ultimately, their end users. Where ARM once announced new products and did not expect shipping silicon for upwards of three years, the company is now much more aggressive about getting designs into partners' hands, so that production happens in months rather than years.
Several days of meetings and presentations left us a bit overwhelmed by what ARM is bringing to market towards the end of 2017 and, most likely, the beginning of 2018. On the surface it appears that ARM has only refreshed its CPU and GPU products, but once we look at how these products interact with DynamIQ, we see that ARM has changed the mobile computing landscape dramatically. This new computing concept allows greater performance, flexibility, and efficiency in designs. Partners will have far more control over these licensed products, letting them create more value and differentiation than in years past.
We previously covered DynamIQ at PCPer this past March. ARM wanted to seed that concept before jumping into deeper discussions of their latest CPUs and GPUs. Previous Cortex products cannot be used with DynamIQ; to leverage that technology, new CPU designs are required. In this article we are covering the Cortex-A55 and Cortex-A75. On the surface these two new CPUs look like a refresh, but when we dig in we see that massive changes have been wrought throughout. ARM has taken the concepts of the previous A53 and A73 and expanded upon them dramatically, not only to work with DynamIQ but also to remove significant bottlenecks that have impeded theoretical performance.
It is tempting to think of DynamIQ as a discrete structure on an ARM SOC, but it is more complex than that. DynamIQ integrates itself thoroughly into each CPU and allows significant flexibility in design, power delivery, and performance. There are three primary concepts to take away from DynamIQ. First is much-improved communication between CPUs, accelerators, and memory. Second is the added flexibility of mixing and matching CPU core counts and types depending on the partner's focus for a product; each SOC can have differing numbers and types of CPU cores, in contrast to the more rigid setup required by previous big.LITTLE configurations. Third is finer-grained power control over the CPUs, so that each “cluster” has greater control over voltage.
Each CPU cluster attaches to a DSU, or DynamIQ Shared Unit. An SOC can contain multiple DSUs depending on how many different types of CPU cores or accelerators are used. Each DSU features asynchronous bridges to the CPU clusters, which allow the CPUs to run at different clock speeds from the rest of the SOC and from other CPU clusters. DynamIQ allows finer-grained control over the clock speeds of individual cores in a cluster: cores can be clocked up and down depending on performance and efficiency needs. Each cluster shares a single voltage plane, so while this is an improvement over previous big.LITTLE implementations, it is not as fine-grained as we expected. Even so, the ability to clock down or power off individual CPUs in a cluster will be a boon for power consumption, even though voltage regulation happens per cluster rather than per CPU.
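The cluster arrangement described above can be sketched as a simple software model. This is purely illustrative: the class names, voltages, and frequencies are our own assumptions, not anything from ARM's documentation.

```python
# Illustrative model of a DynamIQ cluster: voltage is shared per cluster,
# while each core's clock can be set (or zeroed to power it off) individually.
from dataclasses import dataclass

@dataclass
class Core:
    kind: str          # e.g. "A75" or "A55" (illustrative labels)
    freq_mhz: int = 0  # 0 = powered down

@dataclass
class Cluster:
    cores: list        # all cores here share one voltage plane
    voltage_mv: int    # single voltage for the whole cluster

    def set_core_freq(self, index: int, freq_mhz: int) -> None:
        # Per-core clock control: clock a core up, down, or off
        self.cores[index].freq_mhz = freq_mhz

# One possible configuration: 1x A75 + 7x A55 behind a single DSU
cluster = Cluster(
    cores=[Core("A75")] + [Core("A55") for _ in range(7)],
    voltage_mv=800,  # hypothetical value
)
cluster.set_core_freq(0, 2400)  # boost the big core for a demanding task
cluster.set_core_freq(1, 1500)  # wake one little core for background work
```

The point of the model is the asymmetry it captures: `voltage_mv` lives on the cluster, while `freq_mhz` lives on each core, mirroring the per-cluster voltage plane and per-core clock control described above.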
DynamIQ dramatically enhances the mixing and matching of CPU types. One possible implementation would combine a single A75 with seven A55s, delivering excellent single-threaded performance as well as strong multi-threaded performance at relatively low power consumption compared to previous big.LITTLE configurations.
The L3 cache is contained in the DSU. A partner can remove it entirely if their product does not need it, or scale it up to 4 MB. The L3 cache can be partitioned into up to four parts, and that partitioning also allows portions of the cache to power down. It appears the L3 can run fully powered (high-intensity applications), half powered (video playback), or fully powered down (low-intensity applications like music playback).
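The three L3 power states described above can be summarized in a tiny sketch. The function name and the workload labels are our own illustrative choices, not ARM terminology.

```python
# Sketch of the three apparent L3 power states on a maximum 4 MB configuration.
L3_SIZE_KB = 4096  # DynamIQ L3 can be configured up to 4 MB

def l3_active_kb(workload: str) -> int:
    """Map a workload intensity to the portion of L3 left powered."""
    states = {
        "high": L3_SIZE_KB,         # e.g. gaming: fully powered
        "medium": L3_SIZE_KB // 2,  # e.g. video playback: half powered
        "low": 0,                   # e.g. music playback: fully down
    }
    return states[workload]
```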
If you would like to read more about DynamIQ you can refer back to our article from March when I first covered the technology.
The ARM Cortex-A53 may be the company’s most successful product to date. It can be found in a massive variety of configurations and products, from the low end (dual-core A53) to the highest-end parts (big.LITTLE configurations with the A73). ARM estimates that partners have shipped 1.7 billion chips featuring the A53. It delivers solid performance with power-sipping efficiency, and we have even seen octo-core A53 SOCs. Yet even in such a widely adopted part, there are still improvements to be made.
The new A55 is a largely redesigned part based on the core concepts of the A53. It is based on the latest ARMv8.2 architecture specifications. The A55 is still an in-order CPU that features a very small die size and excellent power efficiency. Out-of-order CPU designs are much more complex and power hungry, so ARM decided to stay the in-order course, yet deliver more performance and higher efficiency.
The biggest change in the A55 over the A53 is that the new part features per-core L2 caches running at core clock speed, plus a large shared L3 cache that likely runs slower; the A53 featured a single L2 cache shared among the CPUs. ARM also decreased the latency to the individual L2s, which has a very large positive effect on an in-order processor. The addition of a relatively large shared L3 also ensures that as much data as possible is available close to the CPU cores without going out to main memory. ARM further optimized all memory accesses across L2, L3, and main memory; this work probably had the largest net positive effect on performance without increasing power consumption across the design. In fact, at the same process node and frequency, the A55 is about 15% more efficient in terms of power consumption.
L1 and L2 TLBs were also enlarged, again to keep more data closer to the execution units. L1 caches can vary in size from 16 KB to 64 KB and are four-way set associative. The L2 caches are configurable up to 256 KB in size; larger caches usually help performance, but they also tend to eat up more power. When we figure in the exclusive structure, lower latency, and features such as cache stashing, we have a much more robust and higher-performance cache setup than the A53’s.
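The configuration space described above can be captured in a small validity check. This is a hedged sketch: the specific L1 step sizes (we assume power-of-two options of 16, 32, and 64 KB between the stated endpoints) and the function name are our assumptions.

```python
# Sketch of the A55 per-core cache configuration space described above.
# Assumption: L1 comes in power-of-two sizes between the stated 16-64 KB
# bounds; L2 is optional (0) up to the stated 256 KB maximum.
def valid_a55_cache_config(l1_kb: int, l2_kb: int) -> bool:
    l1_ok = l1_kb in (16, 32, 64)  # four-way set associative L1
    l2_ok = 0 <= l2_kb <= 256      # private per-core L2, optional
    return l1_ok and l2_ok
```

A partner chasing efficiency might pick `valid_a55_cache_config(16, 64)` territory, while a performance-focused design would sit at `(64, 256)`; both pass the check, reflecting the flexibility ARM gives licensees.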
Data and instructions that can be quickly accessed by the CPU, along with robust branch prediction and prefetch, are again key to achieving good performance with an in-order architecture. Removing or limiting instruction bubbles in the pipeline nets large increases in overall performance. For the A55, ARM developed a new data prefetcher that further exploits the lower latency to the caches and main memory. Branch prediction also gets a boost with what they somewhat laughingly termed a “sorta neural net branch predict, which mostly complies with the definition of a neural net”. This may be a bit of ribbing towards AMD and their new neural net branch prediction, but by all indications it does in fact meet the definition.
There are also significant changes in the execution units. The A53 had a shared load/store pipeline; the A55 has separate load and store pipes, allowing dual-issue load/store where previously only one operation could issue at a time. There are also dual NEON units in the A55 instead of the shared NEON/integer pipeline and single NEON pipe. The NEON units have been improved significantly in terms of throughput and latency: the new units support FP16 formats and can do eight 16-bit ops per cycle, or four 32-bit ops per cycle with dot product instructions, and FMA latency dropped from 8 cycles to 4 cycles. Put together, these improvements make for significant performance gains in many workloads.
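Some back-of-the-envelope arithmetic shows what those NEON throughput figures mean in practice. The clock speed here is a hypothetical example, not an ARM specification.

```python
# Peak NEON throughput from the per-cycle figures quoted above:
# 8 x 16-bit ops per cycle, or 4 x 32-bit ops per cycle.
def neon_peak_ops_per_second(freq_mhz: int, op_width_bits: int) -> int:
    ops_per_cycle = {16: 8, 32: 4}[op_width_bits]
    return freq_mhz * 1_000_000 * ops_per_cycle

# At a hypothetical 1.5 GHz core clock:
# 1500 MHz * 8 ops/cycle = 12 billion 16-bit ops per second (peak)
fp16_peak = neon_peak_ops_per_second(1500, 16)
```

This is peak issue-rate math only; real throughput also depends on how well the improved FMA latency (4 cycles instead of 8) can be hidden by the compiler and the workload.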
These were the major areas of focus for ARM, but of course more was done throughout the design process to improve overall performance and efficiency. A rather obvious lesson here is that in designs comprising billions of transistors, there are many avenues for improving efficiency and performance without inflating die sizes and transistor counts. In other words, there is more than one way to skin a cat. The A55 will be slightly larger than the A53 due to its cache structure and the large L3 cache provided by DynamIQ. Even so, ARM has improved power distribution and gating to achieve not only higher burst clock speeds but also a better overall TDP, enabling longer boosts and improved overall performance.