Mali-G71 and Bifrost Continued

The ALUs are now quad based, which improves performance in several ways.  The front end packages instructions into quads, which are then sent to the execution engines.  The instructions in a quad are typically closely related, so they share a lot of data, or their data sits close together in memory or caches.  When that data needs to be accessed, its locality allows for faster access and lower bandwidth needs.  The ratio of execution units to texture units per quad is three to one.  All of the active units in the GPU core are directly connected to the control fabric, so communication happens at low latency and relatively high bandwidth.

Quad vectorization is inherently more efficient than the SIMD vectorization used in the Midgard architecture.  For example, a theoretical workload of three instructions might take four cycles to complete on Midgard.  With the quad setup the instructions are rearranged so that the same workload takes only three cycles; the execution units are more fully utilized and can drop into a lower power state by the fourth cycle.
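The three-versus-four cycle example can be sketched with a toy lane-utilization model.  This is only one way to read the example, not ARM's documented behavior: it assumes a four-lane unit, four threads, and a three-component operation per thread, and the function names are hypothetical.

```python
import math

def simd_cost(components, threads, width=4):
    """Assumed Midgard-style model: each thread issues its vector
    operation on its own, occupying all `width` lanes for a cycle
    even if it only fills `components` of them."""
    cycles = threads * math.ceil(components / width)
    busy_lanes = threads * components
    return cycles, busy_lanes / (cycles * width)

def quad_cost(components, threads, width=4):
    """Assumed quad model: each lane holds one thread, and the quad
    steps through the operation one scalar component per cycle."""
    quads = math.ceil(threads / width)
    cycles = quads * components
    busy_lanes = threads * components
    return cycles, busy_lanes / (cycles * width)

if __name__ == "__main__":
    # Three-component work for four threads on a four-lane unit.
    print("SIMD:", simd_cost(3, 4))   # cycles, lane utilization
    print("Quad:", quad_cost(3, 4))
```

Under these assumptions the SIMD model leaves a quarter of the lanes idle over four cycles, while the quad model finishes in three cycles with every lane busy.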

Clause execution is another innovation that keeps the execution units busy and reduces bubbles.  The GPU can rearrange scheduling so that instructions execute one after another, regardless of dependencies in most cases.  The units work on instructions that are ready and bypass those still waiting on results or other dependencies, which improves performance and decreases latency.  It is not perfect, as only so much work can be done before new data is needed to process further instructions, but it does squash many of the bubbles in the execution units.  Work is done more efficiently, and again the GPU can drop into lower power states sooner than previous architectures could, decreasing power usage and TDP.  There is enough register space to hold results that it is trivial to insert another quad of instructions and get its results even while another quad is waiting on its own.
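A toy scheduler illustrates how switching between quads fills the bubbles described above.  It is a minimal sketch, not Bifrost's actual scheduler: each clause is modeled as a made-up pair of busy cycles and wait cycles (for example, waiting on a texture fetch), and the `run` function and its numbers are hypothetical.

```python
def run(quads):
    """Simulate one execution unit.

    quads: list of quads, each a list of (busy_cycles, wait_cycles)
    clauses.  A quad's next clause cannot start until the wait that
    follows its previous clause has elapsed.
    Returns (total_cycles, idle_cycles).
    """
    ready_at = [0] * len(quads)      # cycle when each quad may run again
    next_clause = [0] * len(quads)   # index of each quad's next clause
    now = idle = 0
    while True:
        pending = [i for i, q in enumerate(quads) if next_clause[i] < len(q)]
        if not pending:
            return now, idle
        ready = [i for i in pending if ready_at[i] <= now]
        if not ready:
            # Bubble: nothing can issue, so skip ahead to the next wake-up.
            wake = min(ready_at[i] for i in pending)
            idle += wake - now
            now = wake
            continue
        i = ready[0]
        busy, wait = quads[i][next_clause[i]]
        now += busy
        ready_at[i] = now + wait
        next_clause[i] += 1

if __name__ == "__main__":
    quad = [(4, 6), (4, 0)]  # 4 busy cycles, a 6-cycle wait, 4 more busy cycles
    for count in (1, 2, 3):
        print(count, "quad(s):", run([quad] * count))
```

With one quad in flight the unit sits idle for the whole wait; with three quads of this shape the waits are fully hidden behind other quads' work.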

The texture unit is very similar to the one included with Midgard, but a few changes have been made.  ARM added a few more conversions to the texture units along with support for new compression formats.  It also dropped some of the older and more obscure texture formats that are no longer used.

The tiler unit has undergone significant improvements.  Geometry loads have grown, and the tiler has changed to support them.  It keeps the same hierarchical binning design as Midgard, but it features redesigned tiler memory structures.  Micro-triangle elimination reduces the number of primitives stored in bin buffers for geometry-dense screens.  This is the first area where culling can be done.  Minimum buffer allocations have been eliminated, and buffer allocation granularity is now much finer.  These changes have enabled a 95% reduction in tiler memory footprint, which again leads to more performance and higher efficiency.
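The idea of dropping tiny triangles before they reach the bin buffers can be sketched as follows.  ARM's exact test is not described here, so this uses a common small-primitive cull as a stand-in: a triangle is discarded if its bounding box contains no pixel center.  The 16-pixel tile size, the flat (non-hierarchical) binning, and the function names are all assumptions for illustration.

```python
import math

def covers_no_sample(tri):
    """True if the triangle's bounding box contains no pixel center
    (centers are assumed to sit at integer + 0.5 on each axis)."""
    for axis in (0, 1):
        lo = min(v[axis] for v in tri)
        hi = max(v[axis] for v in tri)
        # A center i + 0.5 lies in [lo, hi] only if
        # ceil(lo - 0.5) <= floor(hi - 0.5).
        if math.ceil(lo - 0.5) > math.floor(hi - 0.5):
            return True
    return False

def bin_triangles(triangles, tile=16):
    """Conservatively bin surviving triangles by bounding box.
    Returns {(tile_x, tile_y): [triangle indices]}."""
    bins = {}
    for index, tri in enumerate(triangles):
        if covers_no_sample(tri):
            continue  # culled: never stored in any bin
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        for ty in range(math.floor(min(ys) / tile), math.floor(max(ys) / tile) + 1):
            for tx in range(math.floor(min(xs) / tile), math.floor(max(xs) / tile) + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins

if __name__ == "__main__":
    tiny = ((10.1, 10.1), (10.3, 10.1), (10.1, 10.3))
    big = ((0.0, 0.0), (40.0, 0.0), (0.0, 40.0))
    bins = bin_triangles([tiny, big])
    print(len(bins), "bins used; tiny triangle culled:", covers_no_sample(tiny))
```

In a geometry-dense scene, every triangle culled this way is one fewer entry written to the bin buffers, which is where the memory savings would come from.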

We can see from the slides all of the other structural improvements that ARM has implemented with Bifrost.  While this architecture will not be able to push next generation VR (120 Hz, 4K res), it goes a long way towards improving mobile performance in VR and AR.  It will not be a 2016 product, but it will see introduction in 2017.  This is a big step up in per-transistor performance compared to Midgard.  It also includes advanced features such as heterogeneous computing and TrustZone media protection.
