Branch Prediction and More

Branch prediction gets another improvement (as we hear from product to product), but this time it feeds into something new: AMD has implemented an Op cache for the first time. Intel has used an Op cache for several generations of products, but AMD was a bit slower on the uptake. An Op cache stores commonly used micro-ops that can be fed directly into the micro-op queue. When an x86 instruction is fed into the four-wide decoder, it is translated into micro-ops. If the front end comes across an instruction that has already been decoded, it fetches the micro-ops from the Op cache instead. This shortens decode time and lowers power consumption, since the decoder does not have to re-decode the instruction or fetch it again from the instruction cache. The four-wide decoder and the Op cache both feed the micro-op queue, which can dispatch six ops per cycle. AMD has worked hard to make sure that the front end does not bottleneck the integer and floating point execution units.
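To make the concept concrete, here is a minimal sketch of a direct-mapped Op cache in C. The geometry, the tag scheme, and the uop_t/decode_x86 names are all illustrative assumptions, not Zen's actual (undisclosed) internal structures; the point is only the hit path that bypasses the decoder.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative micro-op record; real Zen micro-op encodings are not public. */
typedef struct { uint64_t encoded_uop; } uop_t;

#define OPCACHE_SETS 256   /* assumed geometry, for illustration only */

typedef struct {
    uint64_t tag[OPCACHE_SETS];   /* instruction address tag per set */
    uop_t    uop[OPCACHE_SETS];   /* cached decoded micro-op         */
    uint8_t  valid[OPCACHE_SETS];
} opcache_t;

/* Hypothetical slow path standing in for the four-wide x86 decoder. */
static uop_t decode_x86(uint64_t addr)
{
    uop_t u = { addr ^ 0xDECADEULL };   /* dummy "decoded" payload */
    return u;
}

/* On a hit, the decoded micro-op goes straight to the micro-op queue,
 * skipping the decoder; on a miss, decode fully and fill the cache. */
static uop_t fetch_uop(opcache_t *oc, uint64_t addr)
{
    uint32_t set = (uint32_t)((addr >> 2) % OPCACHE_SETS);  /* simple index */
    if (oc->valid[set] && oc->tag[set] == addr)
        return oc->uop[set];                    /* Op cache hit        */

    uop_t u = decode_x86(addr);                 /* miss: full decode   */
    oc->tag[set]   = addr;
    oc->uop[set]   = u;
    oc->valid[set] = 1;
    return u;
}

int main(void)
{
    static opcache_t oc;                 /* zero-initialized: all invalid */
    fetch_uop(&oc, 0x1000);              /* miss: decoded and cached      */
    uop_t u = fetch_uop(&oc, 0x1000);    /* hit: served from the Op cache */
    printf("%llx\n", (unsigned long long)u.encoded_uop);
    return 0;
}
```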

The integer and floating point units have again received a lot of attention to ensure high throughput and low latency. The INT and FP units each feature their own rename and scheduler units as well as their own register files. The INT unit features four ALUs and two AGUs (load/store). The FPU features 2 x 128-bit FMACs (two add pipes and two multiply pipes) that can be combined to provide 256-bit AVX2 support.
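As an illustration of that pairing, the 256-bit fused multiply-add below is a single AVX2/FMA instruction from software's point of view; on a core with 2 x 128-bit FMAC units it would be executed as two 128-bit halves. This is a generic intrinsics example, not AMD code.

```c
#include <immintrin.h>
#include <stdio.h>

/* Compile with: gcc -mavx2 -mfma fma_demo.c */
int main(void)
{
    __m256 a = _mm256_set1_ps(2.0f);
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    /* One 256-bit fused multiply-add: d = a * b + c. On a core with
     * 2 x 128-bit FMAC units, this executes as two 128-bit halves. */
    __m256 d = _mm256_fmadd_ps(a, b, c);

    float out[8];
    _mm256_storeu_ps(out, d);
    printf("%f\n", out[0]);   /* prints 7.000000 */
    return 0;
}
```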

Fetch continues along the path of improvements from previous parts. Decoupled branch prediction, larger TLBs, wider pathways, and faster caches all help to feed the front end while reducing redundant work, improving efficiency and IPC.
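Why does branch prediction matter so much for IPC? The classic demonstration below runs the same data-dependent branch over random and then sorted data; the work is identical, but the sorted pass lets the predictor lock onto the pattern. (A sufficiently aggressive compiler may turn the branch into a conditional move, so treat this as illustrative.)

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* A data-dependent branch: trivially predicted when 'data' is sorted
 * (long runs of taken/not-taken), frequently mispredicted when random. */
long sum_over_threshold(const int *data, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (data[i] >= 128)     /* the predictor learns this pattern */
            sum += data[i];
    return sum;
}

int cmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

int main(void)
{
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    long r1 = sum_over_threshold(data, N);   /* random order: many mispredicts  */
    qsort(data, N, sizeof(int), cmp);
    long r2 = sum_over_threshold(data, N);   /* sorted: near-perfect prediction */
    printf("%ld %ld\n", r1, r2);             /* same result, different speed    */
    return 0;
}
```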

Decode was a problem area for the Bulldozer architecture, which struggled to keep its dual INT cores adequately fed at low latency. AMD has obviously addressed this and rebalanced decode to more adequately fill the revised Zen INT/FPU units with instructions and data. The redesigned four-wide decode and the Op cache are the two key pieces, along with how micro-ops are dispatched to the execution units.

The six micro-op dispatch feeds into the four ALUs and the two AGUs (address generation units). Not every cycle will carry a full six micro-ops, but the machine can handle up to that many. Most likely AMD has tuned the front end to keep dispatch at six as much as possible so as not to waste cycles or execution time. The core can then retire eight ops per cycle. Again, it appears that AMD is really optimizing for throughput, enlarging the register files and raising the retire rate to remove any potential bottlenecks in execution.
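A wide dispatch only pays off when software exposes enough independent work. The sketch below (my example, not AMD's) uses four independent accumulators so the out-of-order scheduler can keep several ALUs busy while the AGUs handle the loads.

```c
#include <stdio.h>

/* Four independent accumulator chains expose instruction-level
 * parallelism: the out-of-order scheduler can keep several of the
 * four ALUs busy each cycle while the two AGUs handle the loads.
 * (n is assumed to be a multiple of four for brevity.) */
int dot_with_scalar(const int *src, int n, int k)
{
    int a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < n; i += 4) {
        a += src[i]     * k;
        b += src[i + 1] * k;
        c += src[i + 2] * k;
        d += src[i + 3] * k;
    }
    return a + b + c + d;   /* chains merge only once, at the end */
}

int main(void)
{
    int v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    printf("%d\n", dot_with_scalar(v, 8, 2));   /* prints 72 */
    return 0;
}
```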

Load/store and the L2 caches see more improvements. Cache sizes have not increased, but latency has come down and bandwidth has gone up.
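Cache latency claims like this are commonly checked with a pointer chase, where each load depends on the one before it. The sketch below is a generic microbenchmark, not AMD's methodology; the 512 KB working-set size is my assumption, chosen to land in a mid-level cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Compile with: gcc -O2 chase.c
 * Each load depends on the previous one, so the time per iteration
 * approximates the load-to-use latency of whichever cache level the
 * working set fits in (512 KB here). */
int main(void)
{
    size_t n = (512 * 1024) / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle:    */
        size_t j = rand() % (i + 1);          /* a random chase order     */
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t; /* defeats prefetch */
    }
    for (size_t i = 0; i < n; i++)            /* link one random cycle    */
        ring[idx[i]] = &ring[idx[(i + 1) % n]];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void **p = ring;
    for (size_t i = 0; i < 100000000; i++)
        p = *p;                               /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%p ~%.2f ns/load\n", (void *)p, ns / 1e8);  /* print p so the loop survives -O2 */
    return 0;
}
```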

From a high level, the floating point unit looks much like the single float unit on the previous-gen Excavator core. It is divided into two MUL/ADD pipelines that are 128 bits wide and can be fused into a 256-bit unit for AVX2 instructions. A deeper queue, a larger register file, and eight-wide retire mean it received much the same treatment as the INT unit. It supports SSE, AVX1, AVX2, AES, SHA, and the legacy MMX and x87 instructions. I believe it has the ability to do a single-cycle 128-bit store, or two-cycle 256-bit stores.
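In intrinsics terms, those two store widths look like this; if the single-cycle/two-cycle split above is right, the 256-bit store would be cracked into two 128-bit halves internally. Again a generic example, not AMD code.

```c
#include <immintrin.h>
#include <stdio.h>

/* Compile with: gcc -mavx store_demo.c */

/* 128-bit store: per the article, likely a single-cycle operation. */
static void store128(float *dst, __m128 v) { _mm_storeu_ps(dst, v); }

/* 256-bit store: if the two-cycle figure is right, the hardware splits
 * this into two 128-bit halves on the store path. */
static void store256(float *dst, __m256 v) { _mm256_storeu_ps(dst, v); }

int main(void)
{
    float buf[8];
    store128(buf, _mm_set1_ps(1.0f));       /* fills buf[0..3] */
    store256(buf, _mm256_set1_ps(2.0f));    /* fills buf[0..7] */
    printf("%f %f\n", buf[0], buf[7]);      /* prints 2.000000 2.000000 */
    return 0;
}
```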
