Data Prefetch and Translation Look-aside Buffer
This content was originally featured on Amdmb.com and has been converted to PC Perspective’s website. Some color changes and flaws may appear.
Fetching data from physical memory into the processor cache ahead of time is a common technique that processor manufacturers use to increase perceived processor performance. This hiding of memory latency first appeared with the prefetch instructions found in 3DNow! and SSE technologies. It worked, but only if the software was specifically written to take advantage of the capability. Software that didn’t use these instructions still incurred the memory latency that prefetch tries to avoid.
Both the Athlon 4 and Athlon MP processors address this deficiency through the use of hardware data prefetch. AMD explained it best:
“This data prefetch mechanism observes memory accesses looking for regular access patterns (for example, those present in loop-based array data accesses), and speculatively fetches the cache line with the data into the processor’s L2 cache in advance of the actual data access. The Athlon 4 processor [thus Athlon MP as well] automatically optimizes performance on existing software that has not previously been optimized using the data prefetch instructions supported by 3DNow! Professional technology.”
The benefits of the Athlon MP processor’s data prefetching mentioned above are most easily observed in high-end, data-intensive applications that access large arrays of data. This includes databases such as MS SQL and Oracle, as well as scripting languages that use these databases, like ASP and Cold Fusion. An interesting note: this ‘hardware’ prefetching is actually more effective than the software prefetch instructions in 3DNow! Professional technology because the hardware prefetch doesn’t spend any processor instruction execution time. AMD claims that this prefetching optimization is most effective when coupled with system memory with high transfer capability, another argument that DDR memory is the future.
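To make the idea of "observing regular access patterns" concrete, here is a minimal Python sketch of a stride-detecting prefetcher. It is a toy model and not AMD's actual design: the 64-byte line size, the unbounded cache, the two-equal-strides rule, and the names `StridePrefetcher` and `run` are all assumptions made for illustration.

```python
class StridePrefetcher:
    """Toy model of a hardware stride prefetcher (not AMD's real design)."""

    LINE_SIZE = 64  # bytes per cache line; assumed value for illustration

    def __init__(self):
        self.cache = set()        # lines already fetched (unbounded, for simplicity)
        self.last_line = None     # cache line touched by the previous access
        self.last_stride = None   # distance, in lines, between the last two lines
        self.hits = 0             # accesses that found their line already cached
        self.misses = 0           # accesses that had to wait for memory

    def access(self, address):
        line = address // self.LINE_SIZE
        if line in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache.add(line)
        # Only a move to a different line tells us something about the pattern.
        if self.last_line is not None and line != self.last_line:
            stride = line - self.last_line
            # Same stride twice in a row: treat it as a regular pattern
            # and speculatively fetch the next line before it is requested.
            if stride == self.last_stride:
                self.cache.add(line + stride)
            self.last_stride = stride
        self.last_line = line


def run(addresses):
    """Feed a list of byte addresses to the model; return (hits, misses)."""
    prefetcher = StridePrefetcher()
    for address in addresses:
        prefetcher.access(address)
    return prefetcher.hits, prefetcher.misses
```

Walking an array in order, as a loop would, lets the model lock onto the stride after the first few lines, so almost every later line is already cached when it is needed. An irregular sequence of addresses never shows the same stride twice in a row, so nothing is prefetched and the software pays full memory latency on each new line.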
Until reading this documentation, like many of you, I had no idea what a Translation Look-aside Buffer was. A TLB is an additional cache that the processor uses to translate the virtual memory addresses that software works with into the physical addresses that are necessary for actually getting to the data in main memory. Whenever the CPU wants to access information in main memory, it first checks the TLB for the translation of the virtual address. If the translation isn’t there (which happens less than 5% of the time), the CPU has to go ‘look up’ the address manually in the page tables, and that can cost precious CPU cycles. With a successful TLB lookup, memory access can be upwards of 200% faster.
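The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page table is a plain dictionary, the replacement policy is least-recently-used, and the `TLB` class and its fields are hypothetical names, not anything from AMD's documentation. The 4 KB page size is the standard small page on x86.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KB pages, the standard small page size on x86


class TLB:
    """Toy TLB: a small cache of virtual-page -> physical-frame translations."""

    def __init__(self, entries, page_table):
        self.entries = entries          # how many translations the TLB can hold
        self.page_table = page_table    # full mapping, slow to consult
        self.cache = OrderedDict()      # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        # Split the virtual address into a page number and an offset in the page.
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)         # mark as most recently used
        else:
            self.misses += 1
            # Miss: do the slow 'manual' lookup, then remember the result.
            self.cache[vpn] = self.page_table[vpn]
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict least recently used (assumed policy)
        return self.cache[vpn] * PAGE_SIZE + offset
```

With a 40-entry TLB and a page table that maps virtual page 0 to physical frame 7, the first call to `translate(0x10)` misses and consults the page table, while a second access to the same page is served directly from the TLB.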
The Athlon 4 and Athlon MP processors incorporate three TLB microarchitectural optimizations:
1. The L1 DTLB (Data TLB) increases from 32 to 40 entries
2. Both the L2 ITLB (Instruction TLB) and L2 DTLB use an exclusive architecture
3. TLB entries can be speculatively reloaded
While the increase of the L1 DTLB to 40 entries will only marginally increase the TLB hit rate, the exclusive TLB architecture effectively combines the L1 and L2 TLB sizes by ensuring that the L1 TLB does not duplicate entries held in the L2 TLBs. By reducing the number of conflicts caused by duplicate TLB entries within the processor, this increases performance on high-end database-type applications.
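The capacity argument for an exclusive design can be checked with a small simulation. The sketch below is an assumption-laden illustration, not a model of the real hardware: the 256-entry L2 size, the least-recently-used replacement at both levels, and the `count_walks` function are all chosen for the example. Only the 40-entry L1 figure comes from the list above.

```python
from collections import OrderedDict

L1_ENTRIES = 40    # L1 DTLB size quoted in the article
L2_ENTRIES = 256   # assumed L2 DTLB size, for illustration only


def count_walks(pages, exclusive):
    """Return how many accesses miss both TLB levels (page-table walks)."""
    l1, l2 = OrderedDict(), OrderedDict()   # each ordered oldest -> newest
    walks = 0
    for page in pages:
        if page in l1:
            l1.move_to_end(page)
            if not exclusive:
                l2.move_to_end(page)        # inclusive: L2 copy stays fresh too
            continue
        if page in l2:
            if exclusive:
                del l2[page]                # promote: the entry leaves L2
            else:
                l2.move_to_end(page)        # inclusive: L2 keeps its copy
        else:
            walks += 1                      # missed both levels
            if not exclusive:
                l2[page] = True
                if len(l2) > L2_ENTRIES:
                    evicted, _ = l2.popitem(last=False)
                    l1.pop(evicted, None)   # keep L1 a subset of L2
        l1[page] = True
        if len(l1) > L1_ENTRIES:
            victim, _ = l1.popitem(last=False)
            if exclusive:
                l2[victim] = True           # L1 victim moves down into L2
                if len(l2) > L2_ENTRIES:
                    l2.popitem(last=False)
    return walks
```

Looping three times over 296 pages (40 + 256) is a deliberately unfavorable pattern for an inclusive design. In this model the exclusive hierarchy holds all 296 translations after the first pass and needs no further walks, while the inclusive one can only hold 256 distinct translations and keeps evicting entries just before they are reused. Real workloads are far less extreme, so the gain in practice would be smaller than this worst case suggests.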
The third enhancement listed above lets the processor load a missed translation into the TLB sooner than on previous Athlon processors. On the Athlon MP processor, instructions that use the same memory address multiple times will only force a missed TLB lookup once instead of numerous times, thus increasing performance in those same high-end database applications.