1 A thought experiment

Suppose I handed you a book and asked you to tell me the 211^th word on the 83^rd page, then timed how long it took you to reply. Let’s call that time t_1.

Suppose I handed you another copy of the same book and asked you to tell me the 211^th word on the 83^rd page, then timed how long it took you to reply. Let’s call that time t_2.

We might reasonably assume that t_2 would be dramatically smaller that t_1. In fact, I’d expect t_2 to not even include the time needed to open the book. t_1 is an estimate of the time needed to find a particular word in a book; t_2 is an estimate of the time needed to recognize that the new book is the same as the old and repeat what you just said.

Now suppose that in-between the first and second request for the 211^th word on the 83^rd page of this book I gave you a hundred other books, pages, and words to tell me. There’s still some chance that you’d remember the first word when the same book, page, and word appeared a second time, but more likely you’d need to look it up again, though perhaps once you got to the page you’d remember where on the page the word was located.

Caching is a computer hardware (and sometimes also software) technique for mimicking this faster-because-we-just-did-this-and-still-remember optimization. Simple single-level caching either remembers a result or does not, but multi-level caching can also have that partially-remembered characteristic that lets a forgotten detail be recovered with only partial work.

Note that in this example, it is important to store more than just the word that is read; you also have to remember the address (book, page, and word) at which it was found. In caching terminology, that where we found it data is known as the tag; the thing being remembered is the block; and the tag + block pair is known as the cache line.

2 Locality

Locality is a pseudo-technical term in computing with two subtypes:

Spatial locality

Refers to two related ideas:

having only a small difference in numerical address
programs where most addresses accessed are spatially local to the address accessed before them.

Temporal locality

Refers to the accesses to a single address being clustered in the same portion of the overall runtime of the program.

Most programs, even if not specifically designed to do so, exhibit both spatial and temporal locality. Caches are designed to increase memory access speeds when either form of locality is present.

2.1 Cache line

When caching memory, it is typical to remember not just the exact memory accesses but also the memory nearby, on the assumption (based on spatial locality) that if one address is accessed, nearby addresses will soon be accessed as well.

The most common way to do this is that when an address x is accessed, all addresses with the same high-order bits as x are loaded into the cache; the data thus fetched is called a block. The tag of a block is not the full address, just the higher-order bits that all bytes in the block share.

Suppose a cache is designed with 64-byte blocks. When accessing address 0x12345678, all bytes with addresses between 0x12345640 and 0x1234567F (i.e., between 0x12345678 & ~0x3F and 0x12345678 | 0x3F) are loaded into the block, and the tag is 0x48d159 (i.e., 0x12345678 >> 6)

The low-order bits of the original address indicate where the addressed information is stored within the block and is known as the block offset or simply the offset.

Comparison to Virtual Memory

There are several parallels between a cache line and a virtual memory page. The common use of low-order bits as an offset into a contiguous chunk of memory is one such parallel; we’ll see later that most caches also break the high-order bits into something used to locate a line and something not so used, like virtual memory’s virtual page numbers and unused bits.

Virtual memory uses these tools to translate an address; if there is no translation, there is no data at that address, and a fault occurs. Caching uses them to accelerate an access; if the cache doesn’t find it, RAM will, and there is no true failure possible.

Virtual memory provides new capabilities to a system via the cooperation of the hardware and OS. Caching is managed entirely by the hardware and is only designed to increase speed, not add features.

3 Three types of caches

A cache is a finite-size memory that stores a subset of recent results of whatever it is caching. If cache is used without other context, it is generally safe to assume that what is being cached is memory accesses, but most of the concepts of caching are independent of that detail.

There are several designs of a cache, varying based on how they decide which old cache line is forgotten to make room for a new cache line.

3.1 Fully-associative

A fully-associative cache stores a set of cache lines. Using an unordered set instead of an array gives maximal ability to pick and choose which line is removed to make room for the next line, but that freedom comes at a cost.

First, on each cache access, a fully-associative cache needs to check the accessed address against the tag of every single cache line in the cache, as the line could be stored anywhere within the cache. Secondly, it is believed that the best rule about which line to replace is the least-recently used (LRU) rule: that is, replace the cache line that has gone the longest since it was last accessed. Computing that requires some additional bookkeeping, needing both more storage and more computation.

3.1.1 Capacity miss

A cache miss is when a cache is accessed but the line desired is not located in it. Some misses are inevitable, no matter the cache design, such as the cold miss (also known as a compulsory miss) when the address is accessed for the very first time. Other misses are caused by the specific design of the cache.

A fully-associative cache can suffer from a capacity miss. A capacity miss occurs when every line in the cache has been replaced since the last access to the line being requested.

3.2 Direct-mapped

A direct-mapped cache stores an array of cache lines. As it is an array, an index is needed to know which line is being accessed; this index is taken from a part of the address.

To maximize the benefits accruing from spatial locality, the index is taken from the bits of the address immediately above the offset, leaving the bits above that as the tag; this is illustrated in Fig 1.

Fig 1: Breakdown of direct-mapped address. The \log_2(bytes per block) low-order bits are the block offset, the next \log_2(lines in cache) are the index, and the remaining bits are the tag.

Direct-mapped caches are very efficient and can be made far larger than fully-associative caches can. However, they are far less flexible and often have a higher miss rate per unit capacity.

3.2.1 Conflict miss

A direct-mapped cache can suffer from a conflict miss (and the cold miss that any cache can experience). A conflict miss occurs when a different line has been loaded into the same index since a requested line was last accessed. While conflict misses are more likely when a cache is small than when it is large, a conflict miss can occur in a direct-mapped cache of any size with as few as three memory accesses.

Suppose a direct-mapped cache has 16-byte blocks and 256 lines. Then, the addresses 0xbf1234, 0xbf1238 and 0xbf9230 all have the same index (0x23). The first two also have the same tag (0xbf1) and hence are not in conflict with one another; the third has a different tag (0xbf9).

If the following memory addresses are accessed in order, the following notes if it is a hit or miss and if a miss, what type.

address 0xbf1234 – cold miss
address 0xbf1238 – hit
address 0xbf9230 – cold miss
address 0xbf1234 – conflict miss

3.3 Set-associative

A set-associative cache is a compromise between the flexibility and cost of a fully-associative cache and the scale and efficiency of a direct-mapped cache. Structurally, it is an array of sets of lines. Set associative caches dominate hardware, and if a cache with more than a few dozen lines is mentioned without specifying the type, it is generally safe to assume it is set-associative.

Addresses are processed by set-associative caches exactly the same way they are for direct-mapped caches: a tag, index, and block offset (see Fig 1). The index, however, instead of identifying a single line, identifies an entire set of lines. As such, it is traditionally called a set index instead of simply an index. The size of the sets are referred to using the term n-way set-associative cache, meaning each set contains n lines.

Neither capacity nor conflict misses are a perfect match for a set-associative cache, though the terms are sometimes used loosely to distinguish between misses that could have been avoided with a different cache design (conflict) and misses that could only have been avoided by having a larger cache (capacity).

Suppose a 2-way set associative set has 16-byte blocks and 256 lines and uses a least-recently used policy to decide which block to replace within a set.

Since there are 2 lines per set, the cache has 128 sets. Then, the cache uses 4 offset bits, and 7 index bits.

The addresses 0xbf1234, 0xbf1238, 0xbf9230, 0xbfaa32 all have the same set index (0x23). The first two also have the same tag (0xbf1) and hence are not in conflict with one another; the third has a different tag (0xbf9).

If the following memory addresses are accessed in order, the following notes if it is a hit or miss and if a miss, what type.

address 0xbf1234 – cold miss
- afterwards: set with index 0x23: 0xbf1230 through 0xbf123f
address 0xbf1238 – hit
- afterwards: set with index 0x23: 0xbf1230 through 0xbf123f
address 0xbf9230 – cold miss
- afterwards: set with index 0x23: 0xbf1230 through 0xbf123f; 0xbf9230 through 0bf923f
address 0xbfaa32 – cold miss
- since set only has 2 lines in it, the least recently used 0xbf1230 through 0xbf123f is replaced
- afterwards: set with index 0x23: 0xbfaa30 through 0xbfaa3f; 0xbf9230 through 0bf923f
address 0xbf1234 – conflict miss
- since set only has 2 lines in it, the least recently used 0xbf9230 through 0xb9123f is replaced
- afterwards: set with index 0x23: 0xbfaa30 through 0xbfaa3f; 0xbf1230 through 0bf123f

4 Hierarchies of memory caches

A block diagram of a cache hierarchy featuring two cores with: each core connected to an L1 instruction cache and an L1 data cache. The L1 caches for each core are in turn connected to an L2 cache labeled as 'unified' for the core. The two L2 caches are then connected to a shared L3 cache. The L3 cache is, in turn, connected to main mmeory. — A possible cache hierarchy for a two-core system. Different systems will have different numbers of levels of cache and different ways of sharing those caches between cores and between data and instructions.

Most modern end-user computers have several caches for the contents of memory arranged in a hierarchy.

L1

The L1 (level 1) caches are the smallest and fastest caches. It is virtually always placed directly on the processor chip itself and sized so that at least one access to L1 can complete each cycle¹. It is often designed to work on virtual addresses rather than physical or to have the set index be entirely within the page offset so that the TLB lookup (see next section) can be performed concurrently with L1 lookup.

Usually, each core will have two separate L1 caches, one for instructions and one for data.

L2–Ln

The L2 cache is larger than the L1, which means both that it is more likely that any given access will hit in the L2 and that each access takes a bit longer. The L3 is bigger (and thus slower) than the L2, and so on.

When data is not in an L1 cache, the L1 cache will ask the L2 cache for that data and so on, until the last level of cache, which will communicate with main memory.

Often multiple L1 caches will use a common L2 cache and multiple L2 caches will use a common L3 cache and so on. As of 2023, most commonly in desktop or laptop processors, there is an L2 cache for each processor core (used for both instructions and data) and one or more L3 caches shared between several processors cores on a single chip.

5 Page table caches (TLBs)

In addition to caches for the contents of memory, processors typically also have caches for page table entries.

These caches are called TLBs (translation lookaside buffers).

A TLB tends to be a small set-associative cache with fairly large sets that holds a single page table entry per block. TLBs only hold the final PTE encountered in translating an address, the one that has the PPN of the final memory page.

The TLB is often a read-only cache. If the page table is modified, the TLB may need to be partly or fully flushed: that is, (some of) its entries are marked as invalid, requiring full page table lookups upon next access.

As of 2023, most commonly a laptop or desktop processor core will have a hierarchy of TLBs similar to the hiearchy of memory content caches: separate L1 data and L1 instruction TLBs per core and also one or more L2 TLBs per core.

6 Handling writes

6.1 Write-through versus write-back

Each cache also has a policy for how memory writes are sent to larger components of the cache hierarchy. The two most common such policies are

Write back

An edit is placed only in the cache itself and not sent to the larger layer beyond it until the cache line is going to be evicted. Write-back policies typically use a dirty bit in each cache line to mark which lines have been edited (are dirty) and need to be written upon eviction and which were read without being modified.

Write-back caches make sense when lines are edited many times before being evicted, as they minimize the number of writes that go to the larger caches.

Write through

Each edit is immediately sent to the next-larger layer of the cache hierarchy. Typically, this means that there is some kind of finite-size write queue between the cache layers and that if it is full, the write happens at the speed of the slower cache instead of the faster, but it also means that evicting a cache line does not itself slow down processing.

Write-through caches make sense when lines are rarely edited before being evicted, as they minimize the delay of the eviction action.

There is no need for all of the cache layers to use the same write policy; a write-back L1 coupled with a write-though L2 is a perfectly doable configuration.

6.2 Write-allocate versus write-no-allocate

Another decision cache designers need to make about writes is whether to add values to the cache when they are written. Adding values being written to the cache is called write-allocate. Write-allocate plicies give the most opportuntity to take advantage of locality (assuming there is locality in the writes made by programs). But adding the value may also require reading from the next level of the cache hierarchy when that would not otherwise be needed: for example, writing an 8-byte value to a 64-byte cache block might require first reading in the remaining 56 bytes of that cache block.

Alternatively, caches might avoid such complicated by using a write-no-allocate policy, where writes to values not yet cached will simply be forwarded to the next level of cache.

7 Cache-efficient code

As guiding principles, code is more cache-efficient if

It performs all accesses to one region of memory at the same time
It uses memory efficiently
Addresses are aligned so no single access spans cache lines
When multiple different regions of memory are being accessed, they differ in set index, not just tag

Writing cache-efficient code is a fairly involved topic, well beyond the scope of this course. It often involves counter-intuitive decisions such as adding more loops to code and using data structures with worse asymptotic behavior but better cache locality. As with other forms of performance optimization, cache-efficient code tends to be harder for humans to read, understand, and adjust, so it should be used only when the resulting speed increases are actually needed.

8 Additional references

Dive into Systems’s section on caching
Smith, Cache Memories (ACM Computing Surveys, 1982)

At least in a pipelined sense. The goal is that an L1 hit does not slow down processing at all.↩︎