## Virtual Memory

Samira Khan Apr 27, 2017

### Virtual Memory

- Idea: Give the programmer the illusion of a large address space while having a small physical memory
  - So that the programmer does not worry about managing physical memory
- Programmer can assume he/she has "infinite" amount of physical memory
- Hardware and software cooperatively and automatically manage the physical memory space to provide the illusion.
  - Illusion is maintained for each independent process

### Basic Mechanism

- Indirection (in addressing)
- Address generated by each instruction in a program is a "virtual address"
  - $\bullet\,$  i.e., it is not the physical address used to address main memory
- An "address translation" mechanism maps this address to a "physical address"
  - Address translation mechanism can be implemented in hardware and software together

"At the heart [...] is the notion that 'address' is a concept **distinct** from 'physical location."" Peter Denning















# Two Problems Two problems with page tables Problem #1: Page table is too large Problem #2: Page table is stored in memory Before every memory access, always fetch the PTE from the slow memory? Large performance penalty















# Per-Process Virtual Address Space • Each process has its own virtual address space • Process X: text editor • Process Y: video player • X writing to its virtual address 0 does not affect the data stored in Y's virtual address 0 (or any other address) • This was the entire purpose of virtual memory • Each process has its own page directory and page tables • On a context switch, the CR3's value must be updated X's PAGE\_DIR Y's PAGE\_DIR

# Two Problems • Two problems with page tables • Problem #1: Page table is too large • Page table has 1M entries • Each entry is 4B (because 4B ≈ 20-bit PPN) • Page table = 4MB (II) • very expensive in the 80s • Solution: Hierarchical page table • Problem #2: Page table is in memory • Before every memory access, always fetch the PTE from the slow memory? → Large performance penalty

## Speeding up Translation with a TLB

- Page table entries (PTEs) are cached in L1 like any other memory word
  - PTEs may be evicted by other data references
  - PTE hit still requires a small L1 delay
- Solution: Translation Lookaside Buffer (TLB)
  - Small set-associative hardware cache in MMU

  - Maps virtual page numbers to physical page numbers
     Contains complete page table entries for small number of pages













# Context Switches • Assume that Process X is running • Process X's VPN 5 is mapped to PPN 100 • The TLB caches this mapping • VPN 5 → PPN 100 • Now assume a context switch to Process Y • Process Y's VPN 5 is mapped to PPN 200 • When Process Y tries to access VPN 5, it searches the TLB • Process Y finds an entry whose tag is 5 • Hurray! It's a TLB hit! • The PPN must be 100! • ... Are you sure?

### Context Switches (cont'd)

- Approach #1. Flush the TLB
  - Whenever there is a context switch, flush the TLB
    - All TLB entries are invalidated

  - Example: 80836
     Updating the value of CR3 signals a context switch
     This automatically triggers a TLB flush
- Approach #2. Associate TLB entries with processes

  - All TLB entries have an extra field in the tag ...
    That identifies the process to which it belongs
    Invalidate only the entries belonging to the old process
  - Example: Modern x86, MIPS

## Handling TLB Misses

- $\bullet$  The TLB is small; it cannot hold  $\underline{\mathsf{all}}$  PTEs
  - Some translations will inevitably miss in the TLB
  - Must access memory to find the appropriate PTE
    - Called walking the page directory/table
    - Large performance penalty
- Who handles TLB misses?
  - 1. Hardware-Managed TLB
  - 2. Software-Managed TLB

### Handling TLB Misses (cont'd)

- Approach #1. Hardware-Managed (e.g., x86)
  - The hardware does the page walk
  - The hardware fetches the PTE and inserts it into the TLB
  - If the TLB is full, the entry **replaces** another entry
  - All of this is done transparently
- Approach #2. Software-Managed (e.g., MIPS)
  - The hardware raises an exception
  - The operating system does the page walk
  - The operating system fetches the PTE
  - The operating system inserts/evicts entries in the TLB

## Handling TLB Misses (cont'd)

- Hardware-Managed TLB
  - Pro: No exceptions. Instruction just stalls
     Pro: Independent instructions may continue

  - Pro: Small footprint (no extra instructions/data)
     Con: Page directory/table organization is etched in stone
- Software-Managed TLB
  - Pro: The OS can design the page directory/table
     Pro: More advanced TLB replacement policy

  - Con: Flushes pipeline
  - Con: Performance overhead

### Address Translation and Caching

- When do we do the address translation?
   Before or after accessing the L1 cache?
- In other words, is the cache virtually addressed or physically addressed?
   Virtual versus physical cache
- What are the issues with a virtually addressed cache?
- Synonym problem:
   Two different virtual addresses can map to the same physical address → same physical address can be present in multiple locations in the cache → can lead to inconsistency in data

Homonyms and Synonyms

- Homonym: Same VA can map to two different PAs
   Why?
   VA is in different processes
- Synonym: Different VAs can map to the same PA

  - Why?

    Different pages can share the same physical frame within or across processes

    Reasons: shared libraries, shared data, copy-on-write pages within the same process, ...
- Do homonyms and synonyms create problems when we have a cache?
  - Is the cache virtually or physically addressed?







### Sanity Check • Core 2 Duo: 32 KB, 8-way set associative, page size ≥ 4K • Cache size $\leq$ (page\_size $\times$ associativity)? • 2<sup>P</sup> = 4K P = 12 • Needs 12 bits for page offset • 2<sup>C</sup> = 32KB, C = 15 Needs 15 bits to address a byte in the cache • 2<sup>A</sup> = 8-way, A = 3 Increasing the associativity of the cache reduces the number of address bits needed to index into the cache Needs 12 bits for cache index and offset, as tags are matched for blocks in the same set • C ≤ P + A ? 15 ≤ 12+3? *True*

### Some Solutions to the Synonym Problem

- Limit cache size to (page size times associativity)
  - get index from page offset
- On a write to a block, search all possible indices that can contain the same physical block, and update/invalidate
  - Used in Alpha 21264, MIPS R10K
- Restrict page placement in OS
   make sure index(VA) = index(PA)

  - Called page coloring
  - Used in many SPARC processors

Today

• Case study: Core i7/Linux memory system













## Virtual Memory

Samira Khan Apr 27, 2017

47