# Pipelining

- stalling for slow execution units
- identifying when you need to forward
- load/use hazard (see the C sketch at the end of these notes):
  - (computed in) memory -> (used in) execute: only solved with a stall + forwarding
  - memory -> memory: solved with forwarding only

# Cache / Memory

- direct-mapped versus associative
  - direct-mapped = associative with associativity 1 (one block/set)
  - fully associative = associative with associativity = total # of blocks (one set)
- tag/index/offset and same set/block/etc. (sketch below)
  - 11/001/1 -- tag/index/offset
  - know sizes by # bytes/block (offset), # sets (index), left over (tag)
    - (unique bit pattern for each byte, set)
  - same index: same set
  - same index + same tag: same block
  - quiz question
- parallel vs serial cache accesses to a SINGLE CACHE
  - serial option: check tag for match before looking at data
    - slower but uses less power (no data lookup on a miss)
    - maybe don't fetch all blocks in the set
  - parallel option: check tag for match while fetching data
    - faster but uses more power (start the data read earlier)
- multiple levels of cache/memory
  - access level 1; if that fails, then access level 2
  - almost always serially
- caches and the processor
  - memory stage latency assumes a cache hit
  - if cache miss: pipelined processor will stall
    - wait for "done?" signal
- write-back versus write-through (sketch below)
  - write-through: always send writes to memory IMMEDIATELY
  - write-back: send writes to memory when you'd otherwise forget the value
    - (i.e., on replacement)
    - extra state: "dirty bit" -- do we need to write this back eventually?
      - one bit per block
- DRAM vs SRAM: speed, physical difference
  - DRAM: denser (more bytes per unit area), slower, cheaper --- usually main memory
  - SRAM: less dense, faster, more expensive --- usually caches
- tech trends: CPU speed has improved much faster than memory speed
  - we didn't cover it
- lines versus sets
  - book: line = block + metadata
  - other sources sometimes: line = block
  - set: holds multiple blocks/lines (in a set-associative cache)
- AMAT -- average memory access time (overall cache performance metric; worked example below)
  - hit-latency * hit-rate + miss-latency * miss-rate
  - = hit-latency + miss-PENALTY * miss-rate
    - miss-penalty = miss-latency - hit-latency (the extra time for a miss)
  - different optimizations improve different parts

# Cache Performance/Locality

- cache blocking (access order in for loops; sketch below)
- does the loop have temporal locality? --- accesses something repeatedly
- does the loop have spatial locality? --- accesses (nearly) adjacent things consecutively

# OOO

- precise exceptions and reorder buffers
- data flow model
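A minimal C sketch of the two load-hazard cases from the Pipelining notes, assuming a classic 5-stage pipeline and that the compiler emits the obvious load/add/store sequence; function names are hypothetical:

```c
/* Case 1: memory -> execute (load/use hazard).  The load's value is
 * ready at the end of MEM, but the add wants it at the start of EX
 * one cycle earlier: one stall cycle PLUS MEM->EX forwarding. */
int load_then_add(const int *a) {
    int x = a[0];   /* load */
    return x + 1;   /* add consumes x in the very next instruction */
}

/* Case 2: memory -> memory (load feeding a store).  The store needs
 * the value only when IT reaches MEM, one cycle after the load's MEM
 * stage, so MEM->MEM forwarding alone suffices -- no stall. */
void load_then_store(int *dst, const int *src) {
    *dst = *src;    /* a load immediately followed by a store */
}
```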
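A small C sketch of the tag/index/offset split, instantiating the hypothetical sizes behind the 11/001/1 example above (1 offset bit = 2 bytes/block, 3 index bits = 8 sets, remaining bits = tag):

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 1   /* 2 bytes/block */
#define INDEX_BITS  3   /* 8 sets        */

uint32_t get_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
uint32_t get_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t get_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x33;  /* 0b110011: tag=0b11, index=0b001, offset=0b1 */
    printf("tag=%x index=%x offset=%x\n",
           (unsigned)get_tag(addr), (unsigned)get_index(addr),
           (unsigned)get_offset(addr));
    /* two addresses with the same index map to the same set;
     * same index AND same tag means the same block */
    return 0;
}
```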
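A sketch of per-block state for a write-back cache, modeling the dirty bit in C; the struct layout and the write_back_to_memory stub are hypothetical:

```c
#include <stdint.h>

#define BLOCK_SIZE 64

struct cache_line {
    uint32_t tag;
    uint8_t  valid;
    uint8_t  dirty;            /* set on a write hit; one bit per block */
    uint8_t  data[BLOCK_SIZE];
};

/* stand-in for a real memory write (hypothetical) */
static void write_back_to_memory(const struct cache_line *line) { (void)line; }

/* On replacement: a dirty block must be written to memory before it
 * is forgotten; a clean block can simply be dropped.  A write-through
 * cache needs no dirty bit, since every write already went to memory. */
void evict(struct cache_line *line) {
    if (line->valid && line->dirty)
        write_back_to_memory(line);
    line->valid = 0;
}
```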
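A worked AMAT example with hypothetical numbers: hit latency = 2 cycles, miss rate = 10%, miss penalty = 100 cycles, so miss latency = 2 + 100 = 102 cycles. Both forms of the formula agree: 2 * 0.9 + 102 * 0.1 = 1.8 + 10.2 = 12 cycles, and 2 + 100 * 0.1 = 12 cycles.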
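A C sketch for the locality/blocking section, assuming a row-major int matrix (names and sizes hypothetical): the two sums contrast loop orders, and the tiled transpose shows cache blocking itself:

```c
#define N 1024
#define B 32   /* tile size, chosen so a BxB tile fits in the cache */

/* Good spatial locality: the inner loop walks adjacent elements,
 * so each cache block is fully used before moving on. */
long sum_row_major(const int m[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor spatial locality: consecutive accesses are N*sizeof(int)
 * bytes apart, so each access may touch a different cache block. */
long sum_col_major(const int m[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* Cache blocking: transpose in BxB tiles so the rows of dst touched
 * by the column-wise writes stay within a cache-sized band and get
 * reused (temporal locality) before being evicted. */
void transpose_blocked(int dst[N][N], const int src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```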