- functional unit issue rate/etc.
    Fall 2016 5, 6
- on Exam 3, not 2
    - writing HCL code?
        - no writing, but may need to read or understand how to write
          (no memorizing signals, etc.)
- spatial and temporal locality in loops
    mostly: only innermost loop matters
    exception: not too many iterations of innermost loop
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < M; ++j)
                A[i * N + j]   <-- j changes most frequently; if M is big, this is most of what matters
                B[i * 10]
                C[j * 1000]
    answer question: what sequence of values is accessed?
        if i = 0: (chosen arbitrarily)
            A[0] (j=0), A[1] (j=1), A[2], A[3], ...
            B[0], B[0], B[0], B[0], B[0], B[0], ..., then B[10] (when i = 1)
            C[0], C[1000], C[2000], ...
    temporal locality? are we accessing the same value twice close in time?
        B[0] and B[0] and B[0] ...
        if M is small: yes for C --- C[0], then M iterations later, C[0] again
    spatial locality? are we accessing closeby values close in time?
        A[0] and A[1] are close
        A[1] and A[2] are close
        ...
        B[0] and B[10] kinda close? but not very frequent if M is large (outer loop)
- LRU and identifying misses, etc.
    - cache parameters and access pattern (a small C sketch of this procedure appears at the end of these cache notes):
        (1) find the # of index and offset bits
        (2) find the index of everything being accessed in the cache
            (you can ignore the block offset for finding misses)
            if you have array[x] --> assume the array starts at some address (e.g. 0),
                and work out the index
            hopefully if you have array[x], we've told you enough to know that values
                don't overlap two blocks
        (3) look at each set [accesses with the same index]:
            go in order of accesses, and
            (a) check if it is already in the set [initially, it's not, probably]
                no --> miss, yes --> hit
            (b) if it is a miss, replace something in the set
                - if not all the blocks in the set are used, put it there
                - otherwise, follow the replacement policy
                  LRU: replace the least recently accessed block
                  (e.g. look at previous accesses to blocks in the set)
- fewest misses
    Fall 2017 Exam 2 --- 5
    - random replacement wins:
        1 miss for every 64B block accessed each iteration
            - minus some times where we get lucky
        versus: 1 miss and no times we get lucky
- counting misses with array access
    Fall 2017 Exam 2 --- 9
- cache parameter effects
    - a "catch" --- what else changes?
        given # sets, block size, associativity
            + increase assoc only --> total cache size increases
        given total size, block size, associativity
            + increase assoc only --> # sets decreases
    - with "normal" programs
        > associativity+  --- fewer conflict misses
        > cache size+     --- fewer capacity misses
        > block size+     --- fewer compulsory misses, more conflict misses*
            A[0], A[1], A[2], A[3], ...
            block size 1 --> Miss, Miss, Miss, Miss, ...
            block size 2 --> Miss, Hit (because of A[0]), Miss, Hit (because of A[2]), ...
            [* depending on what else changes; rule of thumb --- fewer sets -> more conflicts]
        > number of sets- --- more conflict misses if # blocks/set is the same
        > write-allocate+ --- fewer misses
- types of misses
    - compulsory --- still a miss with infinite cache size
    - capacity   --- still a miss with associativity = # blocks (fully assoc)
    - conflict   --- not compulsory or capacity (evicting even though there is still space elsewhere)
- write policies: write-back/through, (no-)write-allocate
    - writing to a cache?
        [write-allocate?] if it's not in the cache yet, should I put it there?
        [write-back/write-through?] if it's in the cache (now), do I update just the cache or also memory?
    write-back --> need to track what's out of date ("dirty")
        1 extra bit per block
        otherwise: need to write memory on every eviction, which would be very expensive
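The miss-counting procedure above (find offset/index bits, map each access to a set, then apply LRU within each set) can be written directly as code. Below is a minimal sketch in C, not from the course materials: the geometry (64-byte blocks, 4 sets, 2-way) and the access pattern in main are made-up examples, and cache_access is a hypothetical helper name.

    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64        /* bytes per block -> 6 offset bits (example value) */
    #define NUM_SETS   4         /* -> 2 index bits (example value)                  */
    #define WAYS       2         /* blocks per set = associativity (example value)   */

    typedef struct {
        bool valid;
        unsigned long tag;
        unsigned long last_used; /* timestamp used to pick the LRU victim */
    } Block;

    static Block cache[NUM_SETS][WAYS];

    /* One access: returns true on a hit, false on a miss; fills/evicts on a miss. */
    static bool cache_access(unsigned long addr, unsigned long now) {
        unsigned long block = addr / BLOCK_SIZE;   /* step (1): strip the offset bits */
        unsigned long index = block % NUM_SETS;    /* step (2): index = which set     */
        unsigned long tag   = block / NUM_SETS;
        Block *set = cache[index];

        /* step (3a): already in the set? */
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {
                set[w].last_used = now;
                return true;                       /* hit */
            }
        }

        /* step (3b): miss --- use an empty way if there is one, else evict the LRU way */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) { victim = w; break; }
            if (set[w].last_used < set[victim].last_used) victim = w;
        }
        set[victim].valid = true;
        set[victim].tag = tag;
        set[victim].last_used = now;
        return false;                              /* miss */
    }

    int main(void) {
        /* made-up access pattern: A starts at address 0, 4-byte ints, A[0..7] read twice */
        unsigned long misses = 0, now = 0;
        for (int pass = 0; pass < 2; pass++)
            for (int x = 0; x < 8; x++)
                if (!cache_access(4UL * x, now++))
                    misses++;
        printf("misses: %lu\n", misses);           /* A[0..7] fits in one 64B block -> 1 miss */
        return 0;
    }

Working an exam problem by hand is the same computation: the addr / BLOCK_SIZE and % NUM_SETS lines are steps (1) and (2), and the two loops over the set are step (3).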
- direct v set-assoc v fully-assoc
    associativity = # blocks / set
    direct-mapped     <---> associativity = 1
    fully-associative <---> associativity = # blocks
- bubbling versus stalling
    - two things we call stalling:
        - stalling AN INSTRUCTION --- keeping the instruction from advancing in our pipeline
        - stalling A REGISTER --- keeping the register from being updated
    - stalling instruction D
        stages:             F  D  E  M  W
                            ----------------
        currently:          E  D  C  B  A
        -- normal advance:  F  E  D  C  B
        -- stalling D:      E  D  *  C  B
        to achieve this:
            stall the registers containing values for:
                E (the PC)
                D (the fetch -> decode register)
            make * not be another copy of D --> "bubble" the decode -> execute register
    "bubbling" --> insert a NOP
        bubbling a register --> make it output a nop next cycle
- stalling and bubbling for data hazards
    - most of the time, we can forward without stalling
    - otherwise:
        (1) when is the value we need computed?
        (2) when is the value we need used?
    - example: load/use hazard in our five-stage pipeline
        mrmovq (%rax), %rax
        addq %rax, %rax
        (1) near the end of the memory stage of mrmovq
        (2) near the beginning of the execute stage of addq
        if we can't forward:
            (1) at the end of writeback of mrmovq
            (2) near the beginning of decode of addq
        IMPOSSIBLE (value needed at (2) near the start of the same cycle it is computed at (1)):
            mrmovq  F D E M W     <-- (1) end of M
            addq      F D E       <-- (2) start of E, same cycle as mrmovq's M
        CORRECT BY STALLING AND FORWARDING:
            mrmovq  F D E M W     <-- (1) end of M
            addq      F D D E     <-- (2) start of E, one cycle after mrmovq's M
- data hazards and load/store/etc.
    - load creates hazards which need stalling in the five-stage processor
      because memory is after execute
    - other pipelines would have different hazards
      e.g. two decode stages or different execute stages
- control hazards
    - didn't know correct PC in fetch
    - examples: RET (need to read it from memory), Jxx (need condition codes)
    - solution 1: don't fetch new instructions until we know the PC
        - "stall" the next instruction
    - solution 2: make a guess, and correct it later (branch prediction)
        - later on, replace instructions with nops
        - also, keep any permanent changes from being made
          (e.g. don't let them update condition codes, or restore old values of
          changed registers/memory/etc.)
- aliasing: what it is / why it matters
    - two names for one value
        example:
            int A[100];
            int *x = &A[10];
            int *y = &A[20];
            x[10] is the same thing as y[0]
    - matters because it makes some optimizations we want to do INCORRECT
      (see the sketch below)
        e.g. changing loop order on:
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    A[i * N + j] = B[j * N + i];
            could be wrong if A and B are set up like x, y above
        e.g. removing redundant loads through a pointer
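To see concretely why the loop-interchange example is unsafe under aliasing, here is a minimal sketch in C (made up, not from the course; copy_rowwise, copy_colwise, and the N = 3 buffers are hypothetical names). It runs the original loop order and the interchanged order on two identical buffers where B points one element past A, and prints the positions where the results disagree.

    #include <stdio.h>

    #define N 3

    /* original order: i outer, j inner */
    static void copy_rowwise(int *A, int *B) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                A[i * N + j] = B[j * N + i];
    }

    /* "optimized" order: j outer, i inner --- only equivalent if A and B don't overlap */
    static void copy_colwise(int *A, int *B) {
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                A[i * N + j] = B[j * N + i];
    }

    static void fill(int *p, int n) {
        for (int k = 0; k < n; ++k)
            p[k] = k;
    }

    int main(void) {
        int buf1[2 * N * N], buf2[2 * N * N];
        fill(buf1, 2 * N * N);
        fill(buf2, 2 * N * N);

        /* make A and B alias, like x and y above: B starts one int past A */
        copy_rowwise(&buf1[0], &buf1[1]);
        copy_colwise(&buf2[0], &buf2[1]);

        for (int k = 0; k < N * N; ++k)
            if (buf1[k] != buf2[k])
                printf("differ at A[%d]: %d vs %d\n", k, buf1[k], buf2[k]);
        return 0;
    }

With overlapping arrays, the interchanged loop reads elements that the original order would already have overwritten (and vice versa), so the two orders produce different results; a compiler can only apply this kind of transformation if it can prove A and B don't alias, or if it guards the transformed version with a runtime overlap check.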