- branch prediction: jXX
    - processor makes a guess
        for our assignment: the guess is that jle foo will go to foo
    - processor fetches based on the guess even though it doesn't know if it's right
    - later on we figure out whether the guess is right
        how much later? example in our processor (pipehw2):

            cycle:  1 2 3
            jle     F D E
                        ^---- evaluate condition codes here
            guess1    F D
            guess2      F
                        ^---- correct wrong guess here

    - if the guess is right: we do nothing
    - if the guess is wrong: we cancel the instructions we guessed
        in our processor: just don't store them in pipeline registers (by storing a nop instead)
- branch prediction: call/ret
    - call doesn't need a prediction in our processor because we figure out the destination
      of the call in the fetch stage
        - if we couldn't (e.g. maybe we had two fetch stages), we'd need to predict it
    - ret: if we made a prediction, we'd predict what the return address is
        - done by many real processors by recording the location of the call in special registers
        - but in the textbook's design and pipehw2: stall instead of predicting
- direct-mapped v. fully-associative v. set-associative
    - these are all special cases of an X-way set-associative cache
        way: number of places to put a value within each set
        set: place to put values, determined from the address
    - if X = 1: we call this "direct-mapped"
        one way: no replacement policy
        # of sets = # of blocks: only one place to put values from each address
        # of index bits = log_2(# of sets)
    - if X = (# of blocks): we call this "fully associative"
        if # of sets = 1: we call this "fully associative" (same as X = # of blocks)
        one set: no set index bits
- capacity/compulsory/conflict
    - compulsory: first time we're accessing that block
        with this definition: changing the block size might change the # of compulsory misses
    - conflict: something about how the cache blocks are organized makes it miss
        if we had maximum flexibility and were maximally clever about where to place blocks,
        then we wouldn't have this miss
        maximum flexibility --> more ways --> fully associative cache
        being clever --> best possible replacement policy
    - capacity: we just don't have enough room to avoid these misses
        even with maximum flexibility + maximal cleverness, we couldn't avoid this miss
        without creating other/more misses
- write-allocate
    write miss: program tries to write and we don't have the value cached --- what do we do?
    write-allocate: bring it into our cache
        hope: the block it is in will be used later, so this will be faster
    write-no-allocate: don't bring it into our cache
        if we only write to the block, don't bother caching it
        (just send the write to the next level of cache)
- write-back/write-through (see the sketch after the miss-counting notes below)
    write-back: when we update our cache due to a write, we do not update the next level yet
        instead, we track that the block is "dirty" and update the next level before we get
        rid of the updated value
        NOTE: if we don't write to our cache at all, then there's nothing to track
    write-through: when we update our cache due to a write, update the next level immediately
        avoids doing updates later
- determining the number of misses with C code
    - if we need to be precise:
        - figure out what sets are accessed
            # of blocks per way in the cache = # of sets
                --> blocks before something maps to the same set
            # of bytes per way in the cache = # of sets * block size
                --> bytes before something maps to the same set
            for example: 6-way 12KB set-associative cache:
                12KB / 6 = 2KB per way in the cache: every 2KB maps to the same set
        - look at each set separately and determine what's in it throughout the loop
            if you know array[0-3] and array[1024-1027] and array[2048-2051] etc.
            are in the same set, write down a list of accesses to that set:
                array[0]     array[0-3] miss
                array[1]     array[0-3] hit
                array[1024]  array[1024-1027] miss    (if two-way: array[1024-1027], array[0-3])
                array[1025]  array[1024-1027] hit
                array[2]     array[0-3] miss
                array[3]
            annotate that list with what will be stored in the set and whether each access
            hits or misses
            if not direct-mapped, need to track (# of ways) values per set
        - either: go through all sets that are accessed, or
          look for a pattern that lets you avoid manually going through all sets
    - approximation: what's being accessed repeatedly in the inner-most loop(s)?
        if it's not too much data to fit in the cache, those accesses should be hits after
        the first access; question: how many first accesses are there?
        if even the innermost loop is too much to fit into the cache, look at subsets of the
        innermost loop to see if there's spatial locality between iterations, etc.
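- a small runnable C sketch of the set-index arithmetic used above; the geometry here
  (16-byte blocks, 256 sets, so 4096 bytes per way) is an assumed example chosen to match
  the array[0]/array[1024]/array[2048] discussion, not any particular machine's cache:

      #include <stdio.h>
      #include <stdint.h>

      #define BLOCK_SIZE 16   /* bytes per block: array[0..3] (4-byte ints) fit in one block */
      #define NUM_SETS   256  /* index bits = log2(256) = 8; 16 * 256 = 4096 bytes per way   */

      static void where(const char *name, uintptr_t addr) {
          uintptr_t offset = addr % BLOCK_SIZE;              /* byte within the block        */
          uintptr_t index  = (addr / BLOCK_SIZE) % NUM_SETS; /* which set                    */
          uintptr_t tag    = addr / BLOCK_SIZE / NUM_SETS;   /* identifies block within set  */
          printf("%-12s offset=%2lu  set index=%3lu  tag=%lu\n", name,
                 (unsigned long)offset, (unsigned long)index, (unsigned long)tag);
      }

      static _Alignas(16) int array[4096];

      int main(void) {
          /* array[0] and array[1024] are 4096 bytes apart = one way's worth of bytes, so */
          /* they land in the same set (same index, different tag) and can conflict       */
          where("array[0]",    (uintptr_t)&array[0]);
          where("array[3]",    (uintptr_t)&array[3]);    /* same block as array[0]  */
          where("array[1024]", (uintptr_t)&array[1024]); /* same set, different tag */
          where("array[2048]", (uintptr_t)&array[2048]);
          return 0;
      }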
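- a minimal sketch of the write-allocate + write-back combination described above, written
  as a toy direct-mapped cache in front of a small simulated next level; all of the names
  and sizes here are invented for illustration:

      #include <stdio.h>
      #include <string.h>
      #include <stdbool.h>

      #define BLOCK_SIZE 16
      #define NUM_SETS   4
      #define MEM_SIZE   1024

      static unsigned char next_level[MEM_SIZE];   /* stands in for the L2 / memory */

      struct line {
          bool valid, dirty;
          unsigned long tag;
          unsigned char data[BLOCK_SIZE];
      };
      static struct line cache[NUM_SETS];          /* direct-mapped: one line per set */

      static void write_byte(unsigned long addr, unsigned char value) {
          unsigned long index = (addr / BLOCK_SIZE) % NUM_SETS;
          unsigned long tag   = addr / BLOCK_SIZE / NUM_SETS;
          struct line *l = &cache[index];

          if (!l->valid || l->tag != tag) {        /* write miss */
              if (l->valid && l->dirty) {          /* write-back: flush the old block first */
                  unsigned long old_addr = (l->tag * NUM_SETS + index) * BLOCK_SIZE;
                  memcpy(&next_level[old_addr], l->data, BLOCK_SIZE);
              }
              /* write-allocate: bring the block containing addr into the cache */
              memcpy(l->data, &next_level[addr - addr % BLOCK_SIZE], BLOCK_SIZE);
              l->valid = true;
              l->tag   = tag;
              l->dirty = false;
          }
          l->data[addr % BLOCK_SIZE] = value;      /* update only the cached copy...     */
          l->dirty = true;                         /* ...and remember to update the next */
                                                   /* level when this block is evicted   */
          /* write-no-allocate + write-through would instead skip the fill above and do  */
          /* next_level[addr] = value immediately, leaving nothing to track              */
      }

      int main(void) {
          write_byte(5, 0xAA);                         /* miss: allocate block 0-15, mark dirty */
          write_byte(6, 0xBB);                         /* hit: still only in the cache          */
          write_byte(5 + NUM_SETS * BLOCK_SIZE, 0xCC); /* maps to the same set: the dirty block */
                                                       /* is written back, then replaced        */
          printf("next_level[5] = 0x%02X (written back on eviction)\n", (unsigned)next_level[5]);
          return 0;
      }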
- loop unrolling
    loops have "overhead":
        loop:
            addq (%r8, %rdi, 8), %rax    <-- "body"           M[r8 + rdi * 8]
            incq %rdi                    <-- "loop overhead"
            cmpq %rsi, %rdi              <-- "loop overhead"
            jle loop                     <-- "loop overhead"
    idea of loop unrolling: avoid doing this loop overhead for every single iteration of the
    loop by doing the iterations in groups
    example with groups of 2:
        loop:
            addq (%r8, %rdi, 8), %rax    <-- "body"
            addq 8(%r8, %rdi, 8), %rax   <-- "body"           M[8 + r8 + rdi * 8]
            addq $2, %rdi                <-- "loop overhead"
            cmpq %rsi, %rdi              <-- "loop overhead"
            jle loop                     <-- "loop overhead"
    since we work in groups of two: we execute the loop overhead half as many times overall
    BUT we need to handle the case where the number of iterations is not a multiple of two
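- the same unrolling-by-2 idea written in C (roughly what the assembly above is doing; the
  function and variable names are made up for this sketch), including the leftover iteration
  when the count is not a multiple of two:

      #include <stdio.h>

      long sum_unrolled(const long *a, long n) {
          long total = 0;
          long i;
          for (i = 0; i + 1 < n; i += 2) {   /* two copies of the body per trip, so the  */
              total += a[i];                 /* loop overhead runs about half as often   */
              total += a[i + 1];
          }
          for (; i < n; i++)                 /* leftover element when n is odd           */
              total += a[i];
          return total;
      }

      int main(void) {
          long a[5] = {1, 2, 3, 4, 5};           /* odd length exercises the leftover loop */
          printf("%ld\n", sum_unrolled(a, 5));   /* prints 15 */
          return 0;
      }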
- cache blocking v. loop unrolling
    cache blocking is about taking some nested loops and changing the order in which the
    inner body is executed
    given:
        for (i ...)
          for (j ...)
            for (k ...)
              BODY
    choose another way of iterating over i, j, k so we still execute BODY for each (i, j, k),
    but we get better locality
    vs loop unrolling: since we still have loops after we do this, we can unroll them ---
    and we will, because it's often faster
    vs loop unrolling: often we'll introduce extra loops, which gives more ways to loop unroll:
        for (i' by 16)
          for (j ...)
            for (k ...)
              for (i from i' to i'+16)    <-- new loop that's good to unroll
                BODY
    the new loop is especially good to unroll:
        we know the exact number of iterations (no leftovers)
        we can potentially unroll it completely ---> no loop overhead left
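- a concrete (hypothetical) instance of this blocking pattern, using the usual matrix-multiply
  example; N and the block size of 16 are assumed values, with N a multiple of 16 so the new
  inner loop has no leftovers:

      #define N 64

      /* same C[i][j] += A[i][k] * B[k][j] updates as the plain i/j/k loop nest (assuming */
      /* C starts zeroed); only the iteration order changes, chosen for better locality   */
      void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
          for (int ii = 0; ii < N; ii += 16)              /* "for (i' by 16)"                 */
              for (int j = 0; j < N; j++)
                  for (int k = 0; k < N; k++)
                      for (int i = ii; i < ii + 16; i++)  /* exactly 16 iterations each time,  */
                          C[i][j] += A[i][k] * B[k][j];   /* so it's easy to unroll completely */
      }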
- out-of-order processors: what can execute at the same time
    - assuming we have big enough instruction queues, reorder buffers, etc. and enough things
      that execute instructions (ALUs, read/write ports on our cache, etc.),
      the only restriction left is "do the instructions depend on each other?"
        b/c they could both be in the instruction queue and we'd just look at
        "are their inputs ready?"
        if they don't depend on each other: their inputs become ready at the same time
        (so they can execute at the same time)
- architectural versus physical registers
    - out-of-order processors have lots of types of data hazards that are more complicated
      than in our pipelined processor
    - with the pipelined processor: we had multiple versions of a register:
        - the version in the register file
        - the version that was just computed in execute, but not stored yet
        - the version that was just computed in memory or sent through the pipeline register
          to memory, but not stored yet
        - the version that was sent to the writeback stage but not stored yet
      we used forwarding to decide on the correct version
    - with an OOO processor: we also have multiple versions, but we track them EXPLICITLY
        - give a new name --- a physical register number --- to each version
            if we can't do this, stall until we can
        - use the physical register numbers to actually figure out how to execute instructions
        - physical register number: determines where the value is stored in our register file
        - typically way more physical registers available than registers in the ISA
    - architectural registers: "old name", "register in the ISA"
    - physical registers: "new name"
      (see the small renaming sketch at the end of these notes)
- inclusive/exclusive caches
    - examples with L1 and L2, but can also consider L2 and L3, etc.
    - inclusive cache: everything in the L1 is also stored in the L2
        if we evict from the L1, the value will still be in the L2 later
        (usually good --- because we often evict values which will still be used)
        if the L2 talks to multiple L1s, then either can get the value from the L2:
            for example: if we have one L2 for multiple processors and they share memory,
            then it doesn't matter which of the two processors is accessing the memory
            location for whether they'll find a value in the L2
        duplicated data: some of the L2 space is basically not used most of the time
    - exclusive cache: everything in the L1 is not stored in the L2
        if we evict from the L1, we probably store the value in the L2 then
            L1 eviction is more complicated
        if the L2 talks to multiple L1s, need to decide what to do when one of the L1 caches
        wants to store a value
        not duplicating data: more of the L2 stores data we might access in the future from it
    - can also have a non-inclusive, non-exclusive cache
        some things in the L1 are stored in the L2, but not everything
----
- latency/throughput bounds
    - throughput bound: "how much can the hardware do (if we don't worry about dependencies)?"
        if we have two ALUs that each take 5 cycles to do a multiply and are pipelined:
            we can start two multiplies every cycle
            even though in practice this might not be possible because of the multiplies
            depending on each other
        for figuring out the throughput bound for a calculation: what's the thing the
        calculation needs that's slowest, assuming we work to reorder things to avoid
        dependencies (e.g. with reassociation/multiple accumulators...)?
            for example: if we want to do adds + load data from the array, we need to consider
            how fast the processor can add and how fast the processor can load data from
            the array
    - latency bound: "how much can the hardware do, given that we have to do this thing one
      at a time?"
        have some operations that need to happen one after another
        performance is determined by how long it takes to get one result and then feed it
        into the next operation
        consider the time from starting a computation to finishing it
        NOT what pipelining helps with
        if we have two ALUs that take 5 cycles to do a multiply and are pipelined:
            we can start the next multiply 5 cycles after the first
            we end up doing one multiply every 5 cycles
        latency bound for a computation: what's the "critical path" in the data flow graph?
            --> critical path = things we need to do one after the other
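- a small C illustration of the latency bound vs throughput bound distinction above: the first
  loop is one long dependence chain, the second uses two accumulators (reassociation) so that
  independent multiplies can overlap; the names and the choice of two accumulators are just
  for this sketch:

      #include <stdio.h>

      /* one long chain: each multiply needs the previous result, so this runs at the   */
      /* latency bound (one multiply result every ~5 cycles in the example above)       */
      double product_single(const double *a, long n) {
          double p = 1.0;
          for (long i = 0; i < n; i++)
              p *= a[i];
          return p;
      }

      /* two independent chains (multiple accumulators): an out-of-order processor can  */
      /* overlap them, moving toward the throughput bound; note this reassociates the   */
      /* floating-point multiplies, which can change rounding slightly                  */
      double product_dual(const double *a, long n) {
          double p0 = 1.0, p1 = 1.0;
          long i;
          for (i = 0; i + 1 < n; i += 2) {
              p0 *= a[i];
              p1 *= a[i + 1];
          }
          for (; i < n; i++)      /* leftover element if n is odd */
              p0 *= a[i];
          return p0 * p1;
      }

      int main(void) {
          double a[5] = {1.5, 2.0, 3.0, 0.5, 4.0};
          printf("%f %f\n", product_single(a, 5), product_dual(a, 5));  /* both print 18.0 */
          return 0;
      }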
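- a toy sketch of the register renaming idea from the architectural-vs-physical-registers notes
  above; the table sizes and names are invented for this sketch, and a real processor tracks
  much more (freeing physical registers at retirement, recovering the map on mispredictions, etc.):

      #include <stdio.h>

      #define NUM_ARCH 16
      #define NUM_PHYS 64

      static int map_table[NUM_ARCH];   /* architectural register -> current physical register */
      static int free_list[NUM_PHYS];   /* stack of unused physical register numbers           */
      static int free_count;

      static void init(void) {
          for (int a = 0; a < NUM_ARCH; a++) map_table[a] = a;   /* r_a starts out in p_a */
          free_count = 0;
          for (int p = NUM_ARCH; p < NUM_PHYS; p++) free_list[free_count++] = p;
      }

      /* rename one instruction "dst = src1 op src2": sources read the current mapping; the   */
      /* destination gets a brand-new physical register, so older in-flight instructions that */
      /* still need the old version of dst are unaffected                                     */
      static void rename(int dst, int src1, int src2) {
          int p1 = map_table[src1], p2 = map_table[src2];
          if (free_count == 0) {                  /* no new name available: stall */
              printf("stall: no free physical registers\n");
              return;
          }
          int pd = free_list[--free_count];
          map_table[dst] = pd;
          printf("r%d = r%d op r%d   -->   p%d = p%d op p%d\n", dst, src1, src2, pd, p1, p2);
      }

      int main(void) {
          init();
          rename(1, 2, 3);   /* r1 = r2 op r3: r1 gets a new physical register              */
          rename(4, 1, 1);   /* r4 = r1 op r1: reads the physical register just given to r1 */
          rename(1, 1, 4);   /* r1 = r1 op r4: yet another physical register for r1         */
          return 0;
      }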