Changelog

Your task

  1. For all of the tasks below, answer the corresponding question on the answer sheet. Remember that for the lab (but not the corresponding homework) you may collaborate with other students; if you do please, note this in the comments on the answer sheet.

Linking

  1. (answer sheet part 1) Using the supplied object file format description

    • translate these Y86 files sum-main.ys and sum.ys into “object files” in our special human-readable format, and copy-and-paste it to the answer sheet. Assume that all labels declared in these assembly files become symbols in the corresponding object file’s symbol table (even without any directives like .global).

    You may (and we would strongly recommend) use the yas (Y86 assembler tool) distributed with HCLRS to help you translate the assembly to machine code. You will still need to identify what relocation and symbol table entries to create manually (yas generates our equivalent of an executable). See the instructions below.

ISA tradeoffs

  1. (answer sheet part 2) Consider a fixed-instruction-length ISA. Complete the table below describing options for instruction layout and register count:

    instruction size (bits) opcode size (bits) maximum register count (2 operands) maximum register count (3 operands)
    8 3 4 2
    16 4 64 16
    16 5 ~ ~
    16 8 ~ ~
    32 10 ~ ~

    (You may assume the ISA only uses power-of-two register counts.)

    Record your answers on the answer sheet.

  2. We have supplied some statistics from instructions in some benchmark programs described below, which will be used for couple of the below questions and some questions on the corresponding homework.

  3. (answer sheet part 3) For three of the benchmarks, we provide statistics about them compiled with 8 registers available and 16 registers available. (For example, for the “gull” program, the CSV file or spreadsheet sheet “gull-with-8-regs” shows statistics from a version compiled with 8 registers available, and “gull” from a version compiled with the standard 16.) Answer the following questions about these statistics:

    • (subpart a) Suppose we are comparing two processors, one with 8 registers with a 2.1 GHz clock rate (2.1 billion clock cycles per second) where instructions that only access registers take 1 cycle and instructions that access memory (either read or write or both) take an average of 3 cycles and another with 16 registers, a 2GHz clock rate and the same cycles per instruction.

      For the “queens” and “blocked matmul” programs, how much slower or faster would the benchmark programs be on the 16-register procesor based on our benchmark results? Write the estimate you compute to at least two significant figures and describe the calculation you performed.

      (Note that to compute the number instructions that access memory, you’ll need to use the supplied counts of instructions that read memory, that write memory, and that both read and write memory.)

    • (subpart b) Typically less available registers means that the compiler needs to generate more instructions to handle the extra copying to and from registers. In the “gull”, “linear solve” and “queens” program, this means fewer total instructions are executed in the 16 registers version than the 8 register version. But this is not true for the “blocked matmul” program. How this happened for the “blocked matmul” program is related to x86-64 allowing most instructions to compute on values in memory (unlike Y86-64, where they must be moved into a register first). Give an example of how a compiler would generate code with fewer instructions when computing on a value primarily kept in memory versus primarily kept in a register.

      [Hint: consider what the compiler has to do when it’s done computing on a variable that it can only temporarily store in a register.]

using YAS

Our textbook authors have written a Y86 assembler, which we distribute with HCLRS. Unlike a normal assembler, yas does not generate the object files (despite using the extension .yo); it assumes that its assembly input is a self-contained program so no separate linking step is needed (and therefore no file with linking information is needed). To use yas:

  1. First download hclrs.tar.

  2. Extract hclrs.tar. You can do this from the command line using tar xvf hclrs.tar, which will create an hclrs directory.

  3. In the hclrs directory, run make.

  4. In a text editor, create an example Y86 assembly program simple.ys:

    start:
        irmovq $100, %rax
        irmovq $1, %rbx
        irmovq $3, %rcx
        irmovq $4, %rdx
    loop:
        subq %rbx, %rax
        addq %rdx, %rcx
        jg loop
        halt
    
  5. Run tools/yas simple.ys to produce simple.yo:

    0x000:                      |         start:
    0x000: 30f06400000000000000 |             irmovq $100, %rax
    0x00a: 30f30100000000000000 |             irmovq $1, %rbx
    0x014: 30f10300000000000000 |             irmovq $3, %rcx
    0x01e: 30f20400000000000000 |             irmovq $4, %rdx
    0x028:                      |         loop:
    0x028: 6130                 |             subq %rbx, %rax
    0x02a: 6021                 |             addq %rdx, %rcx
    0x02c: 762800000000000000   |             jg loop
    0x035: 00                   |             halt
    
  6. This file includes the machine code for the assembly program on the right-hand side. For example, reading the line 0x00a: 30f30100000000000000 tells us that the byte with index 0xa is 0x30, the byte with index 0xb is 0xf3, with 0xc is 0x01`, and so on.

    Note that this format is deliberately very similar to what our object file format expects, so you can copy-and-paste most of the instructions.

  7. yas only handles complete assembly programs, so if you try to assemble a file which references a label from elsewhere it will fail. To deal with this, I would recommend replacing the label with one that’s actually present in the assembly file so yas will work, and then manually determining where that label’s address was placed.

benchmark statistics

We’ve built a tool that records the activity of x86-64 Linux programs and reports statistics about the instructions executed by that program. We’ve supplied the output of this program on several different benchmark programs:

For the first four programs, we provide two sets of statistics: one from the program’s compiled normally and one from them compiled restricted to 8 registers.

You can download these statistics in this archive (or, alternately, in this Google Spreadsheet). The CSV files each contain a table, whose columns are labeled in the second row of the output and are:

There are separate CSV files for each benchmark and a “combined.csv” file which merges them together (in case you find that more convenient to work with).

The “types” we’ve categorized instructions into are not mutually exclusive, for example:

(In addition, we supply the source code for the custom programs involved.)

Note that some of the categories of statistics overlap; for example, we report both “instructions that write memory” and “non-mov/cmov/push instructions that write memory”.

counting

Our statistics include counts of both the number of times an instruction is present and the number of times it is actually “issued” (or executed). To give an example to explain what these means, given the program:

      main:
    movq $0, %rax
start_loop:
    addq $1, %rax
    cmp $20, %rax
    jle start_loop
    ret

    

we would say that the cmp instruction is in the program 1 time and is issued 20 times (the number of iterations the loop is executed).

Our statistics categorize instructions into categories like “register-to-register unconditional mov”. Many categories we provide overlap.

(I believe these statistics are mostly correct, but given the complexities of obtaining them and interpreting x86-64 instructions, it’s likely there are some inaccuracies.)

calculating with statistics example

Let’s suppose we wanted to calculate what portion of executed instructions read or wrote memory. The most relevant categories in the statistics files are:

  1. “instructions that write memory”
  2. “non-mov/cmov/push instructions that write memory”
  3. “instructions that read memory”
  4. “non-mov/cmov/push instructions that read memory”
  5. “instructions [that] both read AND write memory”

The count for category 2 is included in category 1, and for category 4 is included in cateogry 3. Therefore, we don’t need to include categories 2 and 4 in our calculation.

The count for category 5 is instructions that are both in categories 1 and 3. So the calculation of the number of executed instructions that read or write memory is:

      #(instructions that write memory) + #(instructions that read memory) - #(instructions that both read AND write memory)

    

Reading the statistics from gull.txt, the relevant values are:

      instructions that write memory: 694039854
instructions that read memory: 1178240396
instructions that both read AND write memory: 247526991

    

so the count of isntructions that read/write memory is:

      694039854 + 1178240396 - 247526991 = 1624753259

    

and from gull.txt, the total number of instructions executed is 4467170421, so the portion of instructions executed that read/write memory is:

      1624753259/4467170421 =~ 0.3637

    

the underlying tool

Our tool is based on processing a list of executed instructions, so it does not count instructions that are never executed, even if they are present in the executable/library.

Although it is not necessary to complete this lab (or the following homework), you can download our tool here [last updated 2020-09-14]; it has only been tested on x86-64 Linux and requires valgrind and objdump (from GNU binutils) to be installed. (I believe both of these are present on department machines like portal.cs.virginia.edu.)