Changelog:

Task

  1. We recommend doing this lab on Linux (a department machine or in a VM). If you do not, then see the compatiblity note below.

    If you experience issues with “Illegal instruction” errors either use a department machine (via NX or via SSHing into portal.cs.virginia.edu) or remove the GCC7 and Clang versions of sum from sum_benchmarks.c.

  2. Download the lab tarball here. Extract the tarball. (Last updated 5 April 2019)

  3. Build the lab with make and run the produced program ./sum. This will benchmark 4 different functions that equivalent to the following C function:

    unsigned short sum_C(long size, unsigned short * a) {
        unsigned short sum = 0;
        for (int i = 0; i < size; ++i) {
            sum += a[i];
        }
        return sum;
    }
    
  4. Create a copy of sum_simple.s, which contains a commented assembly implementation of the above sum function, called sum_unrolled2.s. (See below for an explanation.) Modify this copy to rename the sum function to sum_unrolled2 and unroll the loop to handle two elements per iteration. You do not need to handle sizes that are not multiples of 16.

    Add the resulting sum_unrolled2 function to sum_benchmarks.c, recompile with make, then observe how much faster the unrolled version is by running ./sum.

  5. Repeat the previous step, but unroll to handle 4 elements per iteration (using the name sum_unrolled4).

  6. Create a copy of your sum_unrolled4.s called sum_multiple_accum.s. Rename the function in this copy to sum_multiple_accum and modify it to use multiple accumulators. (That is, multiple variables to store intermediate results, like slide 9 of the pipe6 lecture slides.) Make sure you obey the calling convention when choosing where to store the additional accumulators. Add this to sum_benchmarks.c and observe how much faster it is than the unrolled version.

  7. In sum_benchmarks.c, create a copy of sum_C called sum_multiple_accum_C that uses multiple accumulators like in your assembly solution. Compare its performance to the assembly version.

  8. In a text file called times.txt report the performance for the naive assembly version, and each of the versions you created on the largest size tested.

    If possible and you have time, we suggest testing on multiple machines (e.g. your laptop and a department machine).

  9. Run make looplab-submit.tar to create an archive of all your .s files and the text file. Submit this file to kytos. If you are working remotely on a department machine, our guide to using SSH and SCP or other file transfer tools may be helpful.

Files in the tarball

The supplied version of sum

The 4 versions we have supplied are:

Compatability note

OS X requires that function names have an additional leading underscore in assembly. So, the supplied assembly files will not work on OS X. The easiest thing to do is use Linux for the lab (either via SSH or via a VM). Alternately, you can modify the assembly files to add an _ before the function names (e.g. changing sum_simple: to _sum_simple: and .global sum_simple to .global _sum_simple).

If you experience issues with “Illegal instruction” errors, this means that your processor is a little older than we expected. Either use a department machine (via NX or via SSHing into portal.cs.virginia.edu) or remove the GCC7 and Clang versions of sum from sum_benchmarks.c.

Dealing with large output

You can redirect the output of ./sum to a file using something like

      ./sum > output.txt

    

Then open output.txt in a text editor.

No performance improvement?

Make sure you that your loop unrolling implementation does not increment the index i more than needed.

It is possible that, due to the simple nature of addition, that you will hit the latency bound rather quickly:

Appendix: Timing

Cycle counters

The timing code we have supplied uses the rdtsc (ReaD Time Stamp Counter) instruction to measure the performance of the function. Historically, this accessed a counter of the number of processor clock cycles. On current generation processors, where different processor cores have different clock rates and clock rates vary to save power, that is no longer how rdtsc works. On modern systems, rdtsc accesses the number of cycles of a counter that counts at a constant rate regardless of the actual clock speeds of each core. This means that the cycle counter reliably measures “wall clock” time rather than actually measuring the number of cycles taken.

Since clock rates vary on modern processors, measurements of wallclock time do not have an obvious correlation to number of clock cycles. A particular problem are processor features like Intel Turbo Boost or AMD Turbo Core. (These might generally be called “dynamic overclocking”.) In these cases, processor cores briefly operate at faster than the normal maximum clock rate. This means that microbenchmarks like ours my make the processor appear faster than it would be under normal operation — e.g., if we needed to compute sums repeatedly over a period of time. The cycle counter generally counts clock cycles at the “normal” sustained clock rate.

Taking minimums

The function tries to give the approximate minimum timing, ignoring temporary effects like moving arrays into cache or other things running on the system. To do this, it runs the function until:

It then returns the 5th shortest time (ordinarily within .5% of the shortest time).