CS 3330: Performance labs and homeworks

This page is for a prior offering of CS 3330. It is not up-to-date.

The perf labs are working labs; there are no separate lab and homework assignments.

The associated homeworks are both the same file and are submitted as the same submission entry. However, they have distinct due dates. I’ll take a snapshot of the submissions at rotate’s due date and use it to compute rotate’s score.

You may work with a single partner for both smooth and rotate, or may work alone.

Q: Can I switch partners between rotate and smooth?
A: It can be done, but you’ll need to talk to Prof Reiss so he can fix the grading script to handle your swap.

Q: Can I have a partner for rotate and then work solo for smooth?
A: Yes

Q: Can I work solo for rotate and then have a partner for smooth?
A: It can be done, but you’ll need to talk to Prof Reiss so he can fix the grading script to handle your joining up.

Q: Can we work in a group of three?
A: No

Q: Can my partner and I be from different lab sections?
A: Yes, if you can both attend the same lab section for this assignment.

The following kinds of conversations are permitted with people other than your partner or course staff:

drawings and descriptions of how you want to handle memory accesses across an image
code snippets explaining optimizations, but only for code unlike the smooth and rotate problems we are discussing
discussion of the code we provide you: how RIDX works, what the naive implementations are doing, etc.

1 Set Up

Download perflab-handout.tar on Linux (last modified 2017-04-06).
Run tar xvf perflab-handout.tar to extract the tarball, creating the directory perflab_handout.
Edit kernels.c; on lines 12 through 20 you’ll see a team_t structure, which you should initialize with your name and email (and your partner’s, if you have one), as well as what you’d like your team to be called in the scoreboard.
kernels.c is the only file you’ll get to submit, though you are welcome to look at the others or even modify them for debugging purposes.

2 Overview

This assignment deals with optimizing memory-intensive code. Image processing offers many examples of functions that can benefit from optimization. In this lab, we will consider two image processing operations: rotate, which rotates an image counter-clockwise by 90°, and smooth, which smooths or blurs an image.

For this lab, we will consider an image to be represented as a two-dimensional matrix M, where M_i,j denotes the value of the (i, j)th pixel of M. Pixel values are triples of red, green, and blue (RGB) values. We will only consider square images. Let N denote the number of rows (or columns) of an image. Rows and columns are numbered, in C-style, from 0 to N − 1.

2.1 Rotate

Given this representation, the rotate operation can be implemented quite simply as the combination of the following two matrix operations:

Transpose: For each (i, j) pair, M_i,j and M_j,i are interchanged.
Exchange rows: Row i is exchanged with row N − 1 − i.
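
As a rough illustration, the two steps can be sketched in C as follows. This is not the handout's code: the function names rotate_two_step and rotate_direct and the scratch buffer tmp are assumptions made for this example, while pixel and RIDX copy the definitions described in section 2.3.1. The second function shows that the two steps collapse into a single index mapping, dst[N−1−j][i] = src[i][j].

```c
/* Copies of the pixel and RIDX definitions from section 2.3.1. */
typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))

/* Hypothetical helper: rotate via the two steps described above.
   tmp is caller-provided scratch space holding dim*dim pixels. */
void rotate_two_step(int dim, const pixel *src, pixel *tmp, pixel *dst)
{
    int i, j;

    /* Step 1, transpose: tmp[i][j] = src[j][i]. */
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            tmp[RIDX(i, j, dim)] = src[RIDX(j, i, dim)];

    /* Step 2, exchange rows: row i of dst is row dim-1-i of tmp. */
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = tmp[RIDX(dim - 1 - i, j, dim)];
}

/* Hypothetical helper: the same result in one pass, using
   dst[dim-1-j][i] = src[i][j]. */
void rotate_direct(int dim, const pixel *src, pixel *dst)
{
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}
```

Both functions send the top-right pixel of the source to the top-left of the result, which is what a 90° counter-clockwise rotation should do.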

This combination is illustrated in the following figure:

2.2 Smooth

Smooth is designed to benefit from different optimizations than rotate.

The smooth operation is implemented by replacing every pixel value with the average of all the pixels around it (in a window of at most 3 × 3 centered at that pixel).

Consider the following image:

In that image, the value of pixel M2[1][1] is

$$M2[1][1] = \frac{1}{9} \sum_{i=0}^{2} \sum_{j=0}^{2} M1[i][j]$$

In that image, the value of M2[N-1][N-1] is

$$M2[N-1][N-1] = \frac{1}{4} \sum_{i=N-2}^{N-1} \sum_{j=N-2}^{N-1} M1[i][j]$$
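
A minimal sketch of this averaging for a single pixel is given below, under a few assumptions: the name smooth_pixel_sketch is hypothetical, the window is clipped at the image edges (so an interior pixel averages 9 values, an edge pixel 6, and a corner pixel 4), and each color channel is averaged separately with integer division. It is meant to make the definition concrete, not to reproduce the provided naive implementation.

```c
/* Copies of the pixel and RIDX definitions from section 2.3.1. */
typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))

/* Hypothetical helper: average of the pixels in the window of at most
   3x3 centered at (i, j), clipped at the image edges. */
pixel smooth_pixel_sketch(int dim, int i, int j, const pixel *src)
{
    int row_lo = (i > 0) ? i - 1 : 0;
    int row_hi = (i < dim - 1) ? i + 1 : dim - 1;
    int col_lo = (j > 0) ? j - 1 : 0;
    int col_hi = (j < dim - 1) ? j + 1 : dim - 1;
    unsigned red = 0, green = 0, blue = 0, count = 0;
    int ii, jj;
    pixel result;

    /* Accumulate each channel over the clipped window. */
    for (ii = row_lo; ii <= row_hi; ii++) {
        for (jj = col_lo; jj <= col_hi; jj++) {
            const pixel *p = &src[RIDX(ii, jj, dim)];
            red += p->red;
            green += p->green;
            blue += p->blue;
            count++;
        }
    }

    /* Assumption: integer division, so each average is truncated. */
    result.red = (unsigned short)(red / count);
    result.green = (unsigned short)(green / count);
    result.blue = (unsigned short)(blue / count);
    return result;
}
```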

2.3 Code

2.3.1 Structures we give you

A pixel is defined in defs.h as

typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

Images are provided as flattened, row-major arrays and are indexed with the RIDX macro, defined as

#define RIDX(i,j,n) ((i)*(n)+(j))

The pixel at row index1 and column index2 is then accessed with the code nameOfImage[RIDX(index1, index2, dimensionOfImage)].
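
To make the memory layout concrete, here is a small sketch. The helper names column_step_bytes and sum_red_in_row are hypothetical and are not part of the handout. The point is that stepping j by one moves to the next array element, while stepping i by one jumps n elements ahead. On a typical platform where unsigned short is 2 bytes, sizeof(pixel) is 6, so one step down a column of a 32 × 32 image is 192 bytes.

```c
#include <stddef.h>

/* Copies of the pixel and RIDX definitions given above. */
typedef struct {
    unsigned short red;
    unsigned short green;
    unsigned short blue;
} pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))

/* Hypothetical helper: byte distance between img[RIDX(i, j, n)] and
   img[RIDX(i + 1, j, n)], i.e. one step down a column. */
size_t column_step_bytes(int n)
{
    return (size_t)n * sizeof(pixel);
}

/* Hypothetical helper: sum the red channel of row i. Consecutive values
   of j touch consecutive array elements, so this walks memory in order. */
unsigned long sum_red_in_row(const pixel *img, int i, int n)
{
    unsigned long total = 0;
    int j;

    for (j = 0; j < n; j++)
        total += img[RIDX(i, j, n)].red;
    return total;
}
```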

All images will be square and have a size that is a multiple of 32.

2.3.2 What you should change

In kernels.c you will see several rotate and several smooth functions.

naive_rotate and naive_smooth should not be changed. We will compare your code to the original naive code; if you change the naive code you won’t be able to tell how well you are doing.
You may add as many other rotate and smooth methods as you want. You should put each new optimization idea in its own method: rotate_outer_loop_unrolled_3_times, smooth_with_2_by_3_blocking, etc. The driver will compare all your versions as long as you register them in the register_rotate_functions or register_smooth_functions methods.

3 Driver

The source code you will write will be linked with object code that we supply into a driver binary. To create this binary, you will need to execute the command

unix> make driver

You will need to re-make driver each time you change the code in kernels.c. To test your implementations, you can then run the command:

unix> ./driver

The driver can be run in four different modes:

Default mode, in which all versions of your implementation are run.
Autograder mode, in which only the fastest rotate and smooth functions are displayed.
File mode, in which only versions that are mentioned in an input file are run.
Dump mode, in which a one-line description of each version is dumped to a text file. You can then edit this text file to keep only those versions that you’d like to test using the file mode. You can specify whether to quit after dumping the file or if your implementations are to be run.

If run without any arguments, driver will run all of your versions (default mode). Other modes and options can be specified by command-line arguments to driver, as listed below:

-g: Display only the fastest functions (autograder mode).
-f <funcfile>: Execute only those versions specified in <funcfile> (file mode).
-d <dumpfile>: Dump the names of all versions to a dump file called <dumpfile>, one line to a version (dump mode).
-q: Quit after dumping version names to a dump file. To be used in tandem with -d. For example, to quit immediately after printing the dump file, type ./driver -qd dumpfile.
-h: Print the command line usage.

4 About our Testing Server

We will be testing the performance of this program on our machine. We will build your programs with the same compiler as on the lab machines. For this compiler, gcc --version outputs gcc-4.9 (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4. We will compile your submitted files using the options -g -Wall -O2 -std=gnu11.

Our testing server has an Intel Skylake processor with the following caches:

A 32KB, 8-way set associative L1 data cache;
A 32KB, 8-way set associative L1 instruction cache;
A 256KB, 4-way set associative L2 cache, shared between instructions and data;
An 8MB, 16-way set associative L3 cache, shared between instructions and data.

The size of a cache block in each of these caches is 64 bytes.
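
These parameters determine how many sets each cache has, which can matter when reasoning about conflicts between image rows. The arithmetic is sketched below, assuming 1KB = 1024 bytes and 1MB = 1024 × 1024 bytes; the helper names cache_sets and whole_pixels_per_block are hypothetical.

```c
#include <stddef.h>

/* Number of sets = capacity / (associativity * block size). */
size_t cache_sets(size_t capacity_bytes, size_t ways, size_t block_bytes)
{
    return capacity_bytes / (ways * block_bytes);
}

/* Whole pixels that fit in one cache block for a given pixel size. */
size_t whole_pixels_per_block(size_t block_bytes, size_t pixel_bytes)
{
    return block_bytes / pixel_bytes;
}
```

With the numbers above, cache_sets gives 64 sets for either L1 cache, 1024 sets for the L2 cache, and 8192 sets for the L3 cache. If a pixel occupies 6 bytes, a 64-byte block holds 10 whole pixels, with the eleventh straddling two blocks.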

Things about our processor that some students might want to know but probably aren’t that important:

Our processor also has a 4-way set associative 64-entry L1 data TLB, an 8-way set associative 64-entry L1 instruction TLB, and a 6-way set associative 1536-entry L2 TLB (shared between instructions and data).

5 Grading

5.1 Rules

Violations of the following rules will be seen as cheating, subject to penalties that extend beyond the scope of this assignment:

You must not modify or interfere with the timing code
You must not attempt to inspect or modify the network, disk, or other grading platform resources

Additionally, the following rules will result in grade penalties within this assignment if violated:

You must write valid C code.
You should not turn in code that contains print statements.
Your code must work (i.e., rotate must rotate and smooth must smooth, the same functionality as the provided naive implementations) for any image of any multiple-of-32 dimension (32, 64, 96, etc).

5.2 Grading

Speedups vary wildly by the host hardware. I have scaled the grade based on my timing server’s hardware so that particular strategies will get 75% and 100% scores.

Smooth and Rotate will each be weighted as a full homework assignment in the gradebook.

Note (added 12 April): In retrospect, I probably set the 1.6× threshold for rotate a little too low to ensure that you did the kind of solution we wanted, but we’re not going to change it this late. We would recommend, however, aiming for something >1.65× to ensure that you learned from this assignment what we want you to know for the final. (Our suite of example solutions was built for different problem sizes and this turned out to make it not representative.)

Rotate will get 0 points for 1.0× speedups on my computer, 75% for 1.30×, and 100% for 1.6× speedups, as expressed in the following pseudocode:

if (speedup < 1.0) return MAX_SCORE * 0;
if (speedup < 1.3) return MAX_SCORE * 0.75 * (speedup - 1.0) / (1.3 - 1.0);
if (speedup < 1.6) return MAX_SCORE * (0.75 + 0.25 * (speedup - 1.3) / (1.6 - 1.3));
return MAX_SCORE;

Smooth will get 0 points for 1.0× speedups on my computer, 75% for 1.30×, and 100% for 1.97× speedups, as expressed in the following pseudocode:

if (speedup < 1.0) return MAX_SCORE * 0;
if (speedup < 1.3) return MAX_SCORE * 0.75 * (speedup - 1.0) / (1.3 - 1.0);
if (speedup < 1.97) return MAX_SCORE * (0.75 + 0.25 * (speedup - 1.3) / (1.97 - 1.3));
return MAX_SCORE;
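
The two pseudocode fragments differ only in the speedup needed for full credit, so they can be folded into one C function for checking your own numbers. This is a sketch with a hypothetical name (perf_score) and is not the actual grading script; full_credit_speedup would be 1.6 for rotate and 1.97 for smooth.

```c
/* Hypothetical helper mirroring the pseudocode above:
   0 below 1.0x, linear up to 75% at 1.3x, then linear up to 100%
   at full_credit_speedup, and capped at max_score beyond that. */
double perf_score(double speedup, double full_credit_speedup, double max_score)
{
    if (speedup < 1.0)
        return 0.0;
    if (speedup < 1.3)
        return max_score * 0.75 * (speedup - 1.0) / (1.3 - 1.0);
    if (speedup < full_credit_speedup)
        return max_score * (0.75 + 0.25 * (speedup - 1.3)
                                       / (full_credit_speedup - 1.3));
    return max_score;
}
```

For example, under these formulas a rotate speedup of 1.45× falls halfway between the 75% and 100% thresholds and so maps to 87.5% of MAX_SCORE.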

Exact speedups may vary by computer; see this viewing page to view your latest results on my machine. Experience has shown that timing does not change by more than 0.04× upon resubmission. We will also post a scoreboard of everyone’s latest submission times.

5.3 Submission

You will submit only kernels.c. You need to submit it only once per homework, though you will probably want to submit it often to see my server’s timing results.

Submit at the usual place. One of my machines will attempt to time your code and post the speedups I find here. Since running all of the timing for everyone’s code can take a while, expect 30-60 minute delays before your results are posted. I time all of your code and report only the fastest run of each method, so it is in your interest to have several optimizations in your submission; note, however, that if your code runs for more than a couple minutes I will stop running it and post no results.

If you are in a team, it does not matter to me if one or both partners submit so long as at least one does.