CS 3330: Lab 8: Timing Real Systems

This page does not represent the most current semester of this course; it is present merely as an archive.

Most computers have some access to a counter that increments every single clock cycle. The FFTW project has a C header file that can give access to this on many platforms and operating systems: http://www.fftw.org/cycle.h (also a local copy if fftw.org is unavailable for some reason).

Setup

Tricks to timing

The basic idea behind timing code is simple:

TimeType before = getTime();
// code you want to time here
TimeType after = getTime();
ElapsedTimeType elapsed = subtract(after, before);

However, there are some nuances to consider:

The resolution of high-resolution timers is not easy to determine. Some run at the processor clock cycle (e.g., a 2.4 GHz processor would have 2,400,000,000 ticks per second) but that can change due to frequency scaling where the clock is slowed down to save power. Some return microsecond or nanosecond counts, but only update the counts ever few hundred microseconds so you get a mix of zero-duration and big-duration measurements. And some run at some other fixed frequency, like 1GHz.

cycle.h does its best to get a timer with at least GHz resolution; it works on all IA32, IA64, and PowerPC chips and operating systems that I've tried it on so far.
High-resolution timers are typically kept per-core. If the OS moves your process from one core to another then all bets are off.
The OS periodically interrupts running processes in an action called a context switch. These always happen for system calls (e.g., input and output, like using printf or cout) but also happen periodically as the operating system runs.
Virtualization can preserve timing, but it generally does not do so. Run code natively for this lab.

cycle.h should work in
- AIX v. 4+
- alpha
- Cray UNICOS
- DEC C
- gcc and clang on any Intel architecture
- icc/ecc
- HP/UX IA64
- MacOS/Mach (Darwin)
- mingw
- MIPS ZBus
- PA-RISC
- PGI
- PowerPC
- S390
- SGI/Irix
- Solaris
- sparc v9
- Visual C++
It should usually work with no extra effort; if it has trouble, look at the cycle.h source to see what needs to be #defineed for your OS. (e.g., for Solaris you need to have the line #define HAVE_GETHRTIME before the line #include "cycle.h").

How to time code

We'll explore timing our systems in this lab. Usign cycle.h, the basic timing design becomes

ticks before = getticks();
// code you want to time here
ticks after = getticks();
double elapsed = elapsed(after, before);

To help us out, we'll want a few methods for keeping arrays sorted:

void keepBiggest(double* arr, int cap, int* size, double val) {
    int i;
    if ((*size) == cap && val < arr[cap-1]) return;

    if ((*size) < cap) arr[(*size)++] = val;
    else arr[(*size)-1] = val;
    
    for(i=(*size)-2; i>=0; i-=1) {
        if (arr[i] < arr[i+1]) {
            val = arr[i];
            arr[i] = arr[i+1];
            arr[i+1] = val;
        }
    }
}

void keepSmallest(double* arr, int cap, int* size, double val) {
    int i;
    if ((*size) == cap && val >= arr[cap-1]) return;

    if ((*size) < cap) arr[(*size)++] = val;
    else arr[(*size)-1] = val;
    
    for(i=(*size)-2; i>=0; i-=1) {
        if (arr[i] > arr[i+1]) {
            val = arr[i];
            arr[i] = arr[i+1];
            arr[i+1] = val;
        }
    }
}

The big/small code can be used as, e.g.:

double a[16]; int size=0; 
for(i=0; i<1000; i+=1) keepBiggest(a, 16, &size, number[i]);
for(i=0; i<*size; i+=1) printf("The %2d-biggest number was %f\n", i, a[i]);

What are you running?

We'll be measuring both operating system delays and hardware delays. Here's how to find out what you are running on the three most popular families of operating systems today:

Linux

You can learn about your operating system by typing

uname -a

and about your hardware by typing

cat proc/cpuinfo

Windows

In the control panel is an icon named "System" which will tell you about your OS and hardware.

OS X

The Apple menu has an item "about this mac" which give some information; more is available through the "More info…" link.

How long is a context switch?

If we just call getticks() over and over again, most of the time we'll see a delay corresponding to the overhead of measuring time itself. For example, on one of my computers I generally see that the following:

ticks before = getticks();
ticks after = getticks();
double elapsed = elapsed(after, before);
printf("%.1f\n", elapsed);

prints 18.0. But sometimes more time elapses; repeating that same code a few hundred thousand times in a loop and using keepBiggest, I find there are many 56.0 results and a few much bigger: several near 7,500, a few near 50,000, and once in a while one over 200,000. In general,

The smallest numbers are the cost of just running the timing code
The small-but-not-smallest are caused by pipeline stalls, branch mis-predictions, needing to move a register into/out of memory, or the like
The smallest big numbers are usually the cost of the OS taking control, noticing no other process needs to run, and then resuming your code
Really big numbers are the OS taking control, then running some other process before resuming yours.

Deliverable `lab8-switch.txt`

Create a file named lab8-switch.txt, which you should submit, in which you put three lines:

Your Name (compID)
Description of computer
smallest biggest-small smallest-big biggest

For example, I might put

Luther Tychonievich (lat7h)
Desktop Core i5 model 30 @ 2.67GHz running Linux 3.2.0-77-generic IA64
18 56 7434 496008

You don't have to be as explicit in describing your computer as I was...

If you want to run this on several systems, put a blank line between each in your file; for example, the computers I had in my office on 19 March 2015 were:

Luther Tychonievich (lat7h)
Desktop Core i5 running Linux 3.2.0-77-generic IA64
18 56 7434 496008

Laptop Core i7-3632QM running Windows 8.1 IA64
183 373 2030 31234

Laptop Core i5-3320M running Linux 3.11-2-amd64
18 50 300 21834

How can we get good timing of a function?

Because it is difficult to predict when a context switch will happen, or even if one has happened, to time code we often use what the textbook calls the "k-best" approach: we run the code many times, timing it each time, and report not the very fastest (which could happen because we swapped to a core that had a different timer) but the 3rd or 4th best time.

We'll use 100 tries, keeping the 4th-best (index 3). This can be done easily by using keepSmallest with capacity 4.

Try it out. What is the k-best time for f1? How about f2?

double data[1<<16];

void f1() {
    int i; double result = 0;
    data[0] = 0; 
    for(i = 0; i < (1<<16); i += 4)
        result += data[i];
    data[0] = result;
}
void f2() {
    int i; double result = 0;
    for(i = 0; i < (1<<16); i += 4)
        result += data[i];
}

Most computers will give very different answers for these two (for example, I get 40937 for f1, but only 20 for f2). Why?

Suggestion: function pointers

Because you'll want to time a lot of code using the k-base approach, I'd suggest making one or more timineg methods that accept functions pointers; for example, the following function times a single-argument function just once (you'd want to time 30 times and report the 4th-best time)

double timer1arg(void (* f)(int), int a) {
    ticks before, after;
    before = getticks();
    f(a);
    after = getticks();
    return elapsed(after, before);
}

and assuming you defined

void example(int x) { ... }

then you can use the timer1arg method like

double timeTaken = timer1arg(&example, 3330);

Again, I don't think timer1arg is particularly useful as it stands, but with a little modification it can be turned into a k-best timer, which will simplify the next part of your code a lot.

Measuring the Cache

The cover of our textbook (also found as figure 6.43 on page 623) is what the authors of the text call the "memory mountain." It is based on a simple program idea: a loop that accesses elements of a vary large array with a given stride. Along the back-right are the 4K arrays, in the front-left are the 64M arrays. Along the back-left we access every double in the array, along the front-right we access only every 64th element. In other words,

void toMeasure(int size, int stride) {
    int i; double result = 0;
    data[0] = 0; 
    for(i = 0; i < size; i += stride)
        result += data[i];
    data[0] = result;
}

is evaluated for every power-of-2 size and every less-than-64 stride. The vertical axis is scaled to be read throughput: megabytes actually read per second, which will be (array size × 8 bytes per double ÷ stride) ÷ (2²⁰ bytes × time used in seconds). Since we don't want to worry about computing seconds on many platforms, we'll use an unlabeled vertical axis instead.

Bigger sizes will help us find capacity misses of different size caches. Bigger strides will demonstrate conflict misses and/or set saturation.

Important: you MUST initialize the data arrays with non-zero values! If you don't, the virtual memory system will send all reads to the same byte.

Have your program create an output file something-about-your-computer.data containing one measurement per line, where each line is a step size, memory size, and throughput separated by spaces, like 1 268435456 7497.0 or 64 65536 6101.2. There should be the same set of strides for every size. A simple example file would be

1 4096 123.45
2 4096 234.56
1 8092 321.09
2 8092 94.15

Strides should be measured in double values (i.e, stride 2 means reading every other double, or every 16th byte). Sizes should be measured in bytes (i.e., an array of 512 doubles has size 512*sizeof(double) = 4096). Small memory sizes tend to work best if you did the large memory sizes first to warm up the cache.

Once you submit, you should be given a link to a graph of the memory mountain your submission creates. Other people's memory mountains can be found in the gallery.

The textbook image uses strides 1 through 64 doubles (8 through 256 bytes) and sizes 4KB through 64MB (512 through (1<<13) doubles). I found their set of strides worked fine, but I needed to use much larger sizes (between 1<<12 and 1<<25 doubles) to see caching effects. Your systems may need their own sizes tweaked.

You'll submit not only the data files, but also the code that created them.

Submit

Submit a C (or C++) program that does your timing in a file named timing.c (or timing.cpp). Submit your lab8-switch.txt as described in the section on context switches. Submit two or more .data files with cache mountains for different machines (e.g., the lab machine and your personal computer). All submission go on the normal submission site.