CS 3330: SIMD lab

This page is for a prior offering of CS 3330. It is not up-to-date.

Note: The .tar file for this lab was updated 13 November around 2PM to fix a bug where tests would occasionally hang.

1 Introduction

In this lab, you will use SIMD (Single Instruction Multiple Data) instructions, also known as vector instructions, available on our lab machines in order to produce more efficient versions of several simple functions.

Vector instructions act on vectors of values. For example

paddw %xmm0, %xmm1

uses 128-bit registers, %xmm0 and %xmm1. But instead of adding two 128-bit integers together, it treats each register as a vector of eight 16-bit integers and adds each pair of 16-bit integers. The instructions we will be using in this lab are part of Intel’s SSE (Streaming SIMD Extensions) additions to the x86 instruction set. (The book, on page 546, also talks about a later Intel extension, AVX (Advanced Vector eXtensions), but AVX is not supported by our lab machines.)

Rather than writing assembly directly, we will have you use Intel’s intrinsic functions to do so. For example, to access the paddw instruction from C, you will instead call the special function _mm_add_epi16. Each of these functions compiles into particular assembly instructions, allowing us to specify that the special vector instructions should be used without also needing to write all of our code in assembly.

You will create vectorized versions of three functions in the lab. (Since this is our first time offering this lab, we are not sure how long this will take to complete. If you don’t finish everything during the lab time, submit whatever you completed.)

We believe we have included all the information you need to complete this lab in this document, but we also have a more reference-like explanation of the Intel intrinsics here.

1.1 A note on compiler vectorizations

Compilers can sometimes generate vector instructions automatically. The Makefile we have supplied for this lab uses optimization settings under which the compiler on our lab machines will not do this. We believe this will also be the case with other compilers, but we have not tested all of them.

The purpose of this lab is to familiarize you with how to use vector operations, so you can deal with more complicated problems where the compiler will not do a good enough job, and so you can better understand what compilers are doing.

2 Lab Files

First download simdlab.tar. (Last updated: 13 November 2017, 2:00pm.) Inside it, you will find the files for this lab.

Run make to build the various executables for this lab.

3 General SIMD reference

We provide a partial reference to the Intel SSE intrinsic functions here, which you may wish to refer to.

In addition, the official Intel documentation provides a comprehensive reference of all available functions. Note that our lab machines only support the SSE functions. So, when using the Intel page, only check the boxes labelled SSE through SSE4.2.

4 Example SIMD function

In this lab, you will be creating optimized versions of several functions using vector instructions. To help you, we have already created an example.

Compile this by running make, then run it by running ./add. You will see benchmark results for the two versions of this function that adds two arrays.

One is this function:

void add_C(long size, unsigned short * a, const unsigned short *b) {
    for (long i = 0; i < size; ++i) {
        a[i] += b[i];
    }
}

The other is a version that accesses vector instructions through special intrinsic functions:

/* vectorized version; the elements can be short or unsigned short */
void add_SSE(long size, unsigned short * a, const unsigned short *b) {
    for (long i = 0; i < size; i += 8) {
        /* load 128 bits from a */
        /* a_part = {a[i], a[i+1], a[i+2], ..., a[i+7]} */
        __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
        /* load 128 bits from b */
        /* b_part = {b[i], b[i+1], b[i+2], ..., b[i+7]} */
        __m128i b_part = _mm_loadu_si128((__m128i*) &b[i]);
        /* a_part = {a[i] + b[i], a[i+1] + b[i+1], ...,
                     a[i+7] + b[i+7]}
         */
        a_part = _mm_add_epi16(a_part, b_part);
        _mm_storeu_si128((__m128i*) &a[i], a_part);
    }
}

4.1 New Types

An __m128i represents a 128-bit value that can be stored in one of the special 128-bit %xmm registers on our lab machines. The i indicates that the 128-bit value contains an array of integers. In this case, they are 16-bit integers, but we can also work with integers of other sizes that fit in 128 bits.

Whenever we want to get or use a __m128i value, we will use one of the special functions whose names begin with _mm. You should not try to extract values more directly. (This may compile, but will probably not do what you expect and may differ between compilers.)

We also have some functions that take a __m128i*. This is a pointer to a 128-bit value which can be loaded into a 128-bit register. When we cast &a[i] to this type we are indicating that we mean 128 bits starting with a[i]. Since each element of a is 16 bits, this means a[i] up to and including a[i+7].

4.2 New intrinsic functions

To manipulate the 128-bit values we use several intrinsic functions:

_mm_loadu_si128(p): returns the 128 bits pointed to by the __m128i* p as a __m128i value (the u stands for unaligned, since p need not point to a 16-byte-aligned address)
_mm_add_epi16(x, y): treats x and y as vectors of eight 16-bit integers and returns the vector of their eight pairwise sums
_mm_storeu_si128(p, x): stores the 128-bit value x to the memory pointed to by the __m128i* p

Each of these functions will always be inlined, so we do not need to worry about function call overhead. Most of these _mm functions will compile into exactly one instruction (as you can see below).

The epi16 part of some function names probably stands for extended packed 16-bit, indicating that it manipulates a vector of 16-bit values.

4.3 Intrinsics and assembly

Typical assembly code generated for add_SSE above looks like:

add_SSE:
  // size <= 0 --> return
  testq %rdi, %rdi
  jle end_loop

  // i = 0
  movl $0, %eax

start_loop:
  // __m128i b_part = _mm_loadu_si128((__m128i*) &b[i]);
  movdqu (%rdx,%rax,2), %xmm0

  // __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
  movdqu (%rsi,%rax,2), %xmm1

  // a_part = _mm_add_epi16(a_part, b_part);
  paddw %xmm1, %xmm0

  // _mm_storeu_si128((__m128i*) &a[i], a_part)
  movups %xmm0, (%rsi,%rax,2)

  // i += 8
  addq $8, %rax
  
  // i < size --> repeat loop
  cmpq %rax, %rdi
  jg start_loop
end_loop:
  ret

(You can see the actual code in add_benchmarks.s.)

(Various details will vary between compilers, and with some optimization settings, compilers might try to perform other optimizations, like loop unrolling.)

Each of the _mm_ functions corresponds directly to an assembly instruction:

_mm_loadu_si128 becomes movdqu (move double quadword, unaligned)
_mm_add_epi16 becomes paddw (packed add of words)
_mm_storeu_si128 becomes movdqu or, as in the listing above, movups

5 Task 1: Sum with Intel intrinsics

The first task will be to create a version of sum:

unsigned short sum_C(long size, unsigned short * a) {
    unsigned short sum = 0;
    for (int i = 0; i < size; ++i) {
        sum += a[i];
    }
    return sum;
}

that uses vector instructions through the intrinsic functions.

We have provided some implementations of this function in sum_benchmarks.c and sum_clang5_O.s. Compile and link them by running make and then run ./sum to compare their performance on various sizes.

You will add a new function to sum_benchmarks.c that uses the vector intrinsics.

Start by making a copy of the provided sum_with_eight_accumulators, which uses eight accumulators:

unsigned short sum_with_eight_accumulators(long size, unsigned short *a) {
    unsigned short partial_sum0 = 0, partial_sum1 = 0, partial_sum2 = 0, partial_sum3 = 0,
                   partial_sum4 = 0, partial_sum5 = 0, partial_sum6 = 0, partial_sum7 = 0;
    for (int i = 0; i < size; i += 8) {
        partial_sum0 = partial_sum0 + a[i+0];
        partial_sum1 = partial_sum1 + a[i+1];
        partial_sum2 = partial_sum2 + a[i+2];
        partial_sum3 = partial_sum3 + a[i+3];
        partial_sum4 = partial_sum4 + a[i+4];
        partial_sum5 = partial_sum5 + a[i+5];
        partial_sum6 = partial_sum6 + a[i+6];
        partial_sum7 = partial_sum7 + a[i+7];
    }
    return partial_sum0 + partial_sum1 + partial_sum2 + partial_sum3 +
           partial_sum4 + partial_sum5 + partial_sum6 + partial_sum7;
}

Rename this copy sum_SSE.

Since the loop performs eight independent additions of 16-bit values, its body can be changed to use a single call to _mm_add_epi16 per iteration.

Then, after the loop, add up these eight partial sums, for example by storing them into a temporary array on the stack, as in the sketch below.
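As a starting point, here is a minimal sketch of what sum_SSE might look like. (This is one possible approach, not the required one; it assumes size is a multiple of 8 and uses _mm_setzero_si128, which returns an all-zeros __m128i.)

/* sketch only; assumes size is a multiple of 8 */
unsigned short sum_SSE(long size, unsigned short *a) {
    /* eight 16-bit partial sums, all initially zero */
    __m128i partial_sums = _mm_setzero_si128();
    for (long i = 0; i < size; i += 8) {
        /* load a[i] through a[i+7] */
        __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
        /* add each of the eight 16-bit values to its partial sum */
        partial_sums = _mm_add_epi16(partial_sums, a_part);
    }
    /* copy the eight partial sums to a temporary array on the stack */
    unsigned short extracted[8];
    _mm_storeu_si128((__m128i*) &extracted[0], partial_sums);
    unsigned short sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += extracted[i];
    return sum;
}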

When you’ve completed this sum_SSE function, add it to the list of functions in sum_benchmarks.c, then run make to compile it. Then compare its performance to the other versions using ./sum.

Also examine the assembly code the compiler generated for your sum_benchmarks.c in sum_benchmarks.s.

(It is also possible to perform the final additions in parallel, without copying to the stack first, but for simplicity, and because it has only a small effect on performance, we will not require that here. A sketch of the idea follows.)
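For the curious, here is one way to do that in-register reduction. (A sketch: _mm_srli_si128(x, n) shifts x right by n bytes, and _mm_extract_epi16(x, 0) returns the low 16-bit element of x.)

/* partial_sums holds eight 16-bit partial sums */
partial_sums = _mm_add_epi16(partial_sums, _mm_srli_si128(partial_sums, 8));
partial_sums = _mm_add_epi16(partial_sums, _mm_srli_si128(partial_sums, 4));
partial_sums = _mm_add_epi16(partial_sums, _mm_srli_si128(partial_sums, 2));
/* the low 16-bit element now holds the total */
unsigned short sum = (unsigned short) _mm_extract_epi16(partial_sums, 0);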

6 Task 2: Vectorized min

Using the same idea as you used to vectorize the sum, create a vectorized version of this min function:

short min_C(long size, short * a) {
    short result = SHRT_MAX;
    for (int i = 0; i < size; ++i) {
        if (a[i] < result)
            result = a[i];
    }
    return result;
}

which you can find in min_benchmarks.c. Create a new version of this that acts on __m128i variables containing eight elements of the array at a time, as sketched below. Some intrinsic functions that will be helpful (you can also refer to our reference page or the Intel documentation):

_mm_set1_epi16(x): returns a vector of eight 16-bit integers, each equal to x
_mm_min_epi16(x, y): treats x and y as vectors of eight signed 16-bit integers and returns the vector of their eight elementwise minimums
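Here is a sketch of one possible vectorized version (min_SSE is a name we chose; the sketch assumes size is a multiple of 8):

/* sketch only; assumes size is a multiple of 8 */
short min_SSE(long size, short *a) {
    /* eight running minimums, all initially SHRT_MAX */
    __m128i mins = _mm_set1_epi16(SHRT_MAX);
    for (long i = 0; i < size; i += 8) {
        __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
        /* keep the smaller value in each of the eight 16-bit lanes */
        mins = _mm_min_epi16(mins, a_part);
    }
    /* copy the eight minimums to the stack, then reduce them */
    short extracted[8];
    _mm_storeu_si128((__m128i*) &extracted[0], mins);
    short result = SHRT_MAX;
    for (int i = 0; i < 8; ++i)
        if (extracted[i] < result)
            result = extracted[i];
    return result;
}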

After adding your vectorized function to min_benchmarks.c and adding it to the list of functions, test it by running make and then ./min.

7 Task 3: Vectorize dot-product

Now let’s vectorize the following function:

unsigned int dot_product_C(long size, unsigned short *a, unsigned short *b) {
    unsigned int sum = 0;
    for (int i = 0; i < size; ++i)
        sum += a[i] * b[i];
    return sum;
}

Note that this function computes its sums with unsigned ints instead of unsigned shorts, so you’ll need to add 32-bit integers instead of 16-bit integers. So, you will have 128-bit values which contain four 32-bit integers instead of eight 16-bit integers. To obtain these, you’ll need to convert the 16-bit integers you read from the array into 32-bit integers; fortunately, there is a vector instruction (and intrinsic function) to do this quickly. To manipulate these as 32-bit integers, you will use functions with epi32 in their names instead of epi16, which correspond to different vector instructions.

Some intrinsic functions which may be helpful:

_mm_cvtepu16_epi32(x): converts (zero-extends) the four unsigned 16-bit integers in the lower 64 bits of x into four 32-bit integers
_mm_mullo_epi32(x, y): treats x and y as vectors of four 32-bit integers and returns the low 32 bits of each of the four products
_mm_add_epi32(x, y): treats x and y as vectors of four 32-bit integers and returns the vector of their four pairwise sums

Like you did with sum, you can add up partial sums at the end by storing them in a temporary array on the stack.

Note that since you are adding vectors of four 32-bit values, your loop will probably act on four elements at a time (even though you can load more at once). For now, don’t worry about loading 128 bits into an __m128i and only using 64 bits of them; a sketch of this approach follows.
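Here is a sketch of what this might look like (dot_product_SSE is a name we chose; the sketch assumes size is a multiple of 4 and that <smmintrin.h> is included for the SSE4.1 intrinsics):

/* sketch only; assumes size is a multiple of 4 */
unsigned int dot_product_SSE(long size, unsigned short *a, unsigned short *b) {
    /* four 32-bit partial sums, all initially zero */
    __m128i partial_sums = _mm_setzero_si128();
    for (long i = 0; i < size; i += 4) {
        /* load 128 bits, though only the low 64 bits (four elements) are used */
        __m128i a_part = _mm_loadu_si128((__m128i*) &a[i]);
        __m128i b_part = _mm_loadu_si128((__m128i*) &b[i]);
        /* zero-extend the four low 16-bit values into 32-bit values */
        __m128i a_ints = _mm_cvtepu16_epi32(a_part);
        __m128i b_ints = _mm_cvtepu16_epi32(b_part);
        /* multiply pairs of 32-bit values and accumulate */
        partial_sums = _mm_add_epi32(partial_sums,
                                     _mm_mullo_epi32(a_ints, b_ints));
    }
    /* copy the four partial sums to the stack and add them up */
    unsigned int extracted[4];
    _mm_storeu_si128((__m128i*) &extracted[0], partial_sums);
    return extracted[0] + extracted[1] + extracted[2] + extracted[3];
}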

After adding your vectorized function to dot_product_benchmarks.c and adding it to the list of functions, test it for correctness by running make and then ./dot_product.

(It’s possible that your first vectorized version will be slower than the original because you are not using multiple accumulators. Although the vector instructions can perform more computations per unit time, they tend to have high latency.)

8 (optional) Task 4: Optimize the vectorized dot-product

Make a copy of your vectorized dot-product function and see how it is affected by applying various possible optimizations. Things you might try include:

using multiple accumulators (several __m128i values of partial sums)
processing eight elements per loop iteration, converting the lower and upper halves of each 128-bit load separately
unrolling the loop further

A sketch combining the first two ideas appears below.
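This replacement loop is one hypothetical combination of the first two ideas (it assumes size is a multiple of 8; _mm_srli_si128(x, 8) shifts x right by 8 bytes, bringing the upper 64 bits into the lower half):

/* sketch only; two independent sets of partial sums */
__m128i partial_sums0 = _mm_setzero_si128();
__m128i partial_sums1 = _mm_setzero_si128();
for (long i = 0; i < size; i += 8) {
    /* one 128-bit load supplies eight 16-bit elements of each array */
    __m128i a_parts = _mm_loadu_si128((__m128i*) &a[i]);
    __m128i b_parts = _mm_loadu_si128((__m128i*) &b[i]);
    /* zero-extend the low four 16-bit values */
    __m128i a_lo = _mm_cvtepu16_epi32(a_parts);
    __m128i b_lo = _mm_cvtepu16_epi32(b_parts);
    /* shift the high 64 bits down, then zero-extend those four values */
    __m128i a_hi = _mm_cvtepu16_epi32(_mm_srli_si128(a_parts, 8));
    __m128i b_hi = _mm_cvtepu16_epi32(_mm_srli_si128(b_parts, 8));
    /* accumulate into two independent accumulators */
    partial_sums0 = _mm_add_epi32(partial_sums0, _mm_mullo_epi32(a_lo, b_lo));
    partial_sums1 = _mm_add_epi32(partial_sums1, _mm_mullo_epi32(a_hi, b_hi));
}
/* combine the two sets of four partial sums before extracting */
__m128i partial_sums = _mm_add_epi32(partial_sums0, partial_sums1);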

See if you can match or beat the performance of the supplied version of dot_product_C compiled with GCC 7.2 with the optimization flags -O3 -msse4.2, or at least make your version faster than the original plain C version, if it wasn’t already. If you are using your laptop, check whether the performance differences on your laptop are consistent with those on the lab machines.

9 Submission

Run make simdlab-submit.tar to create an archive of your C files (called simdlab-submit.tar) and upload it to archimedes.

Please submit whatever you completed within lab time, even if it is less than the whole lab.

Copyright © 2016–2017 by Samira Khan, Luther Tychonievich, and Charles Reiss.
Last updated 2017-10-27 19:08:42