This semester we will use SIMD (single instruction multiple data) instructions in several assignments. These are sets of instructions that operate on wide registers called vectors. For our assignments, these vectors will generally be 256 bits wide, though you may occassionally use the 128-bit versions. Typically, instructions that act on these wide registers will treat it as an array of values. They will then perform operations independently on each value in the array. In hardware, this can be implemented by having multiple ALUs that work in parallel. As a result, although these instructions perform many times more arithmetic than “normal” instructions, they can be as fast as the normal instructions.

Generally, we will be accessing these instructions using “intrinsic functions”. These functions usually correspond directly to a particular assembly instruction. This will allow us to write C code that accesses this special functionality consistently without losing all the benefits of having a C compiler.

Intrinsics reference

The intrinsic functions we will be using are an interface defined by Intel. Consequently, Intel’s documentation, which can be found here is the comprehensive reference for these functions. Note that this documentation includes functions corresponding to instructions which are not supported on lab machines. To avoid seeing these be sure to check only the boxes labelled “AVX”, “AVX2” and “SSE” through “SSE4.2” on the side.

Intel’s reference generally describes the instructions in psuedocode that uses notation like

      a[63:0] := b[127:64]


to represent assigning bits 64 to 127 (inclusive) of a vector b to bits 0 to 63 of a vector a.

Header files

To use the intrinsic functions, you need to include the appropriate header file. For the intrinsics we will be using this is:

      #include <smmintrin.h>
#include <immintrin.h>


Representing vectors in C

To represent 256-bit values that might be stored in one of these registers in C, we will use one of the following types:

Since each of these is just a 256-bit value, you can cast between these types if a function you want to use expects the “wrong” type of value. For example, you might want to use a function meant to load floating values to load integers. Internally, the functions that expect these types just manipulate 256-bit values in registers or memory.

128-bit versions of types and intrinsics

There are also 128-bit vector types and corresponding instructions. To use this, for the most part you can replace __m256 with __m128 in the type names and _mm256_ with _mm_ in the type name.

In some cases, only a 256-bit version of an instruction will exist.

Setting and extracting values

If you want to load a constant in a 128-bit value, you need to use one of the intrinisc functions. Most easily, you can use one of the functions whose name starts with _mm_setr. For example:

      __m256i values = _mm256_setr_epi32(0x1234, 0x2345, 0x3456, 0x4567, 0x5678, 0x6789, 0x789A, 0x89AB);


makes values contain 8 32-bit integers, 0x1234, 0x2345, 0x3456, 0x4567, 0x5678, 0x6789, 0x789A, 0x89AB. We can then extract each of these integers by doing something like:

      int first_value = _mm256_extract_epi32(values, 0);
// first_value == 0x1234
int second_value = _mm256_extract_epi32(values, 1);
// second_value == 0x2345


Note that one may only pass constant indices to the second argument of _mm256_extract_epi32 and similar functions.

Loading and storing values

To load an array of values from memory or store an array of values to memory, we can use the intrinsics starting with _mm256_loadu or _mm256_storeu:

      int arrayA[8];
_mm256_storeu_si256((__m128i*) arrayA, values);
// arrayA[0] == 0x1234
// arrayA[1] == 0x2345
// ...

int arrayB[8] = {10, 20, 30, 40, 50, 60, 70, 80};
values = _mm256_loadu_si256((__m128i*) arrayB);
// 10 == arrayB[0] == _mm256_extract_epi32(values, 0)
// 20 == arrayB[1] == _mm256_extract_epi32(values, 1)
// ...



To actually perform arithmetic on values, there are functions for each of the supported mathematical operations. For example:

      __m256i first_values =  _mm256_setr_epi32(10, 20, 30, 40);
__m256i second_values = _mm256_setr_epi32( 5,  6,  7,  8);
__m256i result_values = _mm256_add_epi32(first_values, second_values);
// _mm_extract_epi32(result_values, 0) == 15
// _mm_extract_epi32(result_values, 1) == 26
// _mm_extract_epi32(result_values, 2) == 37
// _mm_extract_epi32(result_values, 3) == 48


Different types of values in vectors

The examples treat the 256-bit values as an array of 8 32-bit integers. There are instructions that treat in many different types of values, including other sized integers or floating point numbers. You can usually tell which type is expected by the presence of a something indicating the type of value in the function names. For example, “epi32” represents “8 32-bit values” in an __m256 or “4 32-bit values” in an __m128 (The name stands for “extended packed integers, 32-bit”.) Some other conventions in names you will see:

Example (in C)

The following two C functions are equivalent

      int add_no_AVX(int size, int *first_array, int *second_array) {
    for (int i = 0; i < size; ++i) {
        first_array[i] += second_array[i];

int add_AVX(int size, int *first_array, int *second_array) {
    int i = 0;
    for (; i + 8 <= size; i += 8) {
        // load 256-bit chunks of each array
        __m256i first_values = _mm_loadu_si256((__m256i*) &first_array[i]);
        __m256i second_values = _mm_loadu_si256((__m256i*) &second_array[i]);

        // add each pair of 32-bit integers in the 256-bit chunks
        first_values = _mm256_add_epi32(first_values, second_values);
        // store 256-bit chunk to first array
        _mm_storeu_si256((__m256i*) &first_array[i], first_values);
    // handle left-over
    for (; i < size; ++i) {
        first_array[i] += second_array[i];


Selected handy intrinsic functions:



Set constants

Extract parts of values

Convert between types of values

Rearrange 256-bit values

Rearrange 128-bit values

Example (assembly instruction)

The instruction

      paddd %xmm0, %xmm1

takes in two 128-bit values, one in the register %xmm0, and another in the register %xmm1. Each of these registers are treated as an array of two 64-bit values. Each pair of 64-bit values is added together, and the results are stored in %xmm1.

For example, if %xmm0 contains the 128-bit value (written in hexadecimal):

      0x0000 0000 0000 0001 FFFF FFFF FFFF FFFF 


and %xmm1 contains the 128-bit value (written in hexadecimal):

      0xFFFF FFFF FFFF FFFE 0000 0000 0000 0003 


Then %xmm0 would be treated as containing the numbers 1 and -1 (or 0xFFFFFFFFFFFFFFFF), and %xmm1 as containing the numbers -2 and 3. paddd would add 1 and -2 to produce -1 and -1 and 3 to produce 2, so the final value of %xmm1` would be:

      0xFFFF FFFF FFFF FFFF 0000 0000 0000 0002


If we interpret this value as an array of two 64-bit integers, then that would be -1 and 2.