CS 3330 - Draft Text Replacement

© 2016-08-13 Luther Tychonievich

In progress; begun 2016-03-08

1 Warning

This text is in alpha draft mode. At this stage text is added ad-hoc without structure or proof reading. Some of its assertions may be falsehoods or dummy text I have inserted as a temporary placeholder. Do not trust anything this text says without independently verifying it from another, more reliable source.

You have been warned.

2 Systems-level Programming

2.1 C

There are at least three reasons that every software developer should learn C.

  1. C is pervasive. It is used to write some of the largest code bases in the world and some of the most-used tools in existence. If you ever write low-level code, you will almost certainly at least see C.

  2. C is close to, but cleaner than, assembly. There exists a simple, direct mapping between C and assembly, no matter which assembly you are looking at. This makes it a good language for exploring how software and hardware interact.

  3. Almost every other language speaks C. Because C is simple and pervasive, it is hard to find a language that lacks a C interface. If you want to make a library that everyone else can use, write it in C.

There are also reasons that no software developer should use C, notably that the language’s design includes several choices that we have since learned encourage unsafe code. Lack of type safety, for example, is dangerous. D or Rust are probably better choices if you want to write low-level code, but they are neither pervasive nor close to assembly, so we’ll use C.

Aside: the Ternary Operator

Many introductions to programming omit the ternary operator ?:. The expression a ? b : c’s value is the value of b if a is true, otherwise it is the value of c. Thus the ternary operator is an expression-level conditional, as opposed to if which is a statement-level conditional.

2.1.1 C++--

You have learned (some) C++. C is almost a subset of C++: it has the same primitive types (the integer types char, short, int, and long, in both signed and unsigned variants; the floating-point types float and double; and the pointer type void *) and has arrays, functions, and structs; but it does not have classes, methods, or operator overloading.

The lack of classes and operator overloading means most of the C++ libraries you have used do not work in C. There are no new or delete operators, just the functions malloc and free. There is no cout object with its overloaded << operator, just the function printf. And so on.

C is less type-strict than C++. It will implicitly cast all kinds of things for you: pointers and integers can be used interchangeably, for example. This freedom can be confusing, both because it allows looser coding styles than you are used to and because it does not go quite as far as you might expect. For example, under most circumstances int a[] and int *a are interchangeable (they compile to the same assembly), but for a few operations (like sizeof, which is evaluated by the compiler, not the assembler) they behave differently.

C does not have pass-by-reference. If you need by-reference semantics, pass a pointer instead.

We will not spend enough time with C to be worth enumerating or explaining all of its nuances.

2.1.2 -ansi -pedantic-errors

There are many flavors of C. Some of the code we’ll have you write for this class will require a pedantic interpretation of the ANSI specification of the C language. There are two aspects of this standard that experience has taught me trip up C++ programmers:

  1. // ... is not a comment, only /* ... */ is

  2. You cannot mix declarations and code within a function. There must exist a line above which every statement is declaring a variable and below which no statement is declaring a variable. Notably, this means that for(int i = 0; is always wrong; for is not a declaration, so int i cannot follow it.

    It is common for C functions to open with a set of declarations, like so:

    int foo() {
        int a, b; float d;
        b = 0;
        for (a = 0; a < 10; a += 1)
            b += a * a;
        d = sqrtf(b);
        return d * (1 + b);
    }

Most C compilers will enforce these constraints if you give them the -ansi -pedantic-errors flags on the command line.

2.1.3 Optimization flags

Most C compilers have multiple levels of optimization available, specified with -O# flags on the command line. A summary of what the most common ones do:

-O0

No optimizations. All variables are stored in memory; program registers are used only as necessary. Tends to create long code with many unnecessary moves.

-O1

Simple optimizations only. Uses registers where possible and may combine a few operations into one assembly instruction if it can do so, but does not do any major reordering of code. Tends to create the easiest assembly to read.

-O2 or simply -O

Optimize for speed. Will reorder operations, remove unreachable code, add temporaries to store common subexpression results, etc. A sensible default for your own code.

-O3

Aggressive optimizations. Some of these make the code significantly longer in hopes of gaining small speedups.

-Os

Optimize for small code footprint. It may run slower, but the executable file will be smaller.

2.1.4 Target Architecture

In this course we will see both of the currently-common architectures: x86-64 (also called amd64 or x64) and one of the current ARM architectures (in particular ARMv8.1-A). To that end, it is useful to know how to cross-compile C; that is, how to run your compiler on one architecture and generate assembly for a different architecture.

The following command will have clang compile foo.c to ARMv8.1-A assembly file foo.s:

clang -target armv8.1a -mhard-float -S foo.c   # 32-bit
clang -target arm64 -S foo.c                   # 64-bit

Cross-compiling using gcc requires installing a separate gcc build, such as the gcc-arm-none-eabi package available via apt-get. ARM is less supported by GCC than it is by LLVM.

2.1.5 32- vs 64-bit

Most ISAs support numbers of various sizes: 1-byte chars, 2-byte shorts, 4-byte ints, and 8-byte long longs, for example. However, most support only one size for pointers, and that fixed size has become a common identifier for the architecture as a whole. Thus we have 32-bit architectures (meaning they have 32-bit (or 4-byte) pointers) and 64-bit architectures.

Sometimes it is nice to use integers that are the size of a pointer. C offers size_t, a pointer-sized unsigned integer; in most compilers it is available by default, but in some you have to #include <stddef.h> to get access to it.

2.1.6 Calling Conventions

When one function calls another, how does it send in the argument values, and how does it retrieve the return value when the function completes? The answer is that it is just a convention (i.e., nothing stops you from doing something else), but so pervasive a convention that hardware architectures often specify their suggested approach.

Stack-based linkage

One option is for arguments and return values to go on the stack. We make sure that the first argument is on the top of the stack, the next below it, and so on; this order allows variable-arity functions like printf to determine how many arguments there are based on the contents of the earlier arguments. When returning, we push the return value and then return; the calling function can find it by examining the callee's stack addresses.

In practice, this means that to call foo(a,b,c) you push c, then push b, then push a, then jump to the code for foo.

32-bit x86 uses stack-based linkage for arguments. Return values that require more than 4 bytes are placed on the stack and the address to them is returned. Smaller return values use register-based linkage.

64-bit x86-64 uses stack-based linkage only once it runs out of registers to pass arguments in.

Register-based linkage

The other option is to put parameters and return values in registers.

In x86-64, the first six arguments are passed in %rdi, %rsi, %rdx, %rcx, %r8, and %r9; additional parameters spill over into stack-based linkage. In practice, this means that to call foo(a,b,c,d,e,f,g,h) you push h, then push g; and you put a through f into the registers %rdi through %r9; and then you jump to the code for foo.

Return values are handled in a similar way: in x86 (both 32- and 64-bit versions), return values are placed in %eax or %rax.

In 32-bit ARM, arguments are passed in r0, r1, r2, and r3, with additional parameters on the stack, and return values are placed in r0. In 64-bit ARM (AArch64), the registers x0 through x7 play the same roles.

2.1.7 Memory Layout

From the programmer’s point of view, memory is just a huge array of bytes. We’ll explore a lot about this later, but there are a few items to know now.

Alignment

Most memory accesses are for values 4 bytes long or larger. Because of this, memory chip manufacturers have optimized their designs so that if you ask for an address that is a multiple of 4, you get an answer faster than if you ask for an address that is not a multiple of 4.

Not all memory chips have this optimization, and even for those that do, the actual performance is more nuanced than described above; but the trend is strong enough that it is typical to align values to a four-byte boundary. Thus something like struct{ char a; int b; char c; } is likely to be 12 bytes, not 6, in memory.

It is possible to override alignment in C, but rarely wise to do so. Bytes are often cheaper than cycles.

Stack, heap, data, text

The average user program uses memory for four kinds of things: the stack, which stores local variables and function-call bookkeeping; the heap, which stores memory allocated with malloc; the data region, which stores global variables; and the text region, which stores the code itself.

Shared libraries, kernel code, padding

There are also three other regions of memory that exist but are not directly related to code the programmer wrote.

Segments

The different regions of memory mentioned above are more than just conceptual ideas; they have software realization in the operating system’s list of segments. Each segment identifies a range of addresses and what operations are permitted on those addresses. For example, it is common for the text segment to be executable but not writable, the heap segment to be writable but not executable, etc.

Segments are not a perfect match for the conceptual regions. For example, the data and bss regions may be combined into a single segment, or the data region might be split into two segments: one that is writeable and one that is not. String literals are one example of data that is typically stored in a read-only segment of memory.

Segments are defined by software so that their size and placement can be customized to each program’s needs. However, having software check the segment permissions for every memory address before it is accessed is prohibitively time consuming. Thus, hardware has a mechanism for checking permissions of addresses in a way that only occasionally needs software input. We’ll explore more about this hardware/software interaction with memory later.

0xff..ff   kernel
           stack (grows downward)
           (unused)
           shared
           (unused)
           heap (grows upward)
           text, data, bss
0x00..00

Approximate layout of memory as implemented on common Linux systems.

Note that both the stack and heap frequently grow and shrink during program execution. The stack in x86 grows down (i.e., push makes the address in the stack pointer smaller); the heap traditionally grows up (i.e., each malloc generally returns a larger address than the malloc before it).

Part of the reason for the different directions is that the shared code region also sometimes grows during execution due to runtime linking; it is advantageous to have the unallocated stack and heap space adjacent to the shared code region to allow that growth.

2.2 Assembly + Debugging

This section assumes you have already learned the basics of assembly code and debugger usage. It is intended as a review, not a first exposure.

2.2.1 ISA vs Assembly

For much of this course we will be dealing with ISAs (Instruction Set Architectures), meaning the design of the processor at the level of what instructions the processor understands, along with the encoding and semantics of each instruction.

Assembly languages are mostly just mnemonic wrappers around instructions in an instruction set, using abbreviations (like pop %r14) instead of the byte values those abbreviations represent (like 41 5e).

Assembly code offers one feature absent from assembled instructions: labels. In the ISA you can’t say jump to foo, only jump to address 0x123, and addresses tend to change during coding so assembly lets you define labels like foo: to say I’m calling the address of the next thing, wherever it ends up, foo and lets you use label names anywhere the ISA expects an immediate value (e.g., jmp foo).

The previous sentence also raises a vocabulary item: the concept you learned to call literal values in programming is (almost) the same as what assembly and machine code calls immediate values: if a value appears directly in the assembly or instruction encoding, it is an immediate value.

2.2.2 Common Assemblies

Assembly languages, like most parts of computing, are subject to changing fashions and technologies. At the time this document was written (A.D. 2016) there were three main assemblies in use.

We’ll compare how each of the assemblies represents the foo function in the following file:

char c = 0;
int i = 2;
long l[8];

void foo() {
    c += i;
    l[i] += c;
}

x86 Intel-syntax

This is the default in Microsoft toolchains.

Intel-syntax x86 and x86-64 assembly uses the format operation destination, source. It uses the same operation name for all sizes (e.g., mov is used for 8-, 16-, 32- and 64-bit moves). Register names are given in raw form; memory access uses size PTR address format.


    mov eax, DWORD PTR i[rip]
    mov edx, eax
    add dl, BYTE PTR c[rip]
    mov BYTE PTR c[rip], dl
    movsx   rdx, dl
    add QWORD PTR l[0+rax*8], rdx

Information is typically quite dense in assembly; each letter might mean something. For example, the instruction movsx above means move, sign-extending. How much to sign-extend is determined from the operand sizes: from a 1-byte register (dl) to an 8-byte register (rdx).

In Intel syntax, the following names are used for different operand sizes:

bytes  English      assembly  register
1      byte         BYTE      al
2      word         WORD      ax
4      double word  DWORD     eax
8      quad word    QWORD     rax

x86 AT&T-syntax

This is the default in GNU toolchains.

ATT-syntax x86 and x86-64 uses the format operation source, destination. It uses single-letter suffixes on operations to indicate the size of the operands. Register names are prefixed by %, immediates by $; memory is indicated using ().


    movl    i(%rip), %eax
    movl    %eax, %edx
    addb    c(%rip), %dl
    movb    %dl, c(%rip)
    movsbq  %dl, %rdx
    addq    %rdx, l(,%rax,8)

Information is typically quite dense in assembly; each letter might mean something. For example, the instruction movsbq above means move, sign-extending, from byte to quad-word. The size in the operation is also given by the size in the operands.

In AT&T syntax, the following names are used for different operand sizes:

bytes  English    assembly  register
1      byte       b         %al
2      word       w         %ax
4      long word  l         %eax
8      quad word  q         %rax

ARM

There are many ARM assemblies; common as of 2016 are ARMv7 and ARMv8, the latter having both 32- and 64-bit versions. ARM assembly has several differences from x86-64, including

The size of an operation in ARM is given only by (optional) size suffixes on the operation (no suffix means the default width for the architecture). In 32-bit ARM, registers are (mostly) named r (e.g., r0); in 64-bit ARM, 32-bit registers are named w and 64-bit registers x (e.g., w0 and x0). Both variants use 32-bit registers for 16- and 8-bit values as well.

32-bit example:

    ldr   r0, .i
    ldr   r1, .c
    ldr   r0, [r0]
    ldrb  r2, [r1]
    add   r2, r2, r0
    strb  r2, [r1]
    ldr   r1, .l
    and   r2, r2, #255
    ldr   r3, [r1, r0, lsl #2]
    add   r2, r2, r3
    str   r2, [r1, r0, lsl #2]
    bx    lr

64-bit example:

    adrp  x8, i
    ldrsw x8, [x8, :lo12:i]
    adrp  x9, c
    ldrb  w10, [x9, :lo12:c]
    adrp  x11, l
    add   x11, x11, :lo12:l
    ldr   x12, [x11, x8, lsl #3]
    add   w10, w10, w8
    strb  w10, [x9, :lo12:c]
    and   w9, w10, #0xff
    add   x9, x12, w9, uxtw
    str   x9, [x11, x8, lsl #3]

It is worth noting that the two languages are significantly different; while x86-64 is a superset of x86 with a few extra registers and instructions, AArch64 is a redesign compared to AArch32. Also notice the two-step address computation in AArch64, needed because addresses are twice as long as instructions in that architecture.

LLVM

The LLVM family of compilers is gaining popularity and is becoming the default on some platforms. The design of LLVM (an acronym for Low-Level Virtual Machine) is based around an assembly-like intermediate language. LLVM intermediate language assembly is unlike machine architecture assembly in several ways, notably that it includes some typing information and has an unlimited number of registers. You will likely never see LLVM pseudo-assembly in the wild, but in case you are interested I’ve included an example of what it looks like here:

define void @foo() #0 {
  %1 = load i32* @i, align 4, !tbaa !1
  %2 = load i8* @c, align 1, !tbaa !5
  %3 = zext i8 %2 to i32
  %4 = add nsw i32 %3, %1
  %5 = trunc i32 %4 to i8
  store i8 %5, i8* @c, align 1, !tbaa !5
  %6 = sext i8 %5 to i64
  %7 = load i32* @i, align 4, !tbaa !1
  %8 = sext i32 %7 to i64
  %9 = getelementptr inbounds [8 x i64]* @l, i64 0, i64 %8
  %10 = load i64* %9, align 8, !tbaa !6
  %11 = add nsw i64 %6, %10
  store i64 %11, i64* %9, align 8, !tbaa !6
  ret void
}

2.2.3 Low-level Debugging

gdb and lldb

objdump

2.3 Data Revisited

2.3.1 Numbers

This section assumes you have already learned the basics of binary representation of numbers and of floating-point representations. It is intended as a review, not a first exposure.

Binary, Hexadecimal, and Octets

Processor hardware operates in base-2 or binary. It does this because it takes less space and power to distinguish between two voltage states (high and low) than between three or more. Anything we might gain from 3- or higher-state logic we can also gain by using several bits together.

There are two common cluster-of-bits sizes that are also used: the 4-bit nibble or nybble and the 8-bit byte or octet.

Binary

Digits are taken from the set of bits: {0, 1}. Place-value notation is used with base 2. Thus the number 110100000010 in base 2 (or 0b110100000010) represents the number

2^11  2^10  2^9  2^8  2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
  1     1    0    1    0    0    0    0    0    0    1    0

2^11 + 2^10 + 2^8 + 2^1 = 2048 + 1024 + 256 + 2 = 3330

Math works the same in base-2 as it does in base-10:

            1      11      11    1 11    1 11 
  1011    1011    1011    1011    1011    1011
+ 1001  + 1001  + 1001  + 1001  + 1001  + 1001
------  ------  ------  ------  ------  ------
             0      00     100    0100   10100

Hexadecimal

Digits are taken from the set of nibbles: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}. Place-value notation is used with base 16. Thus the number d02 in base 16 (or 0xd02) represents the number

16^2  16^1  16^0
 d     0     2

d × 16^2 + 2 × 16^0 = 13 × 256 + 2 × 1 = 3328 + 2 = 3330

Hexadecimal is particularly noteworthy because it is easy to convert to and from binary: each nibble is simply 4 bits. Thus

d 0 2
1101 0000 0010

Hexadecimal is commonly used by humans who want to communicate in binary without getting lost in the middle of a long string of 0s and 1s.

Math works the same in base-16 as it does in base-10:

            1       1       1    1  1    1  1 
  d0b2    d0b2    d0b2    d0b2    d0b2    d0b2
+ 300f  + 300f  + 300f  + 300f  + 300f  + 300f
------  ------  ------  ------  ------  ------
             1      c1     0c1    00c1   100c1

Octets

1 octet = 2 nibbles = 8 bits. Octets are typically written using pairs of nibbles directly; thus the digits are taken from the set of bytes: {00, 01, 02, … fd, fe, ff}. I have never yet seen a human do math directly in base-256.

Octets are noteworthy because, while processors' mathematical circuits use base-2, most of the rest of computing is based on octets instead. Memory, disk, network, datatypes, assembly: all use bytes, not bits, as the smallest unit of communication.

Base-2 logs and exponents

Power-of-two numbers show up a lot in hardware and hardware-interfacing software. It is worth learning the vocabulary used to express them.

Value base-10 Short form Pronounced
2^10   1024                       Ki   Kilo
2^20   1,048,576                  Mi   Mega
2^30   1,073,741,824              Gi   Giga
2^40   1,099,511,627,776          Ti   Tera
2^50   1,125,899,906,842,624      Pi   Peta
2^60   1,152,921,504,606,846,976  Ei   Exa

In all cases above, the i is usually dropped to just say (e.g.) G instead of Gi. The i clarifies that we mean the base-2 power, not the base-10 power. G could mean either 2^30 or 10^9, numbers that differ by about 7%; which one is meant can only be guessed from context unless the clarifying i is present (which it usually is not).

Only the first three rows in the table above (K, M, and G) will show up very often in this course. We’ll never ask you to know the contents of the base-10 column during this course.

2^27 = 2^7 × 2^20 = 128M. This pattern works for any power of 2: the 1’s digit of the exponent becomes a number, the 10’s digit becomes a letter. Thus

Value Split Written
2^27   2^7 × 2^20   128M
2^3    2^3 × 2^0    8
2^39   2^9 × 2^30   512G

If you memorize the values of 2^0 through 2^9, you can then do these exponentiations in your head. That is worth doing: these values show up too often to be worth deriving every time; memorize them now and you’ll have less cognitive load in the future.

Logarithms with base 2 (written log2(n) or lg(n) or sometimes just log(n)) do the inverse of this: lg(64G) = lg(2^6 × 2^30) = lg(2^36) = 36. Again, these show up so often that you should be able to do them automatically.


Fill in the rest of the following table.

Exponent Written As
17 128K
32

Three kinds of Negatives

How should negative numbers be represented in hardware? Three answers are often taught in courses like this, and three are actually used, but they are not the same three. We’ll explore each in terms of base-10 numbers first, then see how they map to binary.

All of these solutions depend on access to a finite number of bits/digits/nibbles/bytes to store numbers in. Our examples will assume we have four digits.

Two’s Complement

This version picks a number (typically half of the maximum number we can write, rounded up) and decides that that number and all numbers bigger than it are negative. How negative? To understand that, we need to observe a ring-like behavior of numbers.

What is 0000 - 0001 ? The answer is based on place-value subtraction’s notion of borrowing: 0 - 1 is 9 borrow 1. We end up borrowing from all spots to get 0000 - 0001 = 9999 with a stray borrow overflowing and getting lost due to finite precision. Since 0000 - 0001 is obviously negative 1, we set 9999 to be negative 1. One less than negative 1 is negative 2 = 9998, and so on.

Two’s complement is nice because the three most common mathematical operators (addition, subtraction, and multiplication) work the same for signed and unsigned values. Division is messier, but division is always messy.

Two’s complement is used for signed integers.

0000   0   0001  +1   0010  +2   0011  +3
0100  +4   0101  +5   0110  +6   0111  +7
1000  −8   1001  −7   1010  −6   1011  −5
1100  −4   1101  −3   1110  −2   1111  −1

Visualization of two’s complement numbers in binary.

Zero is all zero bits. Adding and subtracting 1 makes numbers one larger or smaller, except at the transition between 01...1 and 10...0.

There is one more negative number than there are positive numbers.


Biased

This version picks a magical fixed number (typically half of the maximum number we can write, rounded down) and calls it the bias. For our numbers (0000 through 9999) that bias would be 4999. We then define the meaning of the digits x to be the value of x − bias. Thus seven is written 5006 and negative seven is written 4992.

Biased numbers are nice in that addition and subtraction and comparison operators (< and > and their friends) work exactly like they do for unsigned numbers. However, multiplication and division are messier and humans typically have some difficulty seeing the value in the representation.

Biased numbers are used for the exponent of floating-point numbers.

0000  −7   0001  −6   0010  −5   0011  −4
0100  −3   0101  −2   0110  −1   0111   0
1000  +1   1001  +2   1010  +3   1011  +4
1100  +5   1101  +6   1110  +7   1111  +8

Visualization of biased numbers in binary.

Zero is 01...1. Adding and subtracting 1 makes numbers one larger or smaller, except at the transition between 1...1 and 0...0.

There is one more positive number than there are negative numbers.

Sign bit

This is the version we all learned in grade school. Negative seven is written exactly like 7, but with a special symbol in the most-significant digit’s place: -7. With our fixed-width constraint, we’d write that as -007 or +007.

The main downside to sign-bit values is that they have two zeros: -000 and +000. These two are written differently but have the same numerical value, so should -000 == +000 be true or false? We can make any decision we want, of course, but ambiguity is not popular.

Sign bits are used in floating-point numbers.

0000   0   0001  +1   0010  +2   0011  +3
0100  +4   0101  +5   0110  +6   0111  +7
1000  −0   1001  −1   1010  −2   1011  −3
1100  −4   1101  −5   1110  −6   1111  −7

Visualization of sign-bit numbers in binary.

There are two zeros and two discontinuities.

The negative numbers move backwards, increasing in value when the unsigned representation decreases.

Sign-bit integers, as displayed here, are not used in any common hardware.

There is also a representation called ones’ complement that is often taught but that is not used by common desktop, laptop, tablet, or phone hardware.

In C and most modern hardware,

Sign Bit

is used only with floating-point numbers and is in the highest-order-bit’s position. 0 means + and 1 means −.


Biased

is used only with the exponent of floating-point numbers. The bias always has the form 011...11.

Two’s Complement

is used for all signed integer types (signed char, short, int, long). If the high-order bit is set, the number is negative; if not, it is non-negative. Negating a signed integer can be done by flipping all the bits and adding 1 to the result.


Fill in the rest of the following table. Assume you are using 6-bit numbers.

Decimal Two’s-C Biased
5 000101 100100
−5 111011 011010
010000

IEEE-style Floating Point

A floating-point number consists of three parts:

  1. A sign bit; 0 for positive, 1 for negative. Always present, always in the highest-order bit’s place, always a single bit.

  2. An exponent, represented as a biased integer. Always appears between the sign bit and the fraction.

    If the exponent is either all 1 bits or all 0 bits, it means the number being represented is unusual and the normal rules for floating-point numbers do not apply.

  3. A fraction, represented as a sequence of bits. Always occupies the low-order bits of the number. If the exponent suggested this was not the normal-case number, may have special meaning.

There are four cases for a floating-point number:


Normalized

The exponent bits are neither all 0 nor all 1.

The number represented by s eeee ffff is ±1.ffff × 2^(eeee − bias). The value 1.ffff is called the mantissa.


Denormalized

The exponent bits are all 0.

The number represented by s 0000 ffff is ±0.ffff × 2^(1 − bias). The value 0.ffff is called the mantissa. Note that the exponent used is 1 − bias, not 0 − bias.


Infinity

The exponent bits are all 1 and the fraction bits are all 0.

The meaning of s 1111 0000 is ±∞.

Not a Number

The exponent bits are all 1 and the fraction bits are not all 0.

The value s 1111 ffff is Not a Number, or NaN. There is meaning to the fraction bits, but we will not explore it in this course.


Complete the following table. Assume 8-bit floating-point numbers with 3 fraction bits.

Fraction Binary Binary Scientific Bits
7 / 8 0.111 1.11 × 2^−1 0 0110 110
−∞ 1 1111 000
1 × 2
−11110000 × 2
× 2 0 0001 000
× 2 0 0000 111
1 × 2^−9
× 2 0 1111 111
× 2 1 0000 000

2.3.2 Aggregates

There are two ways to aggregate types into larger types. You can put multiple values adjacent to one another in memory, or you can store pointers from one value to another.

Bytes and Endianness

Processors handle numbers in binary. A binary number has a high-order bit and a low-order bit, with many bits in between. Sometimes we think of the high-order bit as first (for example, when we write it we write the high-order bit first) and sometimes as last (for example, when we do math we get to the high-order bit last).

Memory handles numbers in bytes, base-256. A byte-string number has a high-order byte and a low-order byte, generally with several bytes in between. Again, first and last are not stable ideas when it comes to numbers.

Memory also handles everything else as bytes, which means it identifies the first and last byte in any byte-string. Thus memory can tell us which byte of a number is first: the answer is not a property of the number, just a property of memory layout. The layout is chosen by the processor when it reads or writes bytes in memory.

x86 and x86-64 have decided that the low-order byte gets the low-order address and is thus first in memory. Network protocols have decided that the high-order byte goes over the wire first and is thus first. Other processors and protocols have made their own decisions on this issue; ARM, notably, can be configured either way.

If the high-order (big value) byte is first, we call the system Big-endian. If the low-order (little value) byte is first, we call the system Little-endian. Both apply only when writing single multi-byte values to memory.

Aside: Arabic Numerals
In English (and most other languages today) we use base-10 and Arabic numerals; thus thirty-five is written 35. The high-order digit is on the left, the low-order digit is on the right. When you put Arabic numbers inside of English text it thus looks big-endian because English is read left-to-right. However, when you put the same number inside of Arabic text it looks little-endian because Arabic is read right-to-left.


Consider the array short x[2] = { 0x1234, 0x5678 };. If &x[0] == 0x108, what byte is in memory in the following locations?

Endianness 107 108 109 10a 10b 10c
Little ? 34 12 78 56 ?
Big ? 12 34 56 78 ?

Observe that only the order of bytes within a single number changes based on endianness; the order of the array elements themselves is the same in both.

Structs and Packing

In C, a struct is a plain-old-data or POD type, meaning it acts (to the degree possible) like a primitive type: passed by value, in registers or on the stack as the compiler chooses. If you force it to be placed in memory (either by using malloc or by requesting its address), a struct will be aligned, which means that it may take more space than its component parts would seem to require.

Most dialects of C have mechanisms for packed structs which bypass the alignment restrictions. Some of these even let you pack multiple values into a single byte.

To force the compiler to align on one-byte boundaries, use a set of pre-processing #pragmas:

#pragma pack(push)
#pragma pack(1)
struct thing {
    char one;
    int four;
    char one2;
};
#pragma pack(pop)

This will make the struct smaller (just 6 bytes long), but also (most likely) slower.

To have sub-byte elements in a struct, use : after the field name:

struct tiny {
    char bit:1;
    char nibble:4;
    short rest:11;
};

This will result in a 2-byte struct where the fields are stored in parts of the bytes. Called bitfields, these are a dense way of storing information but are much slower than ordinary structs because the compiler has to insert various shifts and masks to get at the part of each field being accessed each time it is accessed.

Because structs are treated like primitives, it is worth thinking about how you use them. Because they can be placed in registers, they can be faster than pointed-to data; but because they are passed by value, there is significant overhead to passing large structs to and from functions. Because of that overhead, it is common to allocate larger structs on the heap using malloc and pass them by pointer, not by value.

Arrays and Pointers

An array is simply multiple values of the same type held in contiguous memory locations. The value of an array is the address of its first element. Although C does treat int x[] and int *x differently in a few cases, they are mostly interchangeable.

The syntax x[i] means exactly the same thing as the syntax *(x+i), that is the value in memory at address x + i × sizeof(*x). Note that C does the sizeof-computation for you whether you type x[i] or *(x+i). One side effect of that automatic offsetting by the size of the element is that ((char*)x) + 2 is a different address than is ((int*)x) + 2.
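These identities can be checked directly; pointer_identities is just an illustrative test function:

```c
#include <assert.h>

/* x[i] and *(x+i) are the same expression; pointer + k moves k *elements*. */
void pointer_identities(void) {
    int x[4] = { 10, 20, 30, 40 };

    assert(x[2] == *(x + 2));   /* identical meaning */

    /* +2 means "+2 elements": 8 bytes for an int*, but 2 bytes for a char* */
    assert((char *)(((int *)x) + 2) == (char *)x + 2 * sizeof(int));
    assert((char *)x + 2 != (char *)(((int *)x) + 2));
}
```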

A side effect of the value of an array being the address of its first element is that you cannot pass an array by value. struct and primitive types are passed by value, but arrays are passed by address.

Case Study: 3 kinds of lists

An important consideration when using lists is how to determine how many elements the list contains. This is interesting in part because the decision made by C is not the decision made by most more recent languages.

Let us consider four options:

  1. arrays whose size is fixed at compile time;
  2. length arrays, which store the number of elements alongside the elements;
  3. sentinel arrays, which mark the end of the elements with a special value; and
  4. linked lists, where each element points to the next.

There are other kinds of lists, but these are a reasonable sampling. Lists whose size can be determined at compile time are of limited interest, so this section devotes particular attention to the other three.

Length Array

A length array of TYPE values, also called a dynamic array or an array list, can be stored in C as

typedef struct { size_t length; TYPE *ptr; } lenarr;
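A minimal sketch of creating such a list, taking TYPE to be int for concreteness; lenarr_make is an illustrative helper, not part of the text's interface:

```c
#include <stdlib.h>

typedef int TYPE;
typedef struct { size_t length; TYPE *ptr; } lenarr;

/* Illustrative helper: allocate a length array of n elements. */
lenarr lenarr_make(size_t n) {
    lenarr a = { n, malloc(n * sizeof(TYPE)) };
    return a;
}
```

Finding the number of elements is then O(1): read the stored length field; no scan of the data is required.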

Length arrays have the following properties:

Sentinel Array

The end of a sentinel array is denoted by a special value of the type stored in the array. The most common sentinel array in C is the native string type: "hi" is stored in memory as the array {104, 105, 0} where 104 is the ASCII code for h and 0 (or \0) is the null byte, the designated sentinel value for strings.

A sentinel array of TYPE values is just a TYPE *, but it requires a known sentinel that is guaranteed not to occur in the list except as the sentinel. Besides strings, the most common sentinel arrays store pointers and use a NULL pointer as a sentinel.
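Computing the length of a sentinel array requires scanning for the sentinel, which is exactly what strlen does for strings. A sketch for the char case:

```c
#include <stddef.h>

/* O(n): must walk the array until the sentinel (here '\0') appears. */
size_t sentinel_length(const char *arr) {
    size_t n = 0;
    while (arr[n] != '\0')
        n++;
    return n;
}
```

sentinel_length("hi") returns 2, matching the {104, 105, 0} layout described above.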

Sentinel arrays have the following properties:

Linked List

A (singly) linked list of TYPE values can be stored in C using nodes defined as

typedef struct node { TYPE payload; struct node *next; } node;

A linked list is then passed as a node *.
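A sketch of building and measuring such a list, with TYPE taken to be int and the node type restated so the sketch is self-contained; push_front and list_length are illustrative names:

```c
#include <stddef.h>
#include <stdlib.h>

typedef int TYPE;
typedef struct node { TYPE payload; struct node *next; } node;

/* Prepend a value: O(1), but the head pointer changes. */
node *push_front(node *head, TYPE v) {
    node *n = malloc(sizeof(node));
    n->payload = v;
    n->next = head;
    return n;
}

/* Finding the length is O(n): follow next pointers until the NULL end. */
size_t list_length(const node *head) {
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Note that, unlike a length array, finding the length visits every node.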

Linked lists have the following properties:

2.3.3 Code

In addition to data, we need to be able to encode code as bytes in memory too. There are many ways to do this, both because new hardware enables new operations that need new encoding and because of the burden that is backwards compatibility.

Variable-width or Fixed-width?

In ARM assembly, every instruction is 4 bytes long. In x86-64, some are 1 byte and some more than a dozen bytes long.

With fixed-width instructions like ARM’s, it is easy to know where each instruction begins, simplifying instruction memory hardware. Fixed-width instructions also typically use simple patterns of bits internally to mean the same things in each instruction, meaning less hardware is needed to parse the instructions and determine what they are asking the processor to do. On the flip side, there are limits on how much information you can put in an instruction; notably, some simple instructions like x = 0x12345678; might not be encodable as a single instruction because the action (setting a register value), the register, and the 32-bit constant cannot all be stored in a single 4-byte value.

One of the best-selling books on hardware has made a big deal out of fixed-width being better than variable-width, but it is not black-and-white.

Many or Few?

How many instructions should exist? Obviously we want to be able to request every operation we expect to see (division, storing to memory, etc.) but we also don’t want there to be so many instructions that we have to make hardware to support instructions that very rarely get executed…

This argument seems to have been pretty decidedly won: more instructions is better. Both ARMv8 and x86-64 have more than a thousand distinct instructions. Many textbooks still talk about large instruction sets as bad, but the success of large-instruction-set processors shows this argument to be in error. If you are designing your own low-resource embedded processor you should probably go for simplicity, but those who make processors for hundreds of millions of devices do not do so.

RISC vs CISC

Around 1980 some researchers at Stanford and Berkeley coined a couple of acronyms that have caught on, despite not describing any real hardware. The first was RISC (Reduced Instruction Set Computing), which is what they called their ideal systems; the other was CISC (Complex Instruction Set Computing), meaning the set of things that they didn’t like about commercial hardware of their time.

The usual breakdown of the two terms is as follows; I’ve also added a winner column for which option prevails in currently (2016) popular hardware.

Topic                       | RISC                                      | CISC                               | Winner
Number of instructions      | < 100                                     | > 200                              | CISC
Number of program registers | ≥ 32                                      | ≤ 8                                | RISC?
Program registers           | all general-purpose                       | some have an assigned purpose      | CISC
Cycles per instruction      | 1                                         | varies by instruction              | CISC
Bytes per instruction       | all the same (usually 4)                  | varies by instruction              | ?
Immediate values            | limited (not enough bits to represent all)| unlimited                          | ?
Memory addressing modes     | 1 register, 1 immediate                   | 2 registers, 1 immediate, 1 shift  | CISC?
Mathematical operands       | both must be registers                    | ≤ 1 may be memory                  | ?
Code limitations            | some instructions may not follow others   | no constraints                     | CISC
Conditional operations      | boolean operands                          | condition-code registers           | CISC
Linkage convention          | register-based                            | stack-based                        | RISC

There are also architectures that do have significant design decisions beyond those in this table, such as the compiler-controlled instruction scheduling of the Itanium architecture, but they are beyond the scope of this course.

Examples

TODO: add this part

3 Hardware Design

3.1 Hardware Components

Processor hardware is built out of three main conceptual pieces: wires, gates, and registers. These behave in some ways much like variables, operators, and data structures but in other ways they are distinct. Understanding their semantics is key to understanding hardware design.

3.1.1 Current, Voltage, Power, and Heat

Ignoring many of the nuances of electricity and speaking only in broad generalities,


Current

is the rate at which electrons flow. In practice it is not always electrons that are the conceptual charge carriers and amperage does not map directly to electrons/second, but that is the general idea.

If electrical state changes, current is involved.


Voltage

is the desire electrons have to flow; the electrical pressure, if you will. Voltage does not always cause current because it contends against resistance: not all substances conduct electricity very easily.

Most elements of computer hardware are sensitive to voltage. The 1s are high voltage and 0s are low voltage. How high depends on the underlying design of the components, but chips with highs of around 1 volt are common today (2016). Low voltage is generally close to 0 volts.


Power

is current × voltage. It does not require power to maintain voltage without current, nor to maintain current without voltage.

Since chip state is voltage and change in state requires current, every time a processor changes state it uses power. In practice some power is used even between state changes, since it is hard to maintain voltage in a changeable system without some current leaking.


Heat

is (by the law of entropy) an inevitable byproduct of power. It is also a major concern for chip manufacturers. Chips are so small and the power they use so concentrated that even a small amount of heat can result in a large temperature change. Electrical properties tend to change with temperature, and notably we have trouble making components that function at higher temperatures.

It is thus vital that chip temperature be kept low enough that the chip continues to function. But to reduce heat we must reduce power; and to reduce power we either need to reduce voltage (meaning more sensitive and harder-to-engineer components) or reduce current (meaning less state change and thus a reduction in chip speed).

In this course we will not explore designing for limited power usage, but the outline given above is useful for being conversant in hardware design issues. It also explains why tablets tend to be slower than servers: they have less power to draw from (a battery instead of a connection to the power plant) and less space to devote to heat dissipation (heat sinks, fans, and so on), so they cannot change as much voltage as quickly as servers can.

3.1.2 Wires, Gates, and Muxes

Basic Components


Wires

are physical paths of conductive material, often copper. Wires have very little resistance, meaning it takes almost no current for the voltage across a wire to equalize. It is safe to assume that everything connected to a wire has the same voltage.

Typically wires are connected between one source or driver of voltage and one or more sinks or recipients of voltage. They act in some ways like assignment operators (sink = source) and in some ways like temporary variables

tmp = source; sink1 = tmp; sink2 = tmp; ...

Gates

are particular arrangements of transistors that perform certain fixed functions. The most common are the boolean gates that take one or two bits in and produce one bit out; examples include and, or, xor, not, nand, etc.

Gates operate continuously, in parallel with all other gates, changing as soon as voltage changes on their input wires. Because they operate with finite current, it takes a non-zero amount of time for the output to reflect the change in the input.

The design of individual gates is outside the scope of this class.

Multiplexers (or Muxes)

are arrangements of gates that act as selectors. The simplest mux implements the same operation as the ternary operator in C: output = selector ? input1 : input0. Although the physical arrangement of gates and wires has three inputs (selector, input0, and input1) we typically call only two of them the inputs (input0 and input1).

Being made out of gates, muxes operate continuously, in parallel with all other gates, changing as soon as voltage changes on their input wires, with some finite delay between a change of input and the corresponding change of output.

Muxes can easily be extended to have more than two inputs, provided additional selector bits are also provided. They always have a single output, which is selected from the set of inputs.

A significant part of hardware design at the level we will do it is made out of muxes.

Muxes are traditionally drawn as a trapezoid with the inputs on the long side, the output centered on the short side, and the selectors entering one of the angled sides.

Multi-bit components

We will almost never refer to the basic bit-level components listed above again. Instead, we will focus on multi-bit extensions.

Multi-bit implementation

A multi-bit extension of a wire is just a cluster of wires run next to each other. Eight wires running together can transmit eight bits of information in their voltages. We’ll generally call such a group of wires just a wire or sometimes an 8-bit-wide wire, a wire carrying an 8-bit value, or an 8-bit wire.

A multi-bit mux has multi-bit inputs and output, all the bits of which share a single selector. They can be implemented as a separate 1-bit mux for each bit of output:

out_0 = selector ? in1_0 : in0_0;
out_1 = selector ? in1_1 : in0_1;
out_2 = selector ? in1_2 : in0_2;
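The same per-bit construction can be mimicked in C: replicating the one-bit selector into a full-width mask applies the identical 1-bit mux to every bit position at once.

```c
#include <stdint.h>

/* A 64-bit mux built from bitwise "gates" only:
   out = selector ? in1 : in0, applied to every bit in parallel. */
uint64_t mux64(unsigned selector, uint64_t in0, uint64_t in1) {
    uint64_t mask = -(uint64_t)(selector & 1);  /* all-ones or all-zeros */
    return (mask & in1) | (~mask & in0);
}
```

mux64(1, a, b) yields b and mux64(0, a, b) yields a, just like the selector ? in1 : in0 mux above.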

We’ll use C-like notation for multi-bit gates and other components that emulate C operations like adders. Thus & will mean a bit-wise and, && a logical and, + an adder circuit, and so on.

Transitioning multi-bit values

It is almost never the case that all of the gates on all of the individual wires of a multi-bit component produce output at exactly the same speed. This means that during the transition time, the outputs of logical components cannot be trusted to hold any valid value.

Suppose I have an adder with inputs 0b0001 and 0b0110; the adder is currently producing output 0b0111. Now I change the second input to 0b0111. Eventually the output will change to 0b1000, meaning all four bits change. However, I cannot in general predict the order in which the bits will change; if (for example) the two low order bits change first then there will be a moment in time when the adder is outputting 0b0100. But 0b0100 is not the correct output for either the old or the new inputs.
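The arithmetic of that transitional moment can be modeled directly: combine the old high-order bits with the new low-order bits.

```c
/* Model the moment when only the two low-order output bits have
   switched to the new value while the high bits are still old. */
unsigned transitional_output(unsigned old_out, unsigned new_out) {
    return (old_out & ~0x3u) | (new_out & 0x3u);
}
```

transitional_output(0x7, 0x8) is 0x4: the 0b0100 from the example, correct for neither the old nor the new inputs.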

Not all orders of changes are possible, but it is usually the case that the output of a logical component goes through a period of incorrect values during each transition. It is important, then, to design circuits such that these transitional outputs are ignored and not committed to the programmer-visible state of the processor.

3.1.3 Registers

Consider the following circuit:

[figure: an adder whose two inputs are the constant 0001 and the wire x, and whose output drives x]

As we stated in the previous section, the different bits of the output of the adder appear at different times. This means that if we start with x = 0 we might see a sequence of x values like 0, 1, 0, 2, 1, …: in other words, not the sequence we want to see. To fix this we add a register to the circuit:

[figure: the same adder, but with a register R between the adder’s output and x]

Registers have one input and one output, as well as a second single-bit input called the clock (drawn at the top of R in the image above; the triangle is a traditional marker for a clock input). They store a value inside them, and most of the time they completely ignore their input and output that internal value. However, at the moment when the clock input transitions from low- to high-voltage (and only at that moment) they read their input into their internal stored value.

The clock inputs of all registers across the entire chip are (conceptually) attached to the same clock, an oscillator that outputs low-voltage, then high-voltage, then low-voltage, then high-voltage, …, at predictable intervals.

Adding the register solves the unpredictable output problem. The current value of R and 1 are sent into the adder; its output jumps about a bit but stabilizes at the value R + 1; then the clock changes from 0 to 1 (called the rising clock edge) and the new value is put into R. Once the new value is in, it is output and the adder output starts jumping around again, but that is fine because the register will ignore it until the next rising clock edge.

Clock Speed

The clock is important enough to have become one of the most noted stats about chips: a 2GHz clock speed means the clock oscillator completes 2 billion (2×10^9, not 2×2^30) low-high cycles per second. Because virtually every part of a chip contains registers, and because all registers are attached to the clock, clock speed is a direct indicator of overall processor speed.

So, why not just add a faster oscillator to make the chip faster? There are two main reasons:

  1. More cycles per second means more state changes per second, and thus more power used and more heat generated.
  2. The clock period must be long enough for the slowest path between registers to stabilize; if the clock rises before the outputs settle, transitional (incorrect) values get committed to registers.

Some people choose to overclock their chips. Roughly, they take a chip that was designed to run at one speed and they attach a faster clock to it. Since the shipped clock speed is generally a conservative estimate, sometimes this just works and the processor becomes faster at no extra cost. But sometimes the new clock speed is too fast and the computer overheats and/or fails to compute correctly. Worse yet, often the problem is not immediately manifest, perhaps only overheating after months of use or only making an error in computation on one out of every billion address computations.

3.2 Processors

3.2.1 Steps in executing an instruction

What are the steps involved in a processor executing an instruction? The basic outline is as follows:

  1. Send an address to memory, get an instruction (encoded in bytes) back.
  2. Parse the action being requested by those bytes.
  3. If the instruction has program register operands, read the values of those program registers.
  4. If the instruction does some mathematical or logical operation, perform that work.
  5. If the instruction reads from or writes to memory, do that.
  6. If the instruction updates program register values, write those values into the program registers.
  7. Repeat with the next instruction, either the next in the instruction stream or the one the instruction being executed tells us to jump to.
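The steps above can be sketched as an interpreter loop. The machine below is invented purely for illustration (one nibble of opcode, one nibble of operand, two registers); it is not any real instruction set:

```c
#include <stddef.h>
#include <stdint.h>

enum { HALT = 0, SETLO = 1, ADD = 2 };  /* invented opcodes */

/* Run until HALT; returns the final value of register 0. */
long run(const uint8_t *memory, size_t len) {
    long regs[2] = { 0, 0 };
    size_t pc = 0;
    while (pc < len) {
        uint8_t insn = memory[pc];      /* 1. fetch the instruction bytes  */
        uint8_t op  = insn >> 4;        /* 2. decode: opcode nibble        */
        uint8_t arg = insn & 0xF;       /*    ... and operand nibble       */
        if (op == HALT)
            break;
        else if (op == SETLO)
            regs[0] = arg;              /* 6. write a program register     */
        else if (op == ADD)
            regs[0] += regs[arg & 1];   /* 3-4. read registers, do math    */
        pc += 1;                        /* 7. on to the next instruction   */
    }
    return regs[0];
}
```

Running the three-byte program {0x13, 0x20, 0x00} sets register 0 to 3, adds register 0 to itself, and halts, returning 6.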

There are traditional names for these steps, but they are not necessarily the most obvious names. Further, the set of names is smaller than the set of steps, and the allocation of tasks to the various steps is not uniform. We’ll do our best to both explain the steps clearly and indicate common terminology where possible.

Fetch-Decode-Execute

Almost all hardware design texts that I have encountered refer to three steps named Fetch, Decode, and Execute.

Some also add a few more steps, such as Memory and Writeback.

Despite the commonality of the first three names, there is not an agreed-upon set of work for each. Nonetheless, the names are so common that it is worth knowing approximately what people mean when they use them.

Fetch

Fetch consists of sending the program counter to instruction memory and retrieving the bytes of the instruction it addresses.

Decode

Decode is listed as the second step of computation in every architecture text I have read, but the intersection of the work attributed to the decode steps of the texts I am familiar with is empty.

One view of decode is to parse out the bits and pieces of the instruction’s bytes, and wire them to the appropriate logic for the next step.

Another view of decode says the parsing happened during fetch and decode’s job is to retrieve the values stored in the program register operands used by the instruction.

Execute

Execute always includes any math or logic operations the instructions requested. It typically also includes deciding if a conditional jump should be taken or not.

Sometimes retrieving the values stored in program registers is characterized as part of Execute, sometimes as part of Decode.

Sometimes retrieving values from or sending values to memory is characterized as part of Execute, sometimes as its own Memory step.

Sometimes writing computed or retrieved values into program registers is characterized as part of Execute, sometimes as its own Writeback step.

Actions and Dependencies

Part of the reason for there being several groupings of steps in executing an instruction is that the ordering is only partially constrained by the hardware. Hardware runs in parallel unless you use registers to force it not to, so only actual dependencies between steps impose an order. Many different linear orderings are compatible with a single instruction.

Take, for example, the x86-64 push %rcx command. To execute this we have to read it from memory, parse out what it wants us to do, get the values of %rcx and %rsp from the program register file, decrement %rsp by 8, write to memory at that decremented address, store the decremented value back into %rsp, and compute the address of the next instruction in the program. The order we do these in is partially constrained, but only partially; a dependency graph might look like this:

[figure: a dependency graph with the nodes “Send PC to instruction memory”, “Identify the instruction”, “Parse out operands”, “Compute next PC”, “Read program registers”, “Write stack pointer to register”, “Send next PC to PC register”, “Compute stack pointer”, and “Write memory at stack address”]

In general each instruction has its own graph of this form, and trying to display them all at once is not really feasible. The goal of ordered steps is to say: if you do things in this order, all of the instructions will work out.

Muxes to the rescue

We generally want to avoid having separate hardware for each instruction. Fortunately, muxes let us share logic nicely.

Consider, for example, an adder circuit. Many instructions add: push and pop add to the stack pointer (reg1 ± 8), memory operands often involve adding a register and immediate value together (reg1 + immediate), we add when doing math (reg1 + reg2), and so on. But for any given instruction, only one of those things is happening, so we can use just one adder. For example, if we determine that we always add a register value to something, we might end up with a circuit like this:

[figure: an adder whose first input is reg1 and whose second input comes from a mux choosing among +8, −8, reg2, and immediate, controlled by the instruction]
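In C-like terms, the circuit above is a mux feeding one shared adder; the instruction-kind names below are invented for illustration:

```c
enum kind { PUSH, POP, MEM_OPERAND, MATH };  /* illustrative instruction kinds */

/* One shared adder; the ternary chain plays the role of the mux
   that selects the adder's second input based on the instruction. */
long shared_adder(enum kind k, long reg1, long reg2, long immediate) {
    long second = (k == PUSH)        ? -8
                : (k == POP)         ? 8
                : (k == MEM_OPERAND) ? immediate
                :                      reg2;
    return reg1 + second;  /* the single adder circuit */
}
```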

There are relatively few component actions that a processor carries out; much of the processor design that we will look at will be combining those components with muxes to implement various instructions.

3.2.2 HDLs

Hardware Description Languages (HDLs) are prevalent throughout hardware design. Two are particularly common: VHDL (VHSIC Hardware Description Language; VHSIC stands for Very High Speed Integrated Circuit) and Verilog (whose name suggests its original purpose, supporting the formal verification of digital logic circuit designs). Both can seem quite verbose at the level of detail we will be working at, having extensive support for modeling lower-level details like how adders work as well as ancillary material like how memory operates and how registers react to the clock.

TODO verify the historicity of verilog

We will use a relatively small subset of Verilog in this course, with a few add-ons to simplify common tasks like defining a clocked register.

TODO Verify that I want to do that… I have HCL2D working almost like I want it (some issue with using a built-in output inside a mux); do I want to roll out something new? Teach them enough Verilog to be dangerous?

Basic Verilog


Verilog has two kinds of literal values: sized and unsized.

Apostrophe literals (with or without size) may include underscores as visual separators, which are ignored by Verilog. Thus 'b1101_0000_0010, 'h__d_02, and 'd3330 are all the same value.

A mux is written using the ternary operator:

wire o = s ? i1 : i2;

There are constraints on where muxes can be defined; we’ll generally restrict them to being an initializer for a wire or being one of the values of another mux.

A multi-bit wire is declared with a bracketed range of bit indices:

wire[63:0] name;

The values in the brackets are the legal bit-indices; the first number is the index of the highest-order bit, the last of the lowest-order bit.

D-style flip-flop
reg[63:0] r; 
wire[63:0] r_in; 
always @(posedge clk) r <= r_in;

This is the building-block of most registers: whenever the clock (which needs to have been defined elsewhere) has a positive edge (i.e., goes from 0 to 1) the input is stored into the register.

The output of a flip-flop is just the name of the reg (r in this example). The use of a reg without the corresponding always is beyond the scope of this course.

Pipeline register
reg[63:0] F_pc;  // storage and output signal
wire[63:0] x_pc; // input signal

wire stall_F;    // must be 0 for input to be accepted
wire[63:0] default_pc = 0;
wire bubble_F;   // if 1, uses default_pc instead of x_pc

always @(posedge clk) F_pc <= stall_F  ? F_pc 
                            : bubble_F ? default_pc 
                            : x_pc;

A pipeline register is a three-option mux around a D flip-flop. Like a flip-flop, it waits for the rising clock edge to do anything. On that edge it checks two control bits to decide how to update the register value.

TODO: flesh this out or abandon verilog.

Our HDL

We will use a simplified language based on Verilog to explore the design of hardware. We won’t use Verilog itself in part because we do not want to need to be explicit about rising edges and clock delays and so on and in part because Verilog lets you override the simulator functionality, which we do not want to allow.

Clocked Registers

Our testing harness will automatically generate a clock signal, select its delay based on the complexity of the work being done within each cycle, and attach it to all registers. Thus, to define a rising-edge-triggered register all you need to do is give it a name and a size (it defaults to 1 bit if no size is given):

reg myOneBitRegister;
reg[63:0] my64bitRegister;

The value stored in the register is available directly from the register name, but you cannot set the value stored in the register directly; you’ll need to put a value into the implicitly defined wire registerName_in (of the same width as the reg) and wait for the next rising clock edge to see the results.

Wires and Muxes

Pipeline Registers

Our testing harness provides a built-in function addPipelineControl that takes two 1-bit wires and one or more registers and adds pipeline control signals to those registers, like so:

wire stall_F, bubble_F;
reg[63:0] pc;
reg[3:0] icode;
addPipelineControl(stall_F, bubble_F, pc, icode);

Functionality you don’t have to write

HDL implementation of a simple processor

3.2.3 Pipelining

Assembly Lines

Dependencies

Data vs Control Hazards

Data forwarding

The need to Stall

Pipeline register control logic

Prediction and speculative execution

Limitations of pipelines

3.3 Memory Hierarchy

3.4 Peripherals and Busses

4 Kernels: Software for the Hardware

4.1 Kernel Mode vs. User Mode

All multi-purpose chips today have (at least) two modes in which they can operate: user mode and kernel mode. Thus far you have probably only written user-mode software, and most of you will never write kernel-mode software during your entire software development careers.

4.1.1 Motivation

Some activities the computer does can only be done by kernel-mode software; if user-mode code wants to do them, it has to do so by asking the kernel. This restriction provides several advantages.

Limiting Bugs’ Potential for Evil

One reason to not run in kernel mode is to limit the scope of mistakes a software bug can have. Inside the computer is all of the code and data that handles drawing the screen and tracking the mouse and reacting to keys and so on; I’d rather not be able to make a mistake when writing my Hello World program and accidentally mess up those core functionalities. By running in user mode, if my code tries to touch any of that it will be blocked from doing so: my program will crash, but the computer will keep running.

Each Program Thinks it is In Control

Physical computers have limited physical resources. One of the benefits of running in user mode is that the particular limitations are hidden from you by the kernel and hardware working together. For example, even if you only have 8GB of memory you can still use addresses larger than that limit: the kernel and hardware will work together to provide the illusion that there is enough memory even when there is not.

This illusion is even more important when you are running more than one process simultaneously. Perhaps you have a terminal, an editor, and your developed code all running at once. The kernel and hardware will work together to ensure that your code cannot even see, let alone modify, the memory being used by the terminal or editor. Each program is given the illusion of being the only one running on an unlimited machine.

Wrapper around Hardware

Each piece of hardware attached to a chip has its own protocols and nuances of operation. Fortunately, only the kernel needs to know about those details. If the mouse tries to talk to your program directly, the kernel intercepts the communication and handles it for you so you don’t notice it happened. If you try to talk to the disk, the kernel likewise moderates that communication, keeping you from needing to know the specific protocol the currently-attached disk expects.

Multiple Users

Because the kernel is an unavoidable intermediary for talking to hardware, it can choose to forbid some interactions. One common use of this is to allow multiple user accounts. Since the kernel is the one that controls access to the disk, for example, it can choose not to allow my processes to access the part of the disk it has assigned to you rather than to me.

4.1.2 Implementation

Protected Instructions and Memory

When in user mode, there are a set of instructions (including the change-mode instruction) that cannot be executed and segments of memory that cannot be accessed. Examples include the part of memory that stores information about which user account you have logged into and instructions that send signals out of the chip to communicate with disks, networks, etc.

Mode Bit

The simplest way to create two modes is to have a single-bit register somewhere on the chip that indicates if the chip is currently executing in user mode or in kernel mode. Each protected instruction then checks the contents of that register and, if it indicates user mode, causes an exception instead of executing the instruction. By making the change-mode instruction protected, user-mode code is prevented both from executing instructions it should not execute and from changing itself into kernel mode to bypass this protection.

One consequence of having a protected change-mode instruction is that, while kernel-mode code can change itself into user-mode code freely, user-mode code cannot change back to kernel-mode code on its own. Thus, we need a different way to handle becoming kernel-mode again.

Mode-switch via Exceptions

The core mechanic for becoming the kernel is a hardware exception. Exceptions are different from simply changing a mode bit because they also always jump to some kernel code. Thus, every mechanism that enters kernel mode also causes kernel code to run as the next instruction, preventing user code from running itself in kernel mode.

The nature of these hardware exceptions and the jump to kernel code that is associated with them is a large enough topic to deserve its own section.

4.2 Exceptions

Hardware exceptions are examples of exceptional control flow: that is, the sequence of instructions executed around an exception is an exception to the usual rules of sequential execution and jumps. They also tend to appear in exceptional (that is, unusual) situations.

4.2.1 Kinds

There are several kinds of hardware exceptions.

Interrupts

An interrupt occurs independently from the code being executed when it occurs. It might result from the mouse moving, a network packet arriving, or the like.

Interrupts are typically handled by the kernel in a way that is invisible to the user code. The user code is frozen, the interrupt handled, and then the user code resumed as if nothing had happened.

Faults

A fault is the result of an instruction that in theory could have succeeded failing instead. Examples include dividing by zero, referencing a null pointer, or attempting to execute a protected instruction while in user mode.

There are two basic families of responses to faults: either the kernel can fix the problem, then re-run the user code that generated the fault; or it cannot fix the problem, so it kills the process that caused the fault instead. In practice, fixable faults happen quite often, notably the page fault discussed later. Even faults the kernel can’t fix on its own are often handled by asking the user code if it knows how to fix them using a mechanism called signals, though many programs do not choose to handle the signals so the end result is often still termination of the fault-generating process.

Traps

A trap is an exception caused by a special instruction whose sole job is to generate exceptions. Traps are the main mechanism for intentionally switching from user mode to kernel mode and are the core of all system calls. System calls are used to interact with the file system, spawn and wait on threads, listen for key presses, and anything else that you cannot do with simple code alone.

4.2.2 Handling

Handling exceptions means switching to kernel mode and jumping to kernel code, potentially without any particular instruction causing the jump.

Save-Handle-Restore

The basic mechanism for any exception to be handled is

  1. Save the current state of the processor (register contents, PC location, etc)
  2. Enter kernel mode
  3. Jump to code designed to react to the exception in question, called an exception handler
  4. When the handler is finished executing, enter user mode and restore processor state (including register contents, PC location, etc)

These steps (except for the actual execution of the exception handler) are all done atomically by the hardware, not by software: while saving or restoring machine state the state is busy and no instructions are running.

Which Handler?

In general, there can be as many different exception handlers as there are exception causes. To select which one to run, the hardware consults an exception table: an array of code addresses, indexed by the kind of exception generated (e.g., divide-by-zero uses index 0, invalid instructions use index 6, etc.). The index is called an exception number or vector number.

Switch

This idea of an array of code addresses is not unique to exception handlers; it is also present in C in the form of the switch statement, which could be compiled to a long sequence of conditional jumps but is designed to be compiled into an array of code addresses instead.

For example, the following C

switch(x) {
    case 0: f0();
    case 4: case 1: f1();
    case 2: f2();
    case 3: f3(); break;
    case 5: f5();
    case 6: f6();
}

compiles down to the following code

    cmpl    $6, %eax
    ja      AfterSwitch
    jmpq    *Table(,%rax,8)
Case0:
    callq   f0
Case41:
    callq   f1
Case2:
    callq   f2
Case3:
    callq   f3
    jmp     AfterSwitch
Case5:
    callq   f5
Case6:
    callq   f6
AfterSwitch:

supported by the following jump table

    .section    .rodata,"a",@progbits
Table:
    .quad   Case0
    .quad   Case41
    .quad   Case2
    .quad   Case3
    .quad   Case41
    .quad   Case5
    .quad   Case6

Exception tables are just one particular use of this array-of-code-addresses idea.

Example: Linux system calls

In Linux, almost all communications from user- to kernel-code are handled by a trap with exception number 128. Instead of using the exception number to select the particular action the user is requesting of the kernel, that action is identified by a number placed in %eax; this decision gives Linux space for billions of different system calls instead of just the 200 or so that would fit in an exception number and are not reserved by the hardware.

4.2.3 Exception-Like Constructs

Signals

setjmp and longjmp

Software Exceptions

4.3 Virtual Memory

4.3.1 Regions of Memory

Segments revisited

Pages

Protection in HW, OS

4.3.2 Page Tables

Single-Level

Multi-Level

4.3.3 Usage

Page Swapping

Process Swapping

4.3.4 Virtual Memory vs Cache

5 Software Impact

5.1 Program Speed

5.1.1 Fast to Write or Fast to Run?

5.1.2 Performance Doesn’t Matter Unless it Matters

Amdahl’s Law: Not the Bottleneck

Fast Enough

Responsive ≠ Fast

Fast ≠ Maintainable

5.1.3 What Optimizing Compilers Do

Static Optimizations

Just-In-Time Compilation

5.1.4 Optimization in Practice

Keep the Readable Code Too

Regression Testing

Lossy vs Lossless

5.2 Optimizing for Speed

5.2.1 Unnecessary Work

Algorithm Choice

Data-Structure Choice

Moving Work Outside Loops

Loop Unrolling

5.2.2 Assembly-Level Overhead

Function Inlining

Recursion → Loop

5.2.3 Cache Locality

Loop Reordering

Loop Blocking

Other Effects

5.2.4 Instruction-Level Parallelism

Branch Prediction

Associativity

Multiple Accumulators

5.2.5 SIMD, Fibers, and Threads