CS 3330 - Draft Text Replacement

© 2016-08-13 Luther Tychonievich

In progress; begun 2016-03-08

1 Warning

This text is in alpha draft mode. At this stage text is added ad-hoc without structure or proof reading. Some of its assertions may be falsehoods or dummy text I have inserted as a temporary placeholder. Do not trust anything this text says without independently verifying it from another, more reliable source.

You have been warned.

2 Systems-level Programming

2.1 C

There are at least three reasons that every software developer should learn C.

C is pervasive. It is used to write some of the largest code bases in the world and some of the most-used tools in existence. If you ever write low-level code, you will almost certainly at least see C.
C is close to, but cleaner than, assembly. There exists a simple, direct mapping between C and assembly, no matter which assembly you are looking at. This makes it a good language for exploring how software and hardware interact.
Almost every other language speaks C. Because C is simple and pervasive, it is hard to find a language that lacks a C interface. If you want to make a library that everyone else can use, write it in C.

There are also reasons that no software developer should use C, notably that the language’s design includes several choices that we have since learned encourage unsafe code. Lack of type safety, for example, is dangerous. D or Rust are probably better choices if you want to write low-level code, but they are neither pervasive nor close to assembly, so we’ll use C.

Aside: the Ternary Operator

Many introductions to programming omit the ternary operator ?:. The expression a ? b : c’s value is the value of b if a is true, otherwise it is the value of c. Thus the ternary operator is an expression-level conditional, as opposed to if which is a statement-level conditional.

2.1.1 `C++--`

You have learned (some) C++. C++ is almost backwards compatible with C. C has the same primitive types (the integer types char, short, int, and long, in both signed and unsigned variants; the floating-point types float and double; and the pointer type void *) and has arrays, functions, and structs; but it does not have classes or methods or operator overloading.

The lack of classes and operator overloading means most of the C++ libraries you have used do not work in C. There is no new or delete operators, just the functions malloc and free. There is no cout object with its overloaded << operator, just the function printf. Etc.

C is less type-strict than C++. It will implicitly cast all kinds of things for you: pointers and integers can be used interchangeably, for example. This freedom can be confusing, both because it allows looser coding styles than you are used to and because it does not go quite as far as you might expect. For example, under most circumstances int a[] and a int *a are interchangeable (they compile to the same assembly) but for a few operations (like sizeof that are evaluated by the compiler, not the assembler) they behave differently.

C does not have pass-by-reference. If you need by-reference semantics, pass a pointer instead.

We will not spend enough time with C to be worth enumerating or explaining all of its nuances.

2.1.2 `-ansi -pedantic-errors`

There are many flavors of C. Some of the code we’ll have you write for this class will require a pedantic interpretation of the ANSI specification of the C language. There are two aspects of this standard that experience has taught me trip up C++ programmers:

// ... is not a comment, only /* ... */ is
You cannot mix declarations and code within a function. There must exists a line above which every statement is declaring a variable and below which no statement is declaring a variable. Notably, this means that for(int i = 0; is always wrong; for is not a declaration so int i cannot follow it.

It is common for C functions to open with a set of declarations, like so:
```
int foo() {
    int a,b; float d;

    b = 0;
    for(a = 0; a < 10; a += 1)
        b += a * a;
    d = sqrtf(c + b);
    return d*(1+b);
}
```

Most C compilers will enforce these constraints if you give them the -ansi -pedantic-errors flags on the command line.

2.1.3 Optimization flags

Most C compilers have multiple levels of optimization available, specified with -O# flags on the command line. A summary of what the most common ones do:

-O0: No optimizations. All variables are stored in memory, program registers are used only as necessary. Tends to create long code with many unnecessary moves.
-O1: Simple optimizations only. Uses registers where possible, may combine a few operations into one assembly instruction if it can do so, but does not do any major reordering of code. Tends to create the easiest assembly to read.
-O2 or simply -O: Optimize for speed. Will reorder operations, remove unreachable code, add temporaries to store common subexpression results, etc. A sensible default for your own code.
-O3: Aggressive optimizations. Some of these make the code significantly longer in hopes of gaining small speedups.
-Os: Optimize for small code footprint. It may run slower but the executable file will be smaller.

2.1.4 Target Architecture

In this course we will see both of the currently-common architectures: x86-64 (also called amd64 or IA64) and one of the current ARM architectures (in particular ARMv8.1-A). To that end, it is useful to know how to cross-compile C; that is, how to run your compiler on one architecture and generate assembly for a different architecture.

The following command will have clang compile foo.c to ARMv8.1-A assembly file foo.s:

clang -target armv8.1a -mhard-float -S foo.c   # 32-bit
clang -target arm64 -S foo.c                   # 64-bit

To cross-compile using gcc requires installing a separate gcc compiler, such as the gcc-arm-none-eabi available using apt-get. ARM is less supported by GCC than it is by LLVM.

2.1.5 32- vs 64-bit

Most ISAs support numbers of various sizes: 1-byte chars and 2-byte shorts and 4-byte longs and 8-byte long longs, for example. However, most support only one size for pointers, and that fixed size has become a common identifier for the architecture as a whole. Thus we have 32-bit architectures (meaning they have 32-bit (or 4-byte) pointers) and 64-bit architectures.

Sometimes it is nice to use size of a pointer integers. C offers a size_t which is a pointer-sized unsigned integer; in most compilers this is built-in but in some you have to #include <stddef.h> to get access to it.

2.1.6 Calling Conventions

When one function calls another, how does it send in the argument values and how does it retrieve the return value when the function completes? The answer is it’s just a convention (i.e., nothing stops you from doing something else), but so pervasive a convention that hardware architectures often specify their suggested approach.

2.1.6.1 Stack-based linkage

One options is arguments and return values go on the stack. We make sure that the first argument is on the top of the stack, the next below it, and so on; this order allows variable-arity functions like printf to determine how many arguments there are based on the contents of the earlier arguments. When returning, we push the return value and then return; the calling function an find it be examining the callees stack addresses.

In practice, this means that to call foo(a,b,c) you push c, then push b, then push a, then jump to the code for foo.

32-bit x86 uses stack-based linkage for arguments. Return values that require more than 4 bytes are placed on the stack and the address to them is returned. Smaller return values use register-based linkage.

64-bit x64 uses stack-based linkage only once it runs out of registers to pass arguments in.

2.1.6.2 Register-based linkage

The other option is to put parameters and return values in registers.

In x86-64, the first six arguments are passed in %rdi, %rsi, %rdx, %rcx, %r8, and %r9; additional parameters spill over into stack-based linkage. In practice, this means that to call foo(a,b,c,d,e,f,g,h) you push h, then push g; and you put a through f into the registers %rdi through %r9; and then you jump to the code for foo.

Return values are handled in a similar way: in x86 (both 32- and 64-bit version); return values are placed in %eax or %rax.

In ARMv8, arguments are passed in r0, r1, r2, and r3, with additional parameters on the stack; and return values are placed in r0.

2.1.7 Memory Layout

From the programmer’s point of view, memory is just a huge array of bytes. We’ll explore a lot about this later, but there are a few items to know now.

2.1.7.1 Alignment

Most memory accesses are for values 4-bytes long or larger. Because of this, memory chip manufacturers have optimized their designs so that if you ask for an address that is a multiple of 4, you get an answer faster than if you ask for an address that is not a multiple of 4.

Not all memory chips have the optimization listed above and even for those that do, the actual performance is more nuanced than described above, but the trend is strong enough that it is typical to align values to a four-byte boundary. Thus something like struct{ char a; int b; char c; } is likely to be 12-bytes, not 6-bytes, in memory.

It is possible to override alignment in C, but rarely wise to do so. Bytes are often cheaper than cycles.

2.1.7.2 Stack, heap, data, text

The average user program uses memory for four kinds of things:

The code itself. The region where code is stored is traditionally called the text region of memory.
Global variables. It is common to treat initialized and uninitialized global variables differently as a way of optimizing the executable size. The initialized globals are stored in what is called the data region of memory, while the uninitialized are in the bss region.
The stack. Not all languages use a call stack (some older languages cannot support recursion and use bss for activation records; some newer languages store activation records on the heap) but most do; meaning they need a region of memory for the stack to grow into.

The stack stores local variables, parameters, return values, return addresses; but so do program registers. With few exceptions, the compiler gets to decide what is in registers and what goes on the stack; which are where is rarely determinable from source code alone.
The heap. Anything you create with new in C++ or malloc in C goes into the heap.

2.1.7.3 Shared libraries, kernel code, padding

There are also three other regions of memory that exist, but are not directly related to code the programmer wrote.

Shared libraries. One of the stages of compilation is linking, which is what attaches your code to built-in and imported library code like printf. Some linking is static, adding functions to the text region, but others is dynamic or shared, adding functions to a special region of memory that can be used by multiple running programs simultaneously.
Kernel code. There are some bits of code that must always be in memory at the same address all the time no matter what program is running; an example is the code that reacts to a packet of data arriving from the mouse. This is called the kernel code and is stored in its own region of memory.
Unused padding. In C (and most other languages) NULL is a synonym for 0. Thus, if a is null then something like a[15] or a->foo is almost certainly a mistake (null pointers don’t store data) but literally means just access to some small address. These mistakes are so common that it is typical to block off all the small addresses are unused padding so that null-pointer dereferencing goes into the unused space and not into some other used part of memory.

2.1.7.4 Segments

The different regions of memory mentioned above are more than just conceptual ideas; they have software realization in the operating system’s list of segments. Each segment identifies a range of addresses and what operations are permitting on those segments. For example, it is common for the text segment to be executable but not writable, the heap segment to be writable but not executable, etc.

Segments are not a perfect match for the conceptual regions. For example, the data and bss regions may be combined into a single segment, or the data region might be split into two segments: one that is writeable and one that is not. String literals are one example of data that is typically stored in a read-only segment of memory.

Segments are defined by software so that their size and placement can be customized to each program’s needs. However, having software check the segment permissions for every memory address before it is accessed is prohibitively time consuming. Thus, hardware has a mechanism for checking permissions of addresses in a way that only occasionally needs software input. We’ll explore more about this hardware/software interaction with memory later.

Approximate Layout of memory as implemented on common Linux systems.

Note that both the stack and heap frequently grow and shrink during program execution. The stack in x86 grows down (i.e., push makes the address in the stack pointer smaller); the heap traditionally grows up (i.e, each malloc generally returns a larger address than the malloc before it).

Part of the reason for the different directions is that the shared code region also sometimes grows during execution due to runtime linking; it is advantageous to have the unallocated stack and heap space adjacent to the shared code region to allow that growth.

2.2 Assembly + Debugging

This section assumes you have already learned the basics of assembly code and debugger usage. It is intended as a review, not a first exposure.

2.2.1 ISA vs Assembly

For much of this course we will be dealing with ISAs (Instruction Set Architectures), meaning the design of the processor at the level of what instructions the processor understands, along with the encoding and semantics of each instruction.

Assembly languages are mostly just mnemonic wrappers around instructions in an instruction set, using abbreviations (like pop %r14) instead of the byte values those abbreviations represent (like 41 5e).

Assembly code offers one feature absent from assembled instructions: labels. In the ISA you can’t say jump to foo, only jump to address 0x123, and addresses tend to change during coding so assembly lets you define labels like foo: to say I’m calling the address of the next thing, wherever it ends up, foo and lets you use label names anywhere the ISA expects an immediate value (e.g., jmp foo).

The previous sentence also raises a vocabulary item: the concept you learned to call literal values in programming is (almost) the same as what assembly and machine code calls immediate values: if a value appears directly in the assembly or instruction encoding, it is an immediate value.

2.2.2 Common Assemblies

Assembly languages, like most parts of computing, is subject to changing fashions and technologies. At the time this document was written (a.d. 2016) there are three main assemblies in use.

We’ll compare how each of the assemblies represents the foo function in the following file:

char c = 0;
int i = 2;
long l[8];

void foo() {
    c += i;
    l[i] += c;
}

2.2.2.1 x86 Intel-syntax

This is the default in Microsoft toolchains.

Intel-syntax x86 and x86-64 assembly uses the format operation destination, source. It uses the same operation name for all sizes (e.g., mov is used for 8-, 16-, 32- and 64-bit moves). Register names are given in raw form; memory access uses size PTR address format.

example:

foo:
    mov eax, DWORD PTR i[rip]
    mov edx, eax
    add dl, BYTE PTR c[rip]
    mov BYTE PTR c[rip], dl
    cdqe
    movsx   rdx, dl
    add QWORD PTR l[0+rax*8], rdx
    ret

Information is typically quite dense in assembly; each letter might mean something. For example, the instruction movsx above means move, sign-extending. How much to sign-extend is determined from the operand sizes: from a 1-byte register (dl) to an 8-byte register (rdx).

In Intel syntax, the following names are used for different operand sizes:

bytes	English	assembly	register
1	byte	`BYTE`	`al`
2	word	`WORD`	`ax`
4	double word	`DWORD`	`eax`
8	quad word	`QWORD`	`rax`

2.2.2.2 x86 AT&T-syntax

This is the default in GNU toolchains.

ATT-syntax x86 and x86-64 uses the format operation source, destination. It uses single-letter suffixes on operations to indicate the size of the operands. Register names are prefixed by %, immediates by $; memory is indicated using ().

example:

foo:
    movl    i(%rip), %eax
    movl    %eax, %edx
    addb    c(%rip), %dl
    movb    %dl, c(%rip)
    cltq
    movsbq  %dl, %rdx
    addq    %rdx, l(,%rax,8)
    ret

Information is typically quite dense in assembly; each letter might mean something. For example, the instruction movsbq above means move, sign-extending, from byte to quad-word. The size in the operation is also given by the size in the operands.

In AT&T syntax, the following names are used for different operand sizes:

bytes	English	assembly	register
1	byte	`b`	`%al`
2	word	`w`	`%ax`
4	long word	`l`	`%eax`
8	quad word	`q`	`%rax`

2.2.2.3 ARM

There are many ARM assemblies; common as of 2016 are ARMv7 and ARMv8, the latter having both 32- and 64-bit versions. ARM assembly has several differences from x86-64, including

three-operand mathematical operators with destination first (i.e., and r1, r2, #255 means r1 = r2 & 255),
# prefix on immediate values, and
register-then-memory ordering for both load and store (e.g., ldrb r2, [r1] means *r1 = r2 but strb r2, [r1] means r2 = *r1).

Size of operation in ARM is given only by (optional) size suffixes on the operands (no suffix means to use the default width for the architecture); in 32-bit ARM registers (mostly) named r (e.g., r0); in 64-bit ARM, 32-bit registers are named w and 64-bit x (e.g., w0 and x0). Both variants use 32-bit registers for 16- and 8-bit values as well.

32-bit example:

foo:
    ldr   r0, .i
    ldr   r1, .c
    ldr   r0, [r0]
    ldrb  r2, [r1]
    add   r2, r2, r0
    strb  r2, [r1]
    ldr   r1, .l
    and   r2, r2, #255
    ldr   r3, [r1, r0, lsl #2]
    add   r2, r2, r3
    str   r2, [r1, r0, lsl #2]
    bx    lr

64-bit example:

foo:
    adrp  x8, i
    ldrsw x8, [x8, :lo12:i]
    adrp  x9, c
    ldrb  w10, [x9, :lo12:c]
    adrp  x11, l
    add   x11, x11, :lo12:l
    ldr   x12, [x11, x8, lsl #3]
    add   w10, w10, w8
    strb  w10, [x9, :lo12:c]
    and   w9, w10, #0xff
    add   x9, x12, w9, uxtw
    str   x9, [x11, x8, lsl #3]
    ret

It is worth noting that the two languages are significantly different; while x86-64 is a superset of x86 with a few extra registers and instructions, AArch 64 is a redesign compared to AArch 32. Also notice the two-step address computation in AArch 64, needed because addresses are twice as long as instructions in that architecture.

2.2.2.4 LLVM

The LLVM family of compilers is gaining popularity and is becoming the default on some platforms. The design of LLVM (an acronym for Low-Level Virtual Machine) is based around an assembly-like intermediate language. LLVM intermediate language assembly is unlike machine architecture assembly in several ways, notably that it includes some typing information and has an unlimited number of registers. You will likely never see LLVM pseudo-assembly in the wild, but in case you are interested I’ve included an example of what it looks like here:

define void @foo() #0 {
  %1 = load i32* @i, align 4, !tbaa !1
  %2 = load i8* @c, align 1, !tbaa !5
  %3 = zext i8 %2 to i32
  %4 = add nsw i32 %3, %1
  %5 = trunc i32 %4 to i8
  store i8 %5, i8* @c, align 1, !tbaa !5
  %6 = sext i8 %5 to i64
  %7 = load i32* @i, align 4, !tbaa !1
  %8 = sext i32 %7 to i64
  %9 = getelementptr inbounds [8 x i64]* @l, i64 0, i64 %8
  %10 = load i64* %9, align 8, !tbaa !6
  %11 = add nsw i64 %6, %10
  store i64 %11, i64* %9, align 8, !tbaa !6
  ret void
}

2.2.3 Low-level Debugging

2.2.3.1 `gdb` and `lldb`

2.2.3.2 `objdump`

2.2.3.3 …

2.3 Data Revisited

2.3.1 Numbers

This section assumes you have already learned the basics binary represenation of numbers and floating-point representations. It is intended as a review, not a first exposure.

2.3.1.1 Binary, Hexadecimal, and Octets

Processor hardware operates in base-2 or binary. It does this because it takes less space and power to distinguish between two voltage states (high and low) than between three or more. Anything we might gain from 3- or higher-state logic we can also gain by using several bits together.

There are two common cluster-of-bits sizes that are also used: the 4-bit nibble or nybble and the 8-bit byte or octet.

2.3.1.1.1 Binary

Digits are taken from the set of bits: {0, 1}. Place-value notation is used with base 2. Thus the number 110100000010₂ (or 0b110100000010) represents the number

2¹¹	2¹⁰	2⁹	2⁸	2⁷	2⁶	2⁵	2⁴	2³	2²	2¹	2⁰
1	1	0	1	0	0	0	0	0	0	1	0

2¹¹ + 2¹⁰ + 2⁸ + 2¹ = 2048 + 1024 + 256 + 2 = 3330

Math works the same in base-2 as it does in base-10:

            1      11      11    1 11    1 11 
  1011    1011    1011    1011    1011    1011
+ 1001  + 1001  + 1001  + 1001  + 1001  + 1001
------  ------  ------  ------  ------  ------
             0      00     100    0100   10100

2.3.1.1.2 Hexadecimal

Digits are taken from the set of nibbles: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}. Place-value notation is used with base 16. Thus the number d02₁₆ (or 0xd02) represents the number

16²	16¹	16⁰
d	0	2

d × 16² + 1 × 16⁰ = 13×256 + 2×1 = 3328 + 2 = 3330

Hexadecimal is particularly noteworthy because it is easy to convert to and from binary: each nibble is simply 4 bits. Thus

d	0	2
1101	0000	0010

Hexidecimal is commonly used by humans who want to communicate in binary without getting lost in the middle of a long string of 0s and 1s.

Math works the same in base-16 as it does in base-10:

            1       1       1    1  1    1  1 
  d0b2    d0b2    d0b2    d0b2    d0b2    d0b2
+ 300f  + 300f  + 300f  + 300f  + 300f  + 300f
------  ------  ------  ------  ------  ------
             1      c1     0c1    00c1   100c1

2.3.1.1.3 Octets

1 octet = 2 nibbles = 8 bits. Octets are typically written using pairs of nibbles directly; thus the digits are taken from the set of bytes: {00, 01, 02, … fd, fe, ff}. I have never yet seen a human do math directly in base-256.

Octets are noteworthy because, while processor mathematical circuits use base-2, most of the rest of computing is based on octets instead. Memory, disk, network, datatypes, assembly—all used bytes, not bits, as the smallest unit of communication.

2.3.1.1.4 Base-2 logs and exponents

Power-of-two numbers show up a lot in hardware and hardware-interfacing software. It is worth learning the vocabulary used to express them.

Value	base-10	Short form	Pronounced
2¹⁰	1024	Ki	Kilo
2²⁰	1,048,576	Mi	Mega
2³⁰	1,073,741,824	Gi	Giga
2⁴⁰	1,099,511,627,776	Ti	Tera
2⁵⁰	1,125,899,906,842,624	Pi	Peta
2⁶⁰	1,152,921,504,606,846,976	Ei	Exa

In all cases above, the i is usually dropped to just say (e.g.) G instead of Gi. The i clarifies that we mean the base-2 power, not the base-10 power. G could mean either 2³⁰ or 10⁹, numbers the differ by about 7%; which one is meant can only be guessed from context unless the clarifying i is present (which it usually is not).

Only the first three rows in the table above (K, M, and G) will show up very often in this course. We’ll never ask you to know the contents of the base-10 column during this course.

2²⁷ = 2⁷ 2²⁰ = 128M. This pattern works for any power of 2: the 1’s digit of the exponent becomes a number, the 10’s digit becomes a letter. Thus

Value	Split	Written
2²⁷	2⁷ 2²⁰	128M
2³	2³ 2⁰	8
2³⁹	2⁹ 2³⁰	512G

If you memorize the values of 2⁰ through 2⁹, you can then do these exponentiations in your head. That is worth memorizing; these things show up too often to be worth wasting thought on all the time; memorize them now and you’ll have less cognitive load in the future.

Logarithms with base 2 (written log₂(n) or lg(n) or sometimes just log(n)) do the inverse of this: lg(64G) = lg(2⁶ 2³⁰) = lg(2³⁶) = 36. Again, these show up so often that you should be able to do them automatically.

Practice

Fill in the rest of the following table.

Exponent	Written As
17	128K
3
38
11
	256M
	16G
	32

2.3.1.2 Three kinds of Negatives

How should negative numbers be represented in hardware? Three answers are often taught in courses like this, and three are actually used, but they are not the same three. We’ll explore each in terms of base-10 numbers first, then see how they map to binary.

All of these solutions depend on access to a finite number of bits/digits/nibbles/bytes to store numbers in. Our examples will assume we have four digits.

Two’s Compliment

This version picks a number (typically half of the maximum number we can write, rounded up) and decides that that number and all numbers bigger than it are negative. How negative? To understand that, we need to observe a ring-like behavior of numbers.

What is 0000 - 0001 ? The answer is based on place-value subtraction’s notion of borrowing: 0 - 1 is 9 borrow 1. We end up borrowing from all spots to get 0000 - 0001 = 9999 with a stray borrow overflowing and getting lost due to finite precision. Since 0000 - 0001 is obviously negative 1, we set 9999 to be negative 1. One less than negative 1 is negative 2 = 9998, and so on.

Two’s compliment is nice because the three most common mathematical operators (addition, subtraction, and multiplication) work the same for signed and unsigned values. Division is messier, but division is always messy.

Two’s compliment is used for signed integers.

Visualization of two’s compliment numbers in binary.

Zero is all zero bits. Adding and subtracting 1 makes numbers one larger or smaller, except at the transition between 01...1 and 10...0.

There is one more negative number than there are positive numbers.

Biased

This version picks a magical fixed number (typically half of the maximum number we can write, rounded down) and calls it the bias. For our numbers (0000 through 9999) that bias would be 4999. We then define the meaning of number x to be the value of x − bias. Thus seven is written 5006 and negative seven is written 4992.

Biased numbers are nice in that addition and subtraction and comparison operators (< and > and their friends) work exactly like they do for unsigned numbers. However, multiplication and division are messier and humans typically have some difficulty seeing the value in the representation.

Biased numbers are used for the exponent of floating-point numbers.

Visualization of biased numbers in binary.

Zero is 01...1. Adding and subtracting 1 makes numbers one larger or smaller, except at the transition between 1...1 and 0...0.

There is one more positive number than there are negative numbers.

Sign bit

This is the version we all learned in grade school. Negative seven is written exactly like 7, but with a special symbol in the most-significant digits place: -7. With our fixed-width constraint, we’d write that as -007 or +007.

The main downside to sign-bit values is that they have two zeros: -000 and +000. These two are written differently but have the same numerical value, so should -000 == +000 be true or false? We can make any decision we want, of course, but ambiguity is not popular.

Sign bits are used in floating-point numbers.

Visualization of sign-bit numbers in binary.

There are two zeros and two discontinuities.

The negative numbers move backwards, increasing in value when the unsigned representation decreases.

Sign-bit integers, as displayed here, are not used in any common hardware.

There is also a representation called Ones’ compliment that is often taught but that is not used by common desktop, laptop, tablet, or phone hardware.

In C and most modern hardware,

Sign Bit: is used only with floating-point numbers and is in the highest-order-bit’s position. 0 means + and 1 means −.
Biased: is used only with the exponent of floating-point numbers. The bias always has the form 011...11.
Two’s Compliment: is used for all signed integer types (signed char, short, int, long). If the high-order bit is set, the number is negative; if not, it is non-negative. Negating a signed integer can be done by flipping all the bits and adding 1 to the result.

Practice

Fill in the rest of the following table. Assume you are using 6-bit numbers.

Decimal	Two’s-C	Biassed
5	000101	100100
−5	111011	011010
11
−1
	110011
	011111
		101111
		010000

2.3.1.3 IEEE-style Floating Point

A floating-point number consists of three parts:

A sign bit; 0 for positive, 1 for negative. Always present, always in the highest-order bit’s place, always a single bit.
An exponent, represented as a biased integer. Always appears between the sign bit and the fraction.

If the exponent is either all 1 bits or all 0 bits, it means the number being represented is unusual and the normal rules for floating-point numbers do not apply.
A fraction, represented as a sequence of bits. Always occupies the low-order bits of the number. If the exponent suggested this was not the normal-case number, may have special meaning.

There are four cases for a floating-point number:

Normalized

The exponent bits are neither all 0 nor all 1.

The number represented by s eeee ffff is ± 1.ffff × 2^{eeee − bias}. The value 1.ffff is called the mantissa.

Denormalized

The exponent bits are all 0.

The number represented by s 0000 ffff is ± 0.ffff × 2^1 − bias. The value 0.ffff is called the mantissa. Note that the exponent used is 1 − bias not 0 − bias.

Infinity

The exponent bits are all 1 and the fraction bits are all 0.

The meaning of s 1111 0000 is ±∞.

Not a Number

The exponent bits are all 1 and the fraction bits are not all 0.

The value s 1111 ffff is Not a Number, or NaN. There is meaning to the fraction bits, but we will explore it in this course.

Practice

Complete the following table. Assume 8-bit floating-point numbers with 3 fraction bits.

Fraction	Binary	Binary Scientific	Bits
7 / 8	0.111	1.11 × 2⁻¹	0 0110 110
−∞			1 1111 000
1		× 2
	−11110000	× 2
		× 2	0 0001 000
		× 2	0 0000 111
		1 × 2⁻⁹
		× 2	0 1111 111
		× 2	1 0000 000

2.3.2 Aggregates

There are two ways to aggregate types into larger types. You can put multiple values adjacent to one another in memory, or you can store pointers from one value to another.

2.3.2.1 Bytes and Endianness

Processors handle numbers in binary. A binary number has a high-order bit and a low-order bit, with many bits in between. Sometimes we think of the high-order bit as first (for example, when we write it we write the high-order bit first) and sometimes as last (for example, when we do math we get to the high-order bit last).

Memory handles numbers in bytes, base-256. A byte-string number has a high-order byte and a low-order byte, generally with several bytes in between. Again, first and last are not stable ideas when it comes to numbers.

Memory also handles everything else as bytes, which means it does identify the first and last byte in any byte-string. Which means memory can tell us which byte of a number is first: the answer is not a property of the number, just a property of memory layout. The layout is chosen by the processor when it reads or writes bytes in memory.

x86 and x86-64 have decided that the low-order byte gets the low-order address and is thus first in memory. Network protocols have decided that the high-order byte goes over the wire first and is thus first. Other processors and protocols have made their own decisions in these issues. ARM notably can be configured either way.

If the high-order (big value) byte is first, we call the system Big-endian. If the low-order (little value) byte is first, we call the system Little-endian. Both apply only when writing single multi-byte values to memory.

Aside: Arabic Numerals
In English (and most other languages today) we use base-10 and Arabic numerals; thus thirty-five is written 35. The high-order digit is on the left, the low-order digit is on the right. When you put Arabic numbers inside of English text it thus looks big-endian because English is read left-to-right. However, when you put the same number inside of Arabic text it looks little-endian because Arabic is read right-to-left.

Example

Consider the array short x[2] = { 0x1234, 0x5678 };. If &x[0] == 0x108, what byte is in memory in the following locations?

Endianness	107	108	109	10a	10b	10c
Little	?	34	12	78	56	?
Big	?	12	34	56	78	?

Observe that only the order of bytes within a number changes based on endianness; the following are all the same in both:

Each short occupies two bytes.
The address of a short is the address of its first byte, whichever byte that is in the endianness being used.
The elements of an array are in the same order in memory as they are in code
&x[1] == 0x10a (which is 0x108 + sizeof(short)).
It is bytes, not nibbles or bits, that are placed in memory, so we still see byte 0x34 (0b00110100) not nibble-reversed 0x43 (0b01000011) nor bit-reversed 0x2c (0b00101100).

2.3.2.2 Structs and Packing

In C, a struct is a plain-old-data or POD type, meaning it acts (to the degree possible) like a primitive type: passed by value in registers or on the stack as the compiler choses. If you force it to be placed in memory (either by using malloc or by requesting its address) a struct will be aligned, which means that it may take more space than its component parts would seem to require.

Most dialects of C have mechanisms for packed structs which bypass the alignment restrictions. Some of these even let you pack multiple values into a single byte.

To force the compiler to align on one-byte boundaries, use a set of pre-processing #pragmas:

#pragma pack(push)
#pragma pack(1)
struct thing {
    char one;
    int four;
    char one;
};
#pragma pack(pop)

This will make the struct smaller (just 6 bytes long), but also (most likely) slower.

To have sub-byte elements in a struct, use : after the field name:

struct tiny {
    char bit:1;
    char nibble:4;
    short rest:13;
};

This will result in a 2-byte struct where the fields are stored in parts of the bytes. Called bitfields, these are a dense way of storing information but are much slower than ordinary structs because the compiler has to insert various shifts and masks to get at the part of each field being accessed each time it is accessed.

Because structs are treated like primitives, it is worth thinking about how you use them. Because they can be placed in registers, they can be faster than pointed-to data; but because they are passed by value, there is significant overhead to passing large structs to and from functions. Because of the function overhead, it is common to allocate larger structs on the heap using malloc and pass them by pointer, not by value.

2.3.2.3 Arrays and Pointers

An array is simply multiple values of the same type held in contiguous memory locations. The value of an array is the address of its first element. Although C does treat int x[] and int *x different in a few cases, they are mostly interchangeable.

The syntax x[i] means exactly the same thing as the syntax *(x+i), that is the value in memory at address x + i × sizeof(*x). Note that C does the sizeof-computation for you whether you type x[i] or *(x+i). One side effect of that automatic offsetting by the size of the element is that ((char*)x) + 2 is a different address than is ((int*)x) + 2.

A side effect of the value of an array being the address of its first element is that you cannot directly pass an array by its content’s values. struct and primitive types are passed by value, but arrays are passed by address.

2.3.2.4 Case Study: 3 kinds of lists

An important consideration when using lists is how to determine how many elements the list contains. This is interesting in part because the decision made by C is not the decision made by most more recent languages.

Let us consider four options:

Use an array where the length of the array can be determined at compile time.
Use an array where the length of the array is stored in memory or a register.
Use an array where the last element of the array has a special value called the sentinel that, when encountered, means this is the end of the list.
Use a linked list.

There are other kinds of lists, but these are a reasonable sampling. Lists whose size can be determined at compile time are of limited interest, so this section devotes particular attention to the other three.

2.3.2.4.1 Length Array

A length array of TYPE values, also called a dynamic array or an array list, can be stored in C as

typedef struct { size_t length; TYPE *ptr; } lenarr;

Length arrays have the following properties:

They require two registers or a chunk of stack memory equal to two void *s to pass into and out of functions.
They require only enough space to store their contents on the heap.
Determining the length and accessing any element is O(1):
```
length   = x.length;

x.ptr[i] = x.ptr[j];
```
Inserting a new element is O(n) and typically requires allocating O(n) more heap memory.
A sublist can be extracted from a larger list without modifying the original or allocating any heap memory:
```
lenarr middle;
middle.length = x.length / 3;
middle.ptr = x.ptr + middle.length;
```

2.3.2.4.2 Sentinel Array

The end of a sentinel array is denoted by a special value of the type stored in the array. The most common sentinel array in C is the native string type: "hi" is stored in memory as the array {104, 105, 0} where 104 is the ASCII code for h and 0 (or \0) is the null byte, the designated sentinel value for strings.

A sentinel array of TYPE values is just a TYPE *, but it requires a known sentinel that is guaranteed not to occur in the list except as the sentinel. Besides strings, the most common sentinel arrays store pointers and use a NULL pointer as a sentinel.

Sentinel arrays have the following properties:

They require just a single pointer be passed into and out of functions.
They require one extra spot to be allocated in the heap beyond that needed for their contents.
Determining the length is O(n); accessing a particular element is O(1):
```
length = 0;
while(x[length] != sentinel)
    length += 1;

x[i] = x[j];
```
Inserting a new element is O(n) and typically requires allocating O(n) more heap memory.
A sublist can be extracted from a larger list
- without modifying the original or allocating any heap memory if the sublist is a tail of the original:
```
TYPE *end = x + (length_of(x) / 2);
```
- by destroying the original or allocating memory if the sublist is not a tail of the original:
```
x += length_of(x) / 3;
x[length_of(x) / 2] = sentinel;
```

2.3.2.4.3 Linked List

A (singly) linked list of TYPE values can be stored in C using nodes defined as

typedef struct { TYPE payload; node *next; } node;

A linked list is then passed as a node *.

Linked lists have the following properties:

They require just a single pointer be passed into and out of functions.
They require a constant multiple more heap memory than their contents alone require.

Determining the length and accessing any element is O(n):

length = 0;
for(p = x; p != 0; p = p.next)
    length += 1;

for(cnt = 0, p1 = x; cnt < i; cnt += 1) p1 = p1.next;
for(cnt = 0, p2 = x; cnt < j; cnt += 1) p2 = p2.next; 
p1.payload = p2.payload;

Inserting a new element is O(1) and typically requires allocating O(1) more heap memory.

A sublist can be extracted from a larger list
- without modifying the original or allocating any heap memory if the sublist is a tail of the original:
```
node *end = x;
for(i = 0; i < length_of(x) / 2; i += 1)
    end = end.next;
```
- by destroying the original or allocating memory if the sublist is not a tail of the original.
```
len = length_of(x);
for(i = 0; i < len / 3; i += 1)
    x = x.next;
for(i = 0, p = x; i < len / 3; i += 1)
    p = p.next;
p.next = NULL;
```

2.3.3 Code

In addition to data, we need to be able to encode code as bytes in memory too. There are many ways to do this, both because new hardware enables new operations that need new encoding and because of the burden that is backwards compatibility.

2.3.3.1 Variable-width or Fixed-width?

In ARM assembly, every instruction is 4 bytes long. In x86-64, some are 1 byte and some more than a dozen bytes long.

With fixed-width instructions like ARM, it is easy to know where an instruction will be, simplifying instruction memory hardware. Fixed width instructions are typically also use simple patterns of bits internally to mean the same things in each instruction, meaning less hardware is needed to parse the instructions and determine what they are asking the processor to do. On the flip-side, there are limitations in how much information you can put in an instruction; notably, this means that some simple instructions like x = 0x12345678; might not be encodable as a single instruction due to the inability to store the action (setting a register value) and the register and the 32-bit constant all in a single 4-byte value.

One of the best-selling books on hardware has made a big deal out of fixed-width being better than variable-width, but it is not black-and-white.

2.3.3.2 Many or Few?

How many instructions should exist? Obviously we want to be able to request every operation we expect to see (division, storing to memory, etc.) but we also don’t want there to be so many instructions that we have to make hardware to support instructions that very rarely get executed…

This argument seems to have been pretty decidedly won: more instructions is better. Both ARMv8 and x86-64 have more than a thousand distinct instructions. Many textbooks still talk about large instruction sets as bad, but the success of processors shows this argument to be in error. If you are designing your own low-resource embedded processor you should probably go for simplicity, but those who make processors for hundreds of millions of devices do not do so.

2.3.3.3 RISC vs CISC

Around 1980 some researchers at Stanford and Berkeley coined a couple of acronyms that have caught on, despite not describing any real hardware. The first was RISC (Reduced Instruction Set Computing), which is what they called their ideal systems; the other was CISC (Complex Instruction Set Computing), meaning that set of things that they they didn’t like about commercial hardware of their time.

The usual breakdown of the two terms is as follows; I’ve also added a winner column for which option prevails in currently (2016) popular hardware.

Topic	RISC	CISC	Winner
Number of instructions	< 100	> 200	CISC
Number of program registers	≥ 32	≤ 8	RISC?
Program registers	all general-purpose	some have an assigned purpose	CISC
Cycles per instruction	1	varies by instruction	CISC
Bytes per instruction	all the same (usually 4)	varies by instruction	?
Immediate Values	limited (not enough bits to represent all)	unlimited	?
Memory addressing modes	1 register, 1 immediate	2 registers, 1 immediate, 1 shift	CISC?
Mathematical operands	both must be registers	≤1 may be memory	?
Code limitations	some instructions may not follow others	no constraints	CISC
Conditional operations	boolean operands	condition-code registers	CISC
Linkage convention	register-based	stack-based	RISC

There are also architectures that do have significant design decisions beyond those in this table, such as the compiler-controlled instruction scheduling of the Itanium architecture, but they are beyond the scope of this course.

2.3.3.4 Examples

TODO: add this part

3 Hardware Design

3.1 Hardware Components

Processor hardware is built out of three main conceptual pieces: wires, gates, and registers. These behave in some ways much like variables, operators, and data structures but in other ways they are distinct. Understanding their semantics is key to understanding hardware design.

3.1.1 Current, Voltage, Power, and Heat

Ignoring many of the nuances of electricity and speaking only in broad generalities,

Current

is the rate at which electrons flow. In practice it is not always electrons that are the conceptual charge carriers and amperage does not map directly to electrons/second, but that is the general idea.

If electrical state changes, current is involved.

Voltage

is the desire electrons have to flow; the electrical pressure, if you will. Voltage does not always cause current because it contends against resistance: not all substances conduct electricity very easily.

Most elements of computer hardware are sensitive to voltage. The 1s are high voltage and 0s are low voltage. How high depends on the underlying design of the components, but chips with highs of around 1 volt are common today (2016). Low voltage is generally close to 0 volts.

Power

is current × voltage. It does not require power to maintain voltage without current, nor to maintain current without voltage.

Since chip state is voltage and change in state requires current, every time that a processor changes state it uses power. In practice some power is used even between state changes too, since it is hard to maintain voltage in a changeable system without using some current leakage.

Heat

is (by the law of entropy) an inevitable byproduct of power. It is also a major concern for chip manufacturers. Chips are so small and the power they use so concentrated that even a small amount of heat can result in a large temperature change. Electrical properties tend to change with temperature, and notably we have trouble making components that function at higher temperatures.

It is thus vital that chip temperature be kept low enough that the chip continues to function. But to reduce heat we must reduce power; and to reduce power we either need to reduce voltage (meaning more sensitive and harder-to-engineer components) or reduce current (meaning less state change and thus a reduction in chip speed).

In this course we will not explore designing for limited power usage, but the outline given above is useful to be conversant in hardware design issues. It also explains why tablets tend to be slower than servers: they have less power to draw from (a battery instead of a connection to the power plant) and less space to devote to heat dissipation (heat sinks, fans, and so on) so they cannot change as much voltage as quickly as the servers can.

3.1.2 Wires, Gates, and Muxes

3.1.2.1 Basic Components

Wires

are physical paths of conductive material, often copper. Wires have very little resistance, meaning it takes almost no current for the voltage across a wire to equalize. It is safe to assume that everything connected to a wire has the same voltage.

Typically wires are connected between one source or driver of voltage and one or more sinks or recipients of voltage. They act in some ways like assignment operators (sink = source) and in some ways like temporary variables

tmp = source; sink1 = tmp; sink2 = tmp; ...

Gates

are particular arrangements of transistors that perform certain fixed functions. The most common are the boolean gates that take one or two bits in and produce one bit out; examples include and, or, xor, not, nand, etc.

Gates operate continuously, in parallel with all other gates, changing as soon as voltage changes on their input wires. Because they operate with finite current, it takes a non-zero amount of time for the output to reflect the change in the input.

The design of individual gates is outside the scope of this class.

Multiplexers (or Muxes)

are arrangements of gates that act as selectors. The simplest mux implements the same operation as the ternary operator in C: output = selector ? input1 : input0. Although the physical arrangement of gates and wires has three inputs (selector, input0, and input1) we typically call only two of them the inputs (input0 and input1).

Being made out of gates, muxes operate continuously, in parallel with all other gates, changing as soon as voltage changes on their input wires, with some finite delay between a change of input and the corresponding change of output.

Muxes can easily be extended to have more thant two inputs, provided additional selectors are also provided. They always have a single output which is selected from the set of inputs.

A significant part of hardware design at the level we will do it is made out of muxes.

Muxes are traditionally drawn as a trapezoid with the inputs on the long side, the output centered on the short side, and the selectors entering one of the angled sides.

3.1.2.2 Multi-bit components

We will almost never refer to the basic bit-level components listed above again. Instead, we will focus on multi-bit extensions.

3.1.2.2.1 Multi-bit implementation

A multi-bit extension of a wire is just a cluster or wires run next to each other. Eight wires running together can transmit eight bits of information in their voltages. We’ll generally call such a group of wires just a wire or sometimes an 8-bit-wide wire, a wire carrying an 8-bit value, or an 8-bit wire.

A multi-bit mux has multi-bit inputs and output, all the bits of which share a single selector. They can be implemented as a separate 1-bit mux for each bit of output:

out_0 = selector ? in1_0 : in0_0;
out_1 = selector ? in1_1 : in0_1;
out_2 = selector ? in1_2 : in0_2;
...

We’ll use C-like notation for multi-bit gates and other components that emulate C operations like adders. Thus & will mean a bit-wise and, && a logical and, + an adder circuit, and so on.

3.1.2.2.2 Transitioning multi-bit values

It is almost never the case that all of the gates on all of the individual wires of a multi-bit component produce output at exactly the same speed. This means that during the transition time, the outputs of logical components cannot be trusted to be any valid value.

Suppose I have an adder with inputs 0b0001 and 0b0110; the adder is currently producing output 0b0111. Now I change the second input to 0b0111. Eventually the output will change to 0b1000, meaning all four bits change. However, I cannot in general predict the order in which the bits will change; if (for example) the two low order bits change first then there will be a moment in time when the adder is outputting 0b0100. But 0b0100 is not the correct output for either the old or the new inputs.

Not all orders of changes are possible, but it is usually the case that the outputs of a logical component goes through a period of incorrect values during each transition. It is important, then, to design circuits such that these transitional outputs are ignored and not committed to the programmer-visible state of the processor.

3.1.3 Registers

Consider the following circuit:

As we stated in the previous section, the different bits of the output of the adder appear at different times. This means that if we start with x = 0 we might see a sequence of x values like 0, 1, 0, 2, 1, …: in other words, not the sequence we want to see. To fix this we add a register to the circuit:

Registers have one input and one output, as well as a second single-bit input called the clock (drawn at the top of R in the image above; the triangle is a traditional marker for a clock input). They store a value inside them, and most of the time they completely ignore their input and output that internal value. However, at the moment when the clock input transitions from low- to high-voltage (and only at that moment) they read their input into their internal stored value.

The clock inputs of all registers across the entire chip are (conceptually) attached to the same clock, a oscillator that outputs low-voltage, then high-voltage, then low-voltage, then high-voltage, …, at predictable intervals.

Adding the register solves the unpredictable output problem. The current value of R and 1 are sent into the adder; its output jumps about a bit but stabilizes at the value R + 1; then the clock changes from 0 to 1 (called the rising clock edge) and the new value is put into R. Once the new value is in it is output and the adder output starts jumping around again, but that if fine because the register will ignore it until the next rising clock edge.

3.1.3.1 Clock Speed

The clock is important enough to have become one of the most noted stats about chips: a 2GHz clock speed means the clock oscillator completes 2 billion (2×10⁹ not 2×2³⁰) low-high cycles per second. Because virtually every part of a chip contains registers, and because all registers are attached to the clock, clock speed is a direct indicator of overall processor speed.

So, why not just add a faster oscillator to make the chip faster? There are two main reasons:

Faster clock → more changes per second → more current → more power → more heat. Some chips are designed to reduce their own clock speed either to conserve battery life or to manage their temperature.

Access to plenty of power and a more efficient coolant system can reduce this constraint.
It takes time for electrons to move across wires and through transistors. If the clock runs too fast it might grab a pre-stabilized value and cause computations to be incorrect. A faster computer that sometimes decides that 2 + 3 = 1 is not as useful as a slower one that always knows it is 5.

There are two main ways to overcome this limit on effective clock speed:
- Make the chip smaller. Smaller transistors and shorter wires → less distance for charge carriers to travel → faster operation. For many years this tactic yielded impressive gains every year, but transistors are getting so small that there is not much room to improve them in this direction anymore.
- Reduce the work done between each clock cycle. Less work → less need to wait → faster clock; we’ll learn a lot more about this as we discuss pipelining. Once chip shrinkage stopped yielding big speedups pipelining was seen as the next thing to increase, but it also has limitations and more-or-less optimal-depth pipelines have been determined.

Some people chose to overclock their chips. Roughly, they take a chip that was designed to run at one speed and they attach a faster clock to it. Since the shipped clock speed is generally a conservative estimate, sometimes this just works and the processor becomes faster at no extra cost. But sometimes the new clock speed is too fast and the computer overheats and/or fails to compute correctly. Worse yet, often the problem is not immediately manifest, perhaps only overheating after months of use or only making an error in computation on one out of every billion address computations.

3.2 Processors

3.2.1 Steps in executing an instruction

What are the steps involved in a processor executing an instruction? The basic outline is as follows:

Send an address to memory, get an instruction (encoded in bytes) back.
Parse the action being requested by those bytes.
If the instruction has program register operands, read the values of those program registers.
If the instruction does some mathematical or logical operation, perform that work.
If the instruction reads from or writes to memory, do that.
If the instruction updates program register values, write those values into the program registers.
Repeat with the next instruction, either the next in the instruction stream or the one the instruction being executed tells us to jump to.

There are traditional names for these steps, but they are not necessarily the most obvious names. Further, the set of names is smaller than the set of steps, and the allocation of tasks to the various steps is not uniform. We’ll do our best to both explain the steps clearly and indicate common terminology where possible.

3.2.1.1 Fetch-Decode-Execute

Almost all hardware design texts that I have encountered refer to the following steps:

Fetch
Decode
Execute

Some also add a few more steps, such as

Memory
Writeback
PC Update

Despite the commonality of the first three names, there is not an agreed-upon set of work for each. Nonetheless, the names are so common that it is worth knowing approximately what people mean when they use them.

3.2.1.1.1 Fetch

Fetch consists of

Retrieving from a register the address of the next instruction to load. The register that stores this address is alternatively called the Program Counter (PC) or Instruction Register (IR). We’ll use PC for the name of this register throughout this chapter.
Send that address into the memory system to retrieve the instruction’s bytes.

We will typically think of this as going to a special instruction memory: while instructions and data are both stored in the same physical memory, it is often the case that the memory provides specialized access ports optimized for instructions versus data.
Do something with the retrieved bytes.
- Some texts say that the nature of the instruction is interpreted during Fetch, while others put that in Decode instead.
- How much work to attribute to fetch depends in part on if all instructions are the same length (RISC-like) or if some instructions have more bytes than others (CISC-like).
Compute the address of the next instruction in the instruction stream, suitable for inserting back into the PC. This computation is of the form PC + sizeof(instruction).

While the computation is performed in Fetch, the provision of a value to the PC register’s input is not always performed immediately because some instructions jump to a different address instead.

3.2.1.1.2 Decode

Decode is listed as the second step of computation in every architecture text I have read, but the intersection of the work attributed to the decode steps of the texts I am familiar with is empty.

One view of decode is to parse out the bits and pieces of the instruction’s bytes, and wire them to the appropriate logic for the next step.

Another view of decode stays the parsing happened during fetch and decode’s job is to retrieve the value stored in the program register operands used by the instruction.

3.2.1.1.3 Execute

Execute always includes any math or logic operations the instructions requested. It typically also includes deciding if a conditional jump should be taken or not.

Sometimes retrieving the values stored in program registers is characterized as part of Execute, sometimes as part of Decode.

Sometimes retrieving values from or sending values to memory is characterized as part of Execute, sometimes as its own Memory step.

Sometimes writing computed or retrieved values into program registers is characterized as part of execute, sometimes as its own Writeback step.

3.2.1.2 Actions and Dependencies

Part of the reason for there being several groupings of steps in executing an instruction is that the ordering is only partially informed by the hardware. Hardware runs in parallel unless you use registers to force it not to do so, so only actual dependencies between steps cause things to have order. Many linear orderings are generally compatible with a single operation.

Take, for example, the x86-86 push %rcx command. To execute this we have to read it from memory, parse out what it wants us to do, get the values of %rcx and %rsp from the program register file, decrement %rsp by 8, write to memory at that decremented value, store the decremented value back into %rsp, and compute the address of the next instruction in the program. The order we do these in is partially constrained, but only partially; a dependency graph might look like this:

In general each instruction has its own graph of this form, and trying to display them all at once is not really feasible. The goal of ordered steps is to say if you do things in this order, all of the instructions will work out.

3.2.1.3 Muxes to the rescue

We generally want to avoid having separate hardware for each instruction. Fortunately, muxes let us share logic nicely.

Consider, for example, an adder circuit. Many instructions add: push and pop add to the stack pointer (reg1 + ±8), memory operands often involve adding a register and immediate value together (reg1 + immediate), we add when doing math (reg1 + reg2), and so on. But for any given instruction, only one of those things is happening, so we can use just one adder. For example, if we determine that we always add a register value to something, we might end up with a circuit like this:

There are relatively few component actions that a processor carries out; much of the processor design that we will look at will be combining those components with muxes to implement various instructions.

3.2.2 HDLs

Hardware Decription Languages (HDLs) are prevalent throughout hardware design. Two are particularly common: VHDL (VHSIC Hardware Description Language; VHSIC is stands for Very High Speed Integrated Circuit) and Verilog (whose name suggests it’s original purpose, supporting the formal verification of digital logic circuit designs). Both can seem quite verbose at the level of detail we will be working, having extensive support for modeling lower-level details like how adders work as well as ancillary material like how memory operates and how registers react to the clock.

TODO verify the historicity of verilog

We will use a relatively small subset of Verilog in this course, with a few add-ons to simplify common tasks like defining a clocked register.

TODO Verify that I want to do that… I have HCL2D working almost like I want it (some issue with using a built-in output inside a mux); do I want to roll out something new? Teach them enough Verilog to be dangerous?

3.2.2.1 Basic Verilog

Literals

Verilog has two kinds of literal values: sized and unsized.

Unsized values have two forms:
- plain decimal literals (like 3330)
- apostrophe base literal
  - 'b110100000010 (binary)
  - 'd3330 (decimal)
  - 'hd02 (hexadecimal)
An unsized value will be adjusted to fit its target wire or register’s size. It also has no defined bit-width, so some operations that depend on size (like concatenation and splitting) do not work on it.
Sized values (size is in bits) have one form: size apostrophe base literal
- 64'b110100000010 (64-bit binary)
- 64'd3330 (64-bit decimal)
- 64'hd02 (64-bit hexadecimal)
A sized value may only be assigned to wires and registers of the appropriate size.

Apostrophe literals (with or without size) may include underscores as visual separators, which are ignored by verilog. Thus 'b1101_0000_0010, 'h__d_02, and 'd3330 are all the same value.

Mux

wire o = s ? i1 : i2;

There are constraints on where muxes can be defined; we’ll generally restrict them to being an initializer for a wire or being one of the values of another mux.

Wire

wire[63:0] name;

The values in the brackets are the legal bit-indices; the first number is the index of the highest-order bit, the last of the lowest-order bit.

D-style flip-flop

reg[63:0] r; 
wire[63:0] r_in; 
always @(posedge clk) r = r_in;

This is the building-block of most registers: whenever the clock (which needs to have been defined elsewhere) has a positive edge (i.e., goes from 0 to 1) the input is stored into the register.

The output of a flip-flop is just the name of the reg (r in this example). The use of a reg without the corresponding always is beyond the scope of this course.

Pipeline register

reg[63:0] F_pc;  // storage and output signal
wire[63:0] x_pc; // input signal

wire stall_F;    // must be 0 for input to be accepted
wire[63:0] default_pc = 0;
wire bubble_F;   // if 1, uses default_pc instead of x_pc

always @(posedge clk) F_pc = stall_F  ? F_pc 
                           : bubble_F ? default_pc 
                           : x_pc;

A pipeline register is a three-option mux around a D flip-flop. Like a flip-flop, it waits for the rising clock edge to do anything. On that edge it checks two control bits to decide how to update the register value.

TODO: flesh this out or abandon verilog.

3.2.2.2 Our HDL

We will use a simplified language based on Verilog to explore the design of hardware. We won’t use Verilog itself in part because we do not want to need to be explicit about rising edges and clock delays and so on and in part because Verilog lets you override the simulator functionality, which we do not want to allow.

3.2.2.2.1 Clocked Registers

Our testing harness will automatically generate a clock signal, select its delay based on the complexity of the work being done within each cycle, and attach it to all registers. Thus, to define a rising-edge-triggered register all you need to do is give it a name and a size (it defaults to 1 bit if not size is given)

reg myOneBitRegister;
reg[63:0] my64bitRegister;

The value stored in the register is available directly from the register name, but you cannot set the value stored in the register directly; you’ll need to put a value into the (implicitly defined wire of the same width as the reg) registerName_in and wait for the next rising clock edge to see the results.

3.2.2.2.2 Wires and Muxes

3.2.2.2.3 Pipeline Registers

Our testing harness provides a built-in function addPipelineControl that takes two 1-bit wires and one or more registers and adds pipeline control signals to those registers, like so:

wire stall_F, bubble_F;
reg[63:0] pc;
reg[3:0] icode;
addPipelineControl(stall_F, bubble_F, pc, icode);

3.2.2.2.4 Functionality you don’t have to write

3.2.2.3 HDL implementation of a simple processor

3.2.3 Pipelining

3.2.3.1 Assembly Lines

3.2.3.2 Dependencies

3.2.3.2.1 Data vs Control

3.2.3.2.2 Hazards

3.2.3.2.3 Data forwarding

3.2.3.2.4 The need to Stall

3.2.3.2.5 Pipeline register control logic

3.2.3.3 Prediction and speculative execution

3.2.3.4 Limitations of pipelines

3.3 Memory Hierarchy

3.4 Peripherals and Busses

4 Kernels: Software for the Hardware

4.1 Kernel Mode vs. User Mode

All multi-purposed chips today have (at least) two modes in which they can operate: user mode and kernel mode. Thus far you have probably only written user-mode software, and most of you will never write kernel-mode software during your entire software development careers.

4.1.1 Motivation

Some activities the computer does can only be done by kernel-mode software; if user-mode code wants to do them, it has to do so by asking the kernel. This restriction provides several advantages.

4.1.1.1 Limiting Bugs’ Potential for Evil

One reason to not run in kernel mode is to limit the scope of mistakes a software bug can have. Inside the computer is all of the code and data that handles drawing the screen and tracking the mouse and reacting to keys and so on; I’d rather not be able to make a mistake when writing my Hello World program and accidentally mess up those core functionalities. By running in user mode, if my code tries to touch any of that it will be blocked from doing so: my program will crash, but the computer will keep running.

4.1.1.2 Each Program Thinks it is In Control

Physical computers have limited physical resources. One of the benefits of running in user mode is that the particular limitations are hidden from you by the kernel and hardware working together. For example, even if you only have 8GB of memory you can still use addresses larger than that limit: the kernel and hardware will work together to provide the illusion that there was enough memory even if there was not.

This illusion is even more important when you are running more than one process simultaneously. Perhaps you have a terminal, an editor, and your developed code all running at once. The kernel and hardware will work together to ensure that your code cannot even see, let alone modify, the memory being used by the terminal or editor. Each program is given the illusion of being the only one running on an unlimited machine.

4.1.1.3 Wrapper around Hardware

Each piece of hardware attached to a chip has its own protocols and nuances of operation. Fortunately, only the kernel needs to know about those details. If the mouse tries to talk to your program directly, the kernel interrupts the communication and handles it for you so you don’t notice it happened. If you try to talk to the disk, the kernel likewise moderates that communication, keeping you from needing to know the specific protocol the currently-attached disk expects.

4.1.1.4 Multiple Users

Because the kernel is an unavoidable intermediary for talking to hardware, it can chose to forbid some interactions. One common use of this is to allow multiple user accounts. Since the kernel is the one that controls access to the disk, for example, it can chose not to allow my processes to access the part of the disk it has assigned to you, not me.

4.1.2 Implementation

4.1.2.1 Protected Instructions and Memory

When in user mode, there are a set of instructions (including the change-mode instruction) that cannot be executed and segments of memory that cannot be accessed. Examples include the part of memory that stores information about which user account you have logged into and instructions that send signals out of the chip to communicate with disks, networks, etc.

4.1.2.2 Mode Bit

The simplest way to create two modes is to have a single-bit register somewhere on the chip that indicates if the chip is currently executing in user mode or in kernel mode. Each protected instruction then checks the contents of that register and, if it indicates user mode, causes and exception instead of executing the instruction. By making the change-mode instruction protected, the user mode code is thus prevented from executing instructions it should not execute and prevented from changing itself into kernel mode to bypass this protection.

One consequence of having a protected change-mode instruction is that, while kernel-mode code can change itself into user-mode code freely, user-mode code cannot change back to kernel-mode code on its own. Thus, we need a different way to handle becoming kernel-mode again.

4.1.2.3 Mode-switch via Exceptions

The core mechanic for becoming the kernel is a hardware exception. Exceptions are different from simply changing a mode bit because they also always jump to some kernel code. Thus, the only mechanisms that exist for entering kernel mode will running kernel code for the next instruction, preventing user code from running itself in kernel mode.

The nature of these hardware exceptions and the jump to kernel code that is associated with them is a large enough topic to deserve it’s own section.

4.2 Exceptions

Hardware exceptions are examples of exceptional control flow: that is, the sequence of instructions being operated around an exception is an exception to the usual rules of sequentiality and jumps. They also tend to appear in exceptional (that is, unusual) situations.

4.2.1 Kinds

There are several kinds of hardware exceptions.

4.2.1.1 Interrupts

An interrupt occurs independently from the code being executed when it occurs. It might result from the mouse moving, a network packet arriving, or the like.

Interrupts are typically handled by the kernel in a way that is invisible to the user code. The user code is frozen, the interrupt handled, and then the user code resumed as if nothing had happened.

4.2.1.2 Faults

A fault is the result of an instruction that in theory could have succeeded failing instead. Examples include dividing by zero, referencing a null pointer, or attempting to execute a protected instruction while in user mode.

There are two basic families of responses to faults: either the kernel can fix the problem, then re-run the user code that generated the fault; or it cannot fix the problem so it kills the process that cause the fault instead. In practice, fixable faults happen quite often, notably the page fault discussed later. Even faults the kernel can’t fix on its own are often handled by asking the user code if it knows how to fix them using a mechanism called signals, though many programs do not chose to handle the signals so the end result is often still termination of the fault-generating process.

4.2.1.3 Traps

A trap is an exception caused by a special instruction whose sole job is to generate exceptions. Traps are the main mechanism for intentionally switching from user mode to kernel mode and are the core of all system calls. System calls are used to interact with the file system, spawn and wait on threads, listen for key presses, and anything else that you cannot do with simple code alone.

4.2.2 Handling

Handling exceptions means switching to kernel mode and jumping to kernel code, potentially without any particular instruction causing the jump.

4.2.2.1 Save-Handle-Restore

The basic mechanism for any exception to be handled is

Save the current state of the processor (register contents, PC location, etc)
Enter kernel mode
Jump to code designed to react to the exception in question, called an exception handler
When the handler is finished executing, enter user mode and restore processor state (including register contents, PC location, etc)

These steps (except for the actual execution of the exception handler) are all done atomically by the hardware, not be software: while saving or restoring machine state the state is busy and no instructions are running.

4.2.2.2 Which Handler?

In general, there can be as many different exception handlers as there are exception causes. To select which one to run, the hardware consults something called an exception table. The exception table is just an array of code addresses; the index into that array is determined by the kind of exception generated (e.g., divide-by-zero uses index 0, invalid instructions use index 6, etc.) The index is called an exception number or vector number and the array of code addresses is called the exception table.

4.2.2.2.1 Switch

This idea of an array of code addresses is not unique to exception handlers; it is also present in C in the form of a switch statement which can be compiled to a long set of conditional jumps but is designed to turn into an array of addresses instead.

For example, the following C

switch(x) {
    case 0: f0();
    case 4: case 1: f1();
    case 2: f2();
    case 3: f3();
        break;
    case 5: f5();
    case 6: f6();
}

compiles down to the following code

    cmpl    $6, %eax
    ja      AfterSwitch
    jmpq    * Table(,%rax,8)
Case0:
    callq   f0
Case41:
    callq   f1
Case2:
    callq   f2
Case3:
    callq   f3
    jmp     AfterSwitch
Case5:
    callq   f5
Case6:
    callq   f6
AfterSwitch:

supported by the following jump table

    .section    .rodata,"a",@progbits
Table:
    .quad   Case0
    .quad   Case41
    .quad   Case2
    .quad   Case3
    .quad   Case41
    .quad   Case5
    .quad   Case6

Exception tables are just one particular use of this array-of-code-addresses idea.

4.2.2.3 Example: Linux system calls

In Linux, almost all communications from user- to kernel-code are handled by a trap with exception number 128. Instead of using exception number to select the particular action the user is requesting of the kernel, that action is put in a number in %eax; this decision allows Linux to have space for billions of different system calls instead of just the 200 or so that would fit in an exception number and are not reserved by the hardware.