C (guide and reference)

This is intended to be a practical guide (rather than an authoritative guide) to C, as implemented by clang and gcc for the x86-64 processor family.

Data Types

The sizeof(...) operator returns the size of a type in bytes. Thus, sizeof(int) is 4, not 32.

Primitive

Integer

The integer data types are

name	bits	representation	notes
`_Bool`	1 or more	undefined	rarely used; for all types, `0` is false, anything else is true
`char`	8	signedness undefined	usually used for characters, sometimes for bytes
`signed char`	8	2’s complement
`unsigned char`	8	unsigned integer
`short`	16	2’s complement
`int`	32	2’s complement
`long`	32 or 64	2’s complement	32 bits if compiled with `-m32`, 64-bit if compiled with `-m64`, compiler’s choice if neither
`long long`	64	2’s complement

Each has an unsigned version (e.g., unsigned short, etc). If unsigned is used as a type by itself, it means unsigned int.

Integer literals will be implicitly cast to the correct type upon assignment; thus char x = -3 will turn -3 into an 8-bit value automatically, and int x = 'x' will turn 'x' into a 32-bit value. This only works up to int-sized literals.

To force a literal to be long, add an l or L to the end; to force it to be unsigned, add a u or U. This is generally only needed for very large constants, like unsigned long very_big = 9223372036854775808uL.

Character literals are integer literals written with a different syntax. There is no significant difference between '0' and 48 other than legibility.
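
Because character literals are just integers, ordinary arithmetic works on them. The following is a minimal sketch of that idea; the helper name digit_value is made up for illustration, and the '0' == 48 check assumes an ASCII system such as x86-64 Linux.

```c
#include <assert.h>

/* Hypothetical helper: numeric value of a decimal digit character.
 * C guarantees '0' through '9' are consecutive, so subtraction works. */
int digit_value(char c) {
    return c - '0';
}

void check_digit_value(void) {
    assert('0' == 48);              /* true on ASCII systems */
    assert(digit_value('7') == 7);
    assert('a' + 1 == 'b');         /* also ASCII-specific */
}
```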

Floating-point

The floating-point datatypes are

name	exponent bits	fraction bits	total size	literal syntax
`float`	8	23	32 bits (4 bytes)	`3.1415f` (`f` or `F` for `float`)
`double`	11	52	64 bits (8 bytes)	`3.1415` (no suffix)
`long double`	15	64	80 bits (10 bytes)	`3.1415L` (`l` or `L` for `long`)

Note that long double has traditionally only differed from double on x86 architectures.

Enumerations

The enum keyword is a special way of defining named integer constants, in ascending order unless otherwise specified.

enum { a, b, c, d=100, e };
/* a is 0, b is 1, c is 2, d is 100, and e is 101 */

int f = e; /* equivalent to f = 101 */

void and casting

There is also a special void type that means either “a byte with no known meaning” (if used as part of a pointer type) or “nothing at all” (if used as a return type or parameter list).

Casting between integer types truncates (if going smaller) or zero- or sign-extends (if going larger, depending on the signedness of the value) to fit the available space. Casting to or from floating-point types converts to a nearby representable value (which may be infinity), with the exception that casting from a floating-point type to an integer type discards the fractional part (rounds toward zero) instead of rounding to the nearest integer.
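
The casting rules can be checked with a few small functions. This is only a sketch: the function names are made up for illustration, and the asserted values assume 8-bit chars and 32-bit ints, as on x86-64.

```c
#include <assert.h>

/* Narrowing to an unsigned type keeps only the low-order bits. */
unsigned char low_byte(int v) { return (unsigned char)v; }

/* Widening a signed value sign-extends; widening an unsigned value zero-extends. */
int widen_signed(signed char v)     { return (int)v; }
int widen_unsigned(unsigned char v) { return (int)v; }

/* Floating-point to integer discards the fractional part. */
int to_int(double v) { return (int)v; }

void check_casts(void) {
    assert(low_byte(0x1234) == 0x34);
    assert(widen_signed(-1) == -1);        /* 0xff becomes 0xffffffff */
    assert(widen_unsigned(0xff) == 255);   /* 0xff becomes 0x000000ff */
    assert(to_int(3.99) == 3);
    assert(to_int(-3.99) == -3);           /* toward zero, not toward -infinity */
}
```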

Pointers

For every type, there is a type for a pointer to a value of that type. These are written with a * after the type:

int *x;     /* points to an int */
char *s;    /* points to a char */
float **w;  /* points to a pointer that points to a float */
float ***a; /* points to a pointer that points to a pointer that points to a float */

A pointer to any value stored in memory can be taken by using the address-of operator &. Thus &x is the address of the value stored in x, but &3 is an error because 3 is a literal and does not have an address. You also can’t take the address of the result of an expression: &(x + y) or &&x are both errors as well.

You de-reference pointers with the same syntax used to declare them: a * before the variable.

int *x = &z;     /* x = pointer to z */
int x1 = *x;     /* x1 == z */

You can also de-reference pointers with subscript notation; *x and x[0] are entirely equivalent, as are *(x + n) and x[n].

There is syntactic ambiguity when combining * and []. Is *a[1] the same as *(a[1]) or (*a)[1]? This is solved by operator precedence ([] before *), but is not intuitive to most programmers, so you should always use parentheses in these cases.
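
To make the precedence rule concrete, here is a small sketch using an array of two pointers (the variable names are chosen only for this example):

```c
#include <assert.h>

void check_precedence(void) {
    int row0[3] = {1, 2, 3};
    int row1[3] = {4, 5, 6};
    int *a[2] = {row0, row1};   /* an array of 2 (int *)s */

    assert(*a[1]   == 4);  /* parsed as *(a[1]): the first int of row1 */
    assert(*(a[1]) == 4);
    assert((*a)[1] == 2);  /* the second int of row0 */
}
```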

All pointers are the same size (the size of an address in the underlying ISA) regardless of the size of what they are pointing to; thus sizeof(char *) == sizeof(long double *). Two special int types¹ are used to be “an integer the size of a pointer”: size_t is an unsigned integer of this size, and ssize_t is a signed integer of this size. With the compilers and ISAs we are using this semester size_t is the same as unsigned long and ssize_t is the same as long.

When you add an integer to a pointer, the address stored in the pointer increases by a multiple of the sizeof the pointed-to type.

int x = 10;
int *y = &x;                    // y points to x
int *z = y + 2;                 // z points 2 ints after x
long w = ((long)z) - ((long)y); // w is 8, not 2.

Compound Types

There are two basic compound types in C: the struct and the array.

Array

An array is zero or more values of the same type stored contiguously in memory.

int array[1000];             /* an array of 1000 int values */

Except when used with sizeof and &, arrays act exactly like pointers to their first element; notably, this means that array[23] does what you expect it to do: access the 24th element of the array.

The sizeof an array is the total bytes used by all elements of the array:

unsigned x = sizeof(array);  /* 4000: sizeof(int) * 1000    */

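The & of an array is the same address as the & of its first element (i.e., &array and &(array[0]) refer to the same location), although the two pointers have different types: &array is a pointer to the whole array, while &(array[0]) is a pointer to a single element.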
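
One practical consequence is sketched below: sizeof can count the elements of an array, but only where the array itself (not a pointer to it) is visible. The function name sum is an illustrative choice, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* A function receives only a pointer to the first element,
 * so the length has to be passed separately. */
long sum(const int *values, size_t count) {
    long total = 0;
    for (size_t i = 0; i < count; i += 1) total += values[i];
    return total;
}

void check_array_sizeof(void) {
    int array[5] = {1, 2, 3, 4, 5};
    int *p = array;                        /* array acts like &array[0] */

    size_t count = sizeof(array) / sizeof(array[0]);
    assert(count == 5);
    assert(sizeof(p) == sizeof(int *));    /* size of a pointer, not of the array */
    assert(p[2] == array[2]);
    assert(sum(array, count) == 15);
}
```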

Parentheses are allowed when declaring types, although their meaning is counter-intuitive to many students:

char *pc[10];     /* an array of 10 (char *)s */

char *(pc[10]);   /* an array of 10 (char *)s */

char (*pc)[10];   /* a pointer to an array of 10 (char)s */

The rule here is that we declare variables exactly as we would use them: a pointer to an array would first be dereferenced ((*pc)) and then indexed ((*pc)[i]) to get a char so we declare it as char (*pc)[10].

Array literals use curly braces and commas.

int x[10] = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55};

Unless initialized with a literal like this, the contents of an array are undefined when created (i.e., they may be whatever values the compiler finds most efficient to leave there).

Arrays cannot be resized after being created.

Struct

A struct also stores values contiguously in memory, but the values may have different types and are accessed by name, not index.

struct foo {
    long a;
    int b;
    short c;
    char d;
};           /* note the ; at the end; it is REQUIRED! */

The name of the resulting type includes the word struct:

struct foo x;
unsigned long y = sizeof(struct foo);
x.b = 1234;
x.a = x.b - 5;

Compilers are free to lay out the data elements of a structure with padding between elements if they wish; this is often done in practice because memory tends to be faster when the address of a 4-byte value is a multiple of 4, so in the above example we expect y to have a value larger than the minimal 15 bytes needed to store the fields of struct foo.
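
The standard offsetof macro (from stddef.h) reports where each field actually landed. The sketch below repeats the struct foo definition so that it stands alone; foo_padding is a made-up helper name. With gcc or clang on x86-64 one would typically expect sizeof(struct foo) to be 16 (one byte of padding at the end), but the asserts check only properties that do not depend on that exact figure.

```c
#include <assert.h>
#include <stddef.h>  /* offsetof, size_t */

struct foo {
    long a;
    int b;
    short c;
    char d;
};

/* Bytes of padding in struct foo: total size minus the sizes of its fields. */
size_t foo_padding(void) {
    return sizeof(struct foo)
         - (sizeof(long) + sizeof(int) + sizeof(short) + sizeof(char));
}

void check_layout(void) {
    assert(offsetof(struct foo, a) == 0);            /* the first field starts the struct */
    assert(offsetof(struct foo, b) >= sizeof(long)); /* fields keep declaration order */
    assert(sizeof(struct foo) % sizeof(long) == 0);  /* size is rounded up for alignment */
}
```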

Structures are passed by value; that is, using them as arguments, return types, or with = means that all of their fields are copied. This is inefficient for all but the smallest structs, so often pointers to structures are passed, not the structures themselves.

Because all pointers are the same size, you can have code use a pointer to a struct without knowing what is inside the struct; the fields only need to be known for sizeof and the . operator to work, not for parameter passing.

struct baz;                  /* just says "a struct of this name exists"   */
void swizzle(struct baz *p); /* just says "a function of this name exists" */

/* Swizzles an array of struct bazs                           *
 * This code does not need to understand what a struct baz is */
void swozzle(struct baz **x, int n) {
    for(int i=0; i<n; i+=1) swizzle(x[i]);
}

Structure literals are written using curly braces and commas, optionally with .fieldname = prefixes:

struct a {
    int b;
    double c;
};

/* Both of the following initialize b to 0 and c to 1.0 */
struct a x = { 0, 1.0 };
struct a y = { .b = 0, .c = 1.0 };

Unless initialized with a literal like this, the values of the fields of a struct are undefined when created (i.e., they may be whatever values the compiler finds most efficient to leave there).

typedef

You can give new names to any type by using the typedef statement:

typedef int Integer;
Integer x = 23;

typedef double ** dpp;
double y0 = 12.34;
double *y1 = &y0;
dpp y = &y1;

struct foo { int x; double y; };
typedef struct foo foo;
foo z;
z.x = x;
z.y = **y;

typedef type names are aliases to the old names; the compiler will treat both the original and new name as equivalent in all type checking.

Sometimes typedef is used with anonymous structs:

typedef struct { int x; double y; } foo;
foo z;
z.x = 3;

Anonymous structs can also be used directly as type names, though that’s quite uncommon:

struct { int x; double y; } z;
z.x = 3;

Union

A union is like a struct, except that all of the fields are stored in the same memory address. In practice, this means only one of them has a meaningful value at a time.

union odd {
    long long i;
    double d;
};

union odd x;
x.i = 0x1234;    /* x's memory now contains 34 12 00 00 00 00 00 00 */
double y = x.d;  /* y is now 2.30235e-320 (those same bytes) */

x.d = 0x1234;       /* x's memory now contains 00 00 00 00 00 34 b2 40 */
long long z = x.i;  /* z is now 0x40b2340000000000 (those same bytes) */
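
The same trick can be wrapped in a function to inspect the bits of any double. This is a sketch only: double_bits is a made-up name, and the asserted values assume IEEE 754 doubles and a 64-bit long long, as on x86-64.

```c
#include <assert.h>

union odd {
    long long i;
    double d;
};

/* Returns the 64 bits that represent d, reinterpreted as an integer. */
long long double_bits(double d) {
    union odd u;
    u.d = d;
    return u.i;
}

void check_double_bits(void) {
    assert(double_bits(0.0) == 0);
    assert(double_bits(1.0) == 0x3ff0000000000000LL);
    assert(double_bits(4660.0) == 0x40b2340000000000LL);  /* 4660 == 0x1234 */
}
```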

You can do bad things

C does not try to prevent you from doing bad things.

float x = 123.567f;   /* A floating-point number */
int y = *((int *)&x); /* An integer made from the same bytes as the floating-point number */

int z[4];             /* An array of 4 integers */
int w = z[254];       /* An integer made from the contents of memory 1000 bytes after the end of z */

const char *s = "hi"; /* compiler makes the string in memory the OS won't allow us to change */
char *t = (char *)s;  /* we get a pointer to that memory that C will allow us to change */
t[0] = 'H';           /* we try to change that memory (the OS will crash our program) */

C’s general attitude is “every rule has an exception” and “the programmer knows best”. It might make you do some complicated casting to do things, but it won’t stop you if you are determined.

Constant

If a type is preceded by const, the compiler is free to perform optimizations that assume that no code will ever change the values of this type after they are first initialized.

As a special syntax, a string literal like "hello" does two things:

It ensures there exists somewhere an array of characters {'h', 'e', 'l', 'l', 'o', 0}, typically in read-only memory.
- Note the 0 at the end (that’s byte-0 not character-0). This is how C knows the string is over.
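It evaluates to a pointer to the h. Strictly speaking, the literal is an array of char that acts like a char *, but because changing its contents is not allowed, it is best treated as a const char *.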
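
The terminating 0 byte is what lets code find the end of a string. As an illustration, here is a sketch of a hand-written equivalent of the standard strlen function (my_strlen is a made-up name; real code should just call strlen from string.h):

```c
#include <assert.h>
#include <stddef.h>

/* Counts the bytes before the terminating 0 byte. */
size_t my_strlen(const char *s) {
    size_t n = 0;
    while (s[n] != 0) n += 1;
    return n;
}

void check_my_strlen(void) {
    assert(my_strlen("hello") == 5);   /* the 0 byte is not counted... */
    assert(sizeof("hello") == 6);      /* ...but it does occupy memory */
    assert(my_strlen("") == 0);
}
```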

Storage Classes

Variable declarations can optionally have one or more storage classes. The most important of these are:

const

Important enough to have its own section: Constant

static inside a function

Declaring a local variable as static makes it a global variable that only that one function can see.

Example:

#include <stdio.h>

int prev(int y) {
    static int q = 0; // only initialized once
    int ans = q;
    q = y;
    return ans;
}
int main() {
    printf("%d\n", prev(3));
    printf("%d\n", prev(1));
    // q = 5; // <- error, only prev can access q
    printf("%d\n", prev(4));
    printf("%d\n", prev(1));
}

prints

0
3
1
4

If you need a function to have persistent state, a static local is the preferred way to do that.

static for a global

Declaring a global variable as static makes it visible only inside that .c file.

If you need a few related functions to share some persistent state, a static global is the preferred way to do that. If just one function needs the state, use a static local instead.
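
As a sketch of that pattern, the following hypothetical .c file keeps a counter that only its own three functions can touch (all names here are invented for the example):

```c
#include <assert.h>

/* Visible only inside this .c file. */
static int counter = 0;

void counter_increment(void) { counter += 1; }
int  counter_value(void)     { return counter; }
void counter_reset(void)     { counter = 0; }

void check_counter(void) {
    counter_reset();
    counter_increment();
    counter_increment();
    assert(counter_value() == 2);
}
```

Code in other .c files can call these functions (given their declarations) but cannot name counter directly.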

extern for a global

Declaring a global variable as extern tells the compiler “assume it exists, don’t create it. It will be supplied by a different .c file. Connect the two during linking.”

In general, globals are declared as extern in .h files and re-declared without extern in exactly one .c file.

If you need many functions to all access the same persistent state, use a non-static global declared as extern everywhere it is used except in one .c file. Non-static globals are associated with various bugs, so you should prefer a static global or static local instead where possible.

register, volatile, and restrict

These are hints to the compiler. They have no direct impact on code operation, but can make some optimizations work better.

register suggests that this variable be stored in a register, not in memory, and is ignored by many compilers.
volatile tells the optimizer to assume the variable is being changed by something other than the code itself and blocks certain kinds of optimizations that might otherwise assume differently.
restrict can only be used with pointers, and tells the compiler it is safe to assume that no other pointer describes the same memory as this pointer, enabling certain optimizations that might otherwise not be permitted.

Control constructs

Braces and scope

Any statement may be replaced with a sequence of statements inside braces. Variables declared inside a set of braces vanish at the end of those braces.

int x;
{
    int y;
    x = y;  /* OK, both x and y in scope */
}
y = x; /* ERROR: y is no longer in scope */

Flow of control

Nice and common ones

if

Any statement may be preceded by if ( ... ); the statement will only be executed if the expression inside the parentheses yields a non-zero value.

Any statement following a statement preceded by if ( ... ) may be preceded by else; the statement will only be executed if the expression inside the if’s parentheses yields a zero value.

while

Any statement may be preceded by while ( ... ); the statement will only be executed if the expression inside the parentheses yields a non-zero value, and will continue to be executed until that condition stops being true.

for

The special construct for (e1; e2; e3) s; is equivalent to the following:

{
    e1;
    while (e2) {
        s;
        e3;
    }
}

with a slight twist: if s contains a continue, it jumps to e3 instead of to while (e2).

If e2 is omitted, it is assumed to be 1, so for(;;) s; repeats s forever.
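
The continue twist is easiest to see by writing the same loop both ways. In this sketch (function names invented for the example), the while version has to repeat the increment before continue, or it would never advance past an even i:

```c
#include <assert.h>

/* Sum of the odd numbers below n, written with for. */
int sum_odd_for(int n) {
    int total = 0;
    for (int i = 0; i < n; i += 1) {
        if (i % 2 == 0) continue;   /* jumps to i += 1 */
        total += i;
    }
    return total;
}

/* The same loop written with while. */
int sum_odd_while(int n) {
    int total = 0;
    int i = 0;
    while (i < n) {
        if (i % 2 == 0) { i += 1; continue; }  /* continue jumps to the test */
        total += i;
        i += 1;
    }
    return total;
}

void check_loops(void) {
    assert(sum_odd_for(10) == 25);   /* 1 + 3 + 5 + 7 + 9 */
    assert(sum_odd_while(10) == 25);
}
```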

Ugly and uncommon ones

do-while

The syntax do s; while (e); means the same as s; while (e) s;: that is, it always does s once before first checking e. In my experience, this is used for less than 1% of loops.
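
One case where do-while fits naturally is counting decimal digits, sketched below with a made-up function name: the body has to run at least once so that 0 is reported as having one digit.

```c
#include <assert.h>

/* Number of decimal digits needed to write n. */
int count_digits(unsigned n) {
    int digits = 0;
    do {
        digits += 1;
        n /= 10;
    } while (n != 0);
    return digits;
}

void check_count_digits(void) {
    assert(count_digits(0) == 1);
    assert(count_digits(9) == 1);
    assert(count_digits(12345) == 5);
}
```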

label and goto

Any line of code may be preceded by a label, which is an identifier followed by a colon (e.g. some_label:).

The goto some_label; statement unconditionally jumps to the code identified by that label.

In 1968 Edsger Dijkstra wrote an article “Go To Statement Considered Harmful”. Since then, the use of goto in code has dropped significantly; it’s now usually a sign either of over-emphasis on optimization or a shim to avoid having to redesign poorly-organized code. However, there are a few situations where it can be handy, so it does sometimes show up in high-quality code.
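
One such situation, in our reading, is error handling that must release several resources: every failure path jumps to a single cleanup label. The function below is a hypothetical sketch of that pattern, not code from any particular project.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocates two buffers of n bytes, copies one to the other,
 * and returns 0 on success or -1 on failure. */
int copy_through_buffers(size_t n) {
    int result = -1;
    char *first = NULL;
    char *second = NULL;

    first = malloc(n);
    if (first == NULL) goto cleanup;
    second = malloc(n);
    if (second == NULL) goto cleanup;

    memset(first, 'a', n);
    memcpy(second, first, n);
    if (memcmp(first, second, n) == 0) result = 0;

cleanup:
    free(second);   /* free(NULL) does nothing, so this is safe on every path */
    free(first);
    return result;
}

void check_cleanup(void) {
    assert(copy_through_buffers(16) == 0);
}
```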

switch

The switch statement in C may be implemented in several ways by the compiler, but it is designed to be a good match for the “jump table” approach.

The syntax of the switch is as follows:

switch(i) {
    case 0:
        statements;
        break;
    case 1:
        statements;
        break;
    case 3:
        statements;
        break;
    case 4:
        statements;
        break;
    default:
        statements;
}

Conceptually, this is

a block of code
with multiple labels
where the labels are numbered, not named

and it operates like the (invalid) code

c_code *targets[5] = { (case 0), (case 1), (default), (case 3), (case 4) };
if (0 <= i && i < 5) goto targets[i];
else goto default;

The break (as with a break in a loop) stops running the code block and goes to the first statement after it.

Many people think of a switch as being a nice way to write a long if/else if sequence, and are then annoyed by its limitations and quirks: it has to have an integer selector (as this is really an index), and it “falls through” to the next case if there is no break. Hence the following example, taken from Wikipedia:

switch (age) {
  case 1:  printf("You're one.\n");              break;
  case 2:  printf("You're two.\n");              break;
  case 4:  printf("You're four.\n");
  case 5:  printf("You're four or five.\n");     break;
  default: printf("You're not 1, 2, 4 or 5!\n");
}

Because many programmers make mistakes with switch, it is common to see it banned by style guides or restricted to a particular style. Later languages often reuse the syntax for constructs a jump table cannot handle, and mostly C-compatible languages often add rules like “each case must either end with break or with an explicit fallthrough/goto case”.

Most compilers have several different implementations of switch they can pick between; they might use a jump table, a sequence of if/else ifs, a binary search, etc.

Example: We can change a switch into gotos as follows

switch (age) {
  case 1:  printf("You're one.\n");              break;
  case 2:  printf("You're two.\n");              break;
  case 4:  printf("You're four.\n");
  case 5:  printf("You're four or five.\n");     break;
  default: printf("You're not 1, 2, 4 or 5!\n");
}

void *locations[] = {&&L1, &&L2, &&L5, &&L3, &&L4}; /* age 3 has no case, so it goes to default (L5) */
if (age < 1 || age > 5) goto L5;
else goto *locations[age-1];
if (0) {
  L1: printf("You're one.\n");
      goto L6;
  L2: printf("You're two.\n");
      goto L6;
  L3: printf("You're four.\n");
  L4: printf("You're four or five.\n");
      goto L6;
  L5: printf("You're not 1, 2, 4 or 5!\n");
  L6: ;
}

Note that the &&label_name syntax is a GCC and Clang extension and not part of the official C language.

Functions

Most common use

The most common use of functions in C looks much like you are used to from other languages: a return type, a name, a list of typed parameters in parentheses, and a body in braces.

void baz(int i, char *b, float c) {
    b[i] = (char)c;
    return;
}

It is also common to declare functions before defining them, in part because C requires functions to be declared before use.

int is_even(unsigned n);
int is_odd(unsigned n);

int is_even(unsigned n) {
    if (n == 0) return 1;
    else        return is_odd(n - 1);
}

int is_odd(unsigned n) {
    if (n == 0) return 0;
    else        return is_even(n - 1);
}

Often the declarations or function headers are put in a separate file, called a header file and traditionally named with the suffix .h. The #include directive can thus grab all of these at once, simplifying coding without increasing the size of the resulting .c file or the compiled binary.

Syntax variations

C allows several variations to standard function syntax. Most of these are considered bad programming style.

Function return types can be omitted, defaulting to int:

min(int a, int b) { return a < b ? a : b; }

Function parameter types can be omitted, defaulting to int:
```
min(a, b) { return a < b ? a : b; }
```
Function parameter types can be specified between the ) and the {:
```
void baz(i, b, c)
int i;
char *b;
float c;
{
    b[i] = (char)c;
    return;
}
```
Technically, this does something called “promotion” and has various quirks; for this and other reasons it is often called “old-style” and generally discouraged.

A zero-argument function can be written as either

int three() {
    return 3;
}

int three(void) {
    return 3;
}

The main function (only) will return 0 if it is missing a return, and may omit its arguments upon definition.

It’s all convention

C passes arguments using a calling convention. This is obeyed blindly by both the caller and the callee; so if the caller thought the callee had different argument types than it did, neither will notice they have a problem; they’ll just silently do the wrong thing.

Example: Consider the following pair of files:

baz.c

long bar(char *);

/******* adds bar("hello") to its argument *******/
long baz(long x) {
    return bar("hello") + x;
}

bar.c

/** returns the requested suffix of "ten letter" */
char *bar(long x) {
    char *c = "ten letter";
    return c + (x%10);
}

When executed,

baz will put the address of the first character of "hello" into the %rdi register and then callq bar.
bar will look in %rdi for an integer, modulo it by 10, and use it to put an address of a character in the string "ten letter" into %rax
baz will look in %rax for an integer, add x to it, and return

This is almost certainly not what was wanted, but no part of it violates the rules.

Variadic functions

The number of arguments in a function is known as the function’s arity. Many functions have fixed arity, requiring the same number of arguments each time they are invoked, but sometimes it is nice to have a function that has variable arity, or a variadic function.

In C, when invoking a function of variable arity the invoking code simply follows the calling convention, putting some arguments in registers and others on the stack. The invoked function then needs to know how many arguments it received. Since it can’t tell anything without consulting at least one argument, all variadic functions in C require at least one argument, and almost all use that argument to decide how many (and what type) the other arguments are.

By far the most famous variadic function in C is printf, which is defined as

int printf(const char *format, ...);

Note the trailing ... means “this is a variadic function.” Thus, printf may be invoked with any arguments you want, as long as the first is a const char * (that is, a string):

printf("%s, %s %d, %.2d:%.2d\n", weekday, month, day, hour, min);

The printf function uses fairly involved rules about the % specifiers in its first argument to determine how many and what type the other arguments should be.

Writing a variadic function is somewhat complicated by the fact that the extra arguments do not have names. C provides (declared in stdarg.h) a special data type va_list and a set of special macros to use in accessing variadic arguments.

void va_start(va_list ap, argN);
type va_arg(va_list ap, type);
void va_end(va_list ap);

To use these, you might do something like

int sign_swaps(int num0, ...) {
    va_list ap;
    int last = num0;
    int ans = 0;

    va_start(ap, num0);
    while(last != 0) {
        int next = va_arg(ap, int);
        if ((last < 0) != (next < 0)) ans += 1;
        last = next;
    }
    va_end(ap);

    return ans;
}
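
The example above stops at a 0 argument. The other common design has the first argument state how many arguments follow; a minimal sketch (sum_ints is a made-up name) looks like this:

```c
#include <assert.h>
#include <stdarg.h>

/* Adds up the count int arguments that follow count. */
int sum_ints(int count, ...) {
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i += 1) {
        total += va_arg(ap, int);
    }
    va_end(ap);

    return total;
}

void check_sum_ints(void) {
    assert(sum_ints(0) == 0);
    assert(sum_ints(3, 10, 20, 30) == 60);
}
```

Nothing checks that the caller's count matches the arguments actually passed; a count that is too large makes va_arg read arguments that were never supplied.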

If you want to write variadic functions, you should

Read all of man stdarg.h
Read all of man stdarg.h again, because you almost certainly missed something important
Look up variadic security vulnerabilities like the format string attack
Write good tests, including too-few- and too-many- and wrong-type-argument invocations.

Preprocessor

Before compilers compile code, they run the C Preprocessor. This does several uninteresting tasks like removing comments, but also processes various macros and directives.

`#include <somefile.h>`

Looks for somefile.h in the include path, a set of directories typically including /usr/include and sometimes a few others.

Upon finding the file, it dumps its entire contents into this part of the file, as if you had copy-pasted it here.

`#include "somefile.h"`

Looks for somefile.h in the current source directory and, if not found there, in the include path.

Upon finding the file, it dumps its entire contents into this part of the file, as if you had copy-pasted it here.

`#if expression`, `#else`, `#elif expression`, and `#endif`

Upon encountering an #if, the preprocessor evaluates the truth of the expression, which must contain only literals (and operators) because the preprocessor is not running code. If it is false, all code from that #if to the matching #else, #elif, or #endif is removed from the source code as if you had deleted it.

#else and #elif expression behave like else and else if (expression) would in C.
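
As a sketch, the following selects one of three function bodies at compile time. LOG_LEVEL and log_level_name are hypothetical names; LOG_LEVEL is an object-like macro that the preprocessor replaces with the literal 2 before evaluating each expression.

```c
#include <assert.h>
#include <string.h>

#define LOG_LEVEL 2   /* hypothetical setting */

const char *log_level_name(void) {
#if LOG_LEVEL == 0
    return "silent";
#elif LOG_LEVEL == 1
    return "errors";
#else
    return "verbose";
#endif
}

void check_log_level(void) {
    /* Only the #else branch survives preprocessing. */
    assert(strcmp(log_level_name(), "verbose") == 0);
}
```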

`#define NAME anything at all`

Defines an object-like macro. Anywhere NAME appears in the source code, this tells the preprocessor to replace it with anything at all—literally those exact tokens, as if you had done a global find-and-replace in your source file.

`#define NAME(a,b,c) anything including a and b and c`

Defines a function-like macro. Anywhere NAME(x,y,z) appears in the source code, this tells the preprocessor to replace it with anything including x and y and z—that is, it does a find-and-replace with some parameterization.

This is a lexical replacement, not a syntactic one, so you should almost always add parentheses around each argument and around the full expression:

#define TIMES2(x)  x * 2        /* bad practice */
#define TIMES2b(x) ((x) * 2)    /* good practice */

int x = ! TIMES2(2 + 3);   /* int x = ! 2 + 3 * 2;      (i.e., !2 + 6 == 6) */
int y = ! TIMES2b(2 + 3);  /* int y = ! ((2 + 3) * 2);  (i.e., !10 == 0) */

If you decide to become a C expert, there is more to know about macros; see https://en.wikipedia.org/wiki/C_preprocessor#Special_macros_and_directives for a reasonable overview.

`#ifdef NAME`, `#ifndef NAME`

These act like #if, except instead of checking if something is true they check if a name has been #defined (#ifdef) or not (#ifndef).

A very common use of these macros is to ensure only one copy of an .h file is included. For example, my_file.h might look like

#ifndef __MY_FILE_HAS_BEEN_INCLUDED__
#define __MY_FILE_HAS_BEEN_INCLUDED__

/* file contents here */

#endif

This way, if a file includes my_file.h twice (for example, because it includes two other .h files that each #include my_file.h), then the first inclusion will define __MY_FILE_HAS_BEEN_INCLUDED__ and the second, seeing that __MY_FILE_HAS_BEEN_INCLUDED__ is already defined, will have all its contents removed by the #ifndef.

Example: If we have something like

foo.h

#ifndef __FOO_H
#define __FOO_H
int foo;
#endif

foo.c

int x;
#include "foo.h"
int y;
#include "foo.h"
int z;

the #include processing will create

int x;
#ifndef __FOO_H
#define __FOO_H
int foo;

#endif
int y;
#ifndef __FOO_H
#define __FOO_H
int foo;

#endif
int z;

The first #ifndef is true, since __FOO_H had not been defined before that line:

int x;
#define __FOO_H
int foo;

int y;
#ifndef __FOO_H
#define __FOO_H
int foo;

#endif
int z;

That means that the second #ifndef is false, since the first defined __FOO_H:

int x;
#define __FOO_H
int foo;

int y;
int z;

`__FILE__` and `__LINE__`

The preprocessor is guaranteed to define __FILE__ as an object-like macro expanding to the name of the current file, in quotes, like "my_file.c". The preprocessor is also guaranteed to define __LINE__ as an object-like macro expanding to the line number on which __LINE__ appears, like 23.

These are often used in debugging messages, as e.g. printf("Error in %s on line %d\n", __FILE__, __LINE__);.
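
A common refinement is to wrap these in a function-like macro so that each call site reports its own location. The names format_location and DESCRIBE_HERE below are invented for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "file:line: message" into buf. */
int format_location(char *buf, size_t size,
                    const char *file, int line, const char *message) {
    return snprintf(buf, size, "%s:%d: %s", file, line, message);
}

/* Because this is a macro, __FILE__ and __LINE__ expand where it is used,
 * not where format_location is defined. */
#define DESCRIBE_HERE(buf, size, message) \
    format_location((buf), (size), __FILE__, __LINE__, (message))

void check_location(void) {
    char buf[256];
    DESCRIBE_HERE(buf, sizeof(buf), "something went wrong");
    assert(strstr(buf, ": something went wrong") != NULL);
}
```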

Because the preprocessor redefines these on its own on each new line of code, there is a special #line directive to change them if you need to do that (rather than the usual #define). The #include processing and comment removal add such #line directives so that they do not change source line numbers.

`#error "error message"`

Shows error message as an error message during compilation.

Most C compilers add several other compiler-specific preprocessor directives, like #warning, #pragma message, #pragma once, #include_next, #import, etc. Each is added to simplify some common task, but also makes code harder to port to other platforms.

¹ Defined using typedef: size_t in <stddef.h>, and ssize_t (a POSIX type) in <sys/types.h>.