
Saying Anything with Bits

Information theory asserts that any information can be represented in any format that contains enough bits. There are many interesting theorems and algorithms in that field about how this can best be done, but for most of computing we only need to consider a handful of simple encodings, described in the subsections below.

Numbering everything else

Not all data we care about is numeric. There are both non-numeric “primitive” types like characters and “aggregate” types made up of two or more values of other types.

When we want to introduce a primitive type, we need to define a sufficiently large number of bits and then create a mapping between each value of that type and a particular bit sequence. Since every bit sequence can be thought of as representing a number, we could call this process “enumeration,” but there does not need to be any obvious structure or meaning to how the bit patterns or their number meanings relate to the represented concept.

Character encoding

Text is pervasive in today’s world, and is thus often represented in binary. However, there is no one right way to represent characters as bits. Indeed, it is not easy to even decide what is a character and what is not; Unicode, one of the most extensive of the common character enumerations, has decided that R, ℛ, ℜ, and ℝ are all different characters, but that the letter R rendered in several different typefaces is still just one character.

Character encodings are complicated in part because there are so many of them and they have so many internal rules and special cases. However, each encoding is just a big table of associations like “The Angstrom sign (Å) is character number eight thousand four hundred ninety-one, or 10000100101011 in binary.” Designing a character encoding is a lot of work; detecting which one is being used can be tricky; and some use different numbers of bits for different characters, which adds an additional level of programming complexity. That said, most uses of characters can simply rely on some other piece of code to keep track of all the various decisions made by the encoding designers.
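As a small illustration, the sketch below (assuming a C compiler whose source and execution character sets are UTF-8, as gcc and clang use by default) prints both the number Unicode assigns to the Angstrom sign and the bytes UTF-8 actually stores for it:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *angstrom = "\u212B";      /* U+212B ANGSTROM SIGN, number 8491 */
    printf("code point: %d\n", 0x212B);   /* prints 8491, as described above */
    printf("UTF-8 bytes:");
    for (size_t i = 0; i < strlen(angstrom); i++)
        printf(" %02x", (unsigned char)angstrom[i]);
    printf("\n");                         /* prints e2 84 ab */
    return 0;
}
```

Note that the number assigned to the character (8491) and the bytes used to store it are different things; keeping track of that difference is exactly the kind of detail the encoding’s rules handle for you.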

Custom encodings

It is common for programmers to wish to represent a set of primitive values other than numbers and characters. For example, OpenGL decided that it wanted a type to represent “kind of shape” and gave it ten values: a group of points is 0, a group of line segments is 1, a ring of contiguous line segments is 2, an open path of contiguous line segments is 3, and so on.

These kinds of custom encodings are very common, and are generally called “enumerations” or “enums.” We’ll see more on how to create and use enums later in this course.
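As a preview, a custom encoding like the one described above could be written as a C enum, as in the sketch below (the names are illustrative, not copied from an actual OpenGL header):

```c
/* A sketch of a custom encoding as a C enum; each name is just a
   readable label for a small number chosen by the designer. */
enum shape_kind {
    SHAPE_POINTS     = 0,   /* a group of points                    */
    SHAPE_LINES      = 1,   /* a group of line segments             */
    SHAPE_LINE_LOOP  = 2,   /* a ring of contiguous line segments   */
    SHAPE_LINE_STRIP = 3    /* an open path of contiguous segments  */
    /* ... and so on for the remaining kinds of shape */
};

enum shape_kind k = SHAPE_LINE_LOOP;   /* stored in memory as the number 2 */
```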

Which comes first?

Processors handle numbers in binary. A binary number has a “high-order” bit and a “low-order” bit, with many bits in between. It does not intrinsically have a “first” or “last” bit; sometimes we think of the high-order bit as “first” (for example, when we write it in English, we write the high-order bit first) and sometimes as “last” (for example, when we do math we get to the high-order bit last). Fortunately, we don’t need to care about the first-last distinction because the processor doesn’t interact with numbers in that way.

In general, computer memory handles numbers in bytes, base-256. A multi-byte number has a “high-order” byte and a “low-order” byte, generally with several bytes in between. Again, “first” and “last” are not intrinsic when describing these numbers. However, memory does represent each byte as having a location, like an index into a list or array. As such, the processor has to decide, when breaking a larger number into bytes, whether it is going to put the high-order or low-order byte in the first spot in memory. Other parts of computers, such as disks and network drivers, also need to be told bytes in a first-to-last order.

Processors (and people) are not consistent in what decision they make. Some, like x86, x86-64, addition, and Arabic numerals in the original Arabic, put the low-order byte first and are called “little-endian”. Others, like MIPS, PowerPC, comparison, and Arabic numerals in most European languages, put the low-order byte last and are called “big-endian”. A few, like ARM and students in classes like this, understand both and can be configured to act in either a big-endian or little-endian mode.

In English (and most other languages today) we use base-10 and Arabic numerals; thus thirty-five is written 35. The high-order digit is on the left, the low-order digit is on the right. When you put Arabic numbers inside of English text it thus looks big-endian because English is read left-to-right. However, when you put the same number inside of Arabic text it looks little-endian because Arabic is read right-to-left.

It is important to note that endianness applies only when breaking a primitive value into pieces and putting those pieces in some arbitrary order. Endianness does not apply inside the processor, where numbers are stored in whole without an arbitrary order. Endianness also does not apply to naturally-ordered values like the elements of a list, as these are placed in their natural order on both big- and little-endian computers.

Exercise: Consider an array (a list stored contiguously in memory) containing two 16-bit numbers, [0x1234, 0x5678]. If this array were placed in memory at address 0x108, we’d find the following in memory:

address  little-endian  big-endian
0x108    34             12
0x109    12             34
0x10a    78             56
0x10b    56             78

Observe that the order of bytes within each number changes based on endianness, but that all of the following are unimpacted by endianness:

  • Each number occupies two bytes.
  • The address of a number is the address of its “first” byte, whichever byte that is in the endianness being used.
  • The elements of an array are in the same order in memory as they are in code.
  • The second element’s address is 2 bytes past the first, at 0x10a.
  • It is bytes, not nibbles or bits, that are placed in memory, so we still see byte 0x34 (0b00110100) not nibble-reversed 0x43 (0b01000011) nor bit-reversed 0x2c (0b00101100).
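If you want to see which choice your own machine makes, a small sketch like the one below stores the same two-element array and then inspects its memory one byte at a time (the addresses will be wherever the compiler happens to place the array, not 0x108, but the byte order will match one of the columns in the table above):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t a[2] = {0x1234, 0x5678};
    unsigned char *bytes = (unsigned char *)a;   /* view the same memory byte by byte */
    for (int i = 0; i < 4; i++)
        printf("%p: %02x\n", (void *)&bytes[i], bytes[i]);
    /* little-endian machines print 34 12 78 56; big-endian machines print 12 34 56 78 */
    return 0;
}
```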

Intentional redundancy

Although digital systems are much less sensitive to noise than analog systems are, sometimes noise still does impact the signal, and when it does the consequences are potentially quite large, particularly if the noise flips a high-order bit. In part because of this, it is common to include intentional redundancy in digital signals: information that could have been fully determined from the rest of the message. The recipient can recompute that redundant information and compare it with what actually arrived to check whether the message made it through in its intended form.

Detection versus correction

The simplest form of redundancy to design allows you to detect if a group of bits has been modified in small ways. The “small ways” caveat is important: there is no form of redundancy that is proof against large changes to the information, as the redundant data itself is encoded in bits and if those bits are changed to match the changes to other bits, the data will look consistent.

A single extra bit can be enough to detect that a bit got flipped due to noise. To be able to automatically correct that flipped bit in an n-bit signal requires at least log₂(n) extra bits, as anything less than that has insufficient information content to identify which bit needs correction.

Exercise: Measure the detection and correction ability of redundancy in a human language of your choice.

  1. Find a large corpus of example text (the larger the better)
  2. Write a program that repeatedly
    1. selects a random substring from the corpus, long enough to give a little context (perhaps 2 dozen words)
    2. randomly either (a) leaves it alone or (b) changes one word in the central half of what you show
    3. prompts the user to identify the text as unchanged or changed, and if changed, to identify which word was changed and what it was supposed to be
    4. tracks the number of correct and incorrect identifications and corrections

While correction may seem desirable, the easier goal of detection is often sufficient: in many cases the recipient of detected-bad data can simply ask the sender to transmit it again, repeating as needed until a good copy is obtained.

Parity, checksums, error-correction codes, and digests

There are many different techniques used to add structured redundancy to sets of bits. This section describes just a few.

Parity

The “parity” of a set of bits is 0 if an even number of the bits are 1s and 1 if an odd number of the bits are 1. Adding a parity bit (sometimes called a “check bit”) to a number is a simple way to make single-bit errors detectable.
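A parity bit is cheap to compute; one way to do it, sketched below, is to XOR together all the bits of the value:

```c
#include <stdint.h>

/* Parity of a byte: 0 if an even number of its bits are 1, 1 if odd. */
uint8_t parity(uint8_t x) {
    uint8_t p = 0;
    while (x) {
        p ^= x & 1;   /* flip the running parity once per 1 bit */
        x >>= 1;
    }
    return p;
}
```

A stored check bit agrees with the parity of the byte exactly when no single-bit error has occurred (more precisely, when an even number of bits have flipped).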

Exercise: The following table shows 1-byte values and their check bit. Identify which ones have a parity error.

Byte      Check  Has parity error?
00000000  0
01001101  1
11111000  0
10101010  0
00011100  1
01000010  1
11111111  1

Answers are in this footnote1.

Multiple parity
If you store the parity of multiple groups of bits, you can use the intersections of the groups that have errors to determine which bit had the wrong value and correct it. For example, given 11001010 we could compute the following parity bits:
Parity of is
11001010 0
1100     0
11  10   1
1 0 1 1  1

Changing any input bit will change a unique set of parity bits; for example, changing the second-most-significant bit will change the first three parity bits but not the fourth.

This is an example of how one could begin to design a correcting code, but it is not fully usable as is because an error in the parity bits themselves cannot be unambiguously identified and corrected. Fully correctable redundancy is more complicated to design; common designs include convolutional codes (based on Boolean polynomials) and Hamming codes (based on parity bits, but selected more carefully than the example above).
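One way to sketch the grouping in the table above: describe each group by a bit mask selecting which of the eight bits it covers, then compute the parity of each masked subset as before. The masks here correspond to the four rows of the table.

```c
#include <stdio.h>
#include <stdint.h>

/* Parity of the bits of x selected by mask. */
static uint8_t group_parity(uint8_t x, uint8_t mask) {
    uint8_t p = 0;
    for (uint8_t v = x & mask; v; v >>= 1)
        p ^= v & 1;
    return p;
}

int main(void) {
    uint8_t x = 0xCA;                              /* 0b11001010, the example value above */
    uint8_t masks[] = {0xFF, 0xF0, 0xCC, 0xAA};    /* the four groups in the table */
    for (int i = 0; i < 4; i++)
        printf("parity of group %d: %d\n", i, group_parity(x, masks[i]));
    /* prints 0, 0, 1, 1, matching the table */
    return 0;
}
```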

CRC
Many digital formats make use of a family of error-detection algorithms collectively known as Cyclic Redundancy Checks, or CRCs. The IEEE has standardized one (in IEEE 802.3), commonly called CRC32, that is used in common formats such as the PNG standard.

CRCs are “cyclic” in the sense that they can be applied to input of any length by repeating a simple procedure. This makes them one type of digest: an algorithm that accepts long inputs and produces small, fixed-size summary outputs. CRCs are similar to the other most common form of digest, the hash, but are generally optimized for speed and for detecting bit-level changes; other digests are used in security or memory-allocation contexts and have different design goals.
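As a sketch of what such a repeated procedure looks like, the bit-at-a-time loop below computes the CRC32 variant used by zlib and PNG; real implementations usually use lookup tables or hardware instructions for speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32 using the reflected polynomial 0xEDB88320.
   A sketch for illustration, not a tuned implementation. */
uint32_t crc32(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            if (crc & 1) crc = (crc >> 1) ^ 0xEDB88320u;
            else         crc >>= 1;
        }
    }
    return ~crc;
}
```

The usual sanity check for this variant is that the nine ASCII bytes "123456789" should produce 0xCBF43926.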

Checksum
“Checksum” is a generic term for any easily-defined, easily-computed value that depends on a larger message. Parity, CRC, hashes, and many others are all examples of checksums. The word itself suggests another simple approach: treat each byte as a number and sum them all up.
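A sketch of that literal reading of the word, keeping only the low 16 bits of the running total:

```c
#include <stddef.h>
#include <stdint.h>

/* One simple "checksum": add every byte, keeping only the low 16 bits of the sum. */
uint16_t byte_sum(const unsigned char *data, size_t len) {
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];   /* wraps around modulo 65536 */
    return sum;
}
```

Many real formats use something along these lines, though the number of bits kept and any final adjustments vary from format to format.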

There is not just one kind of checksum. Often when documentation refers to a “checksum” it means either that the format has its own custom way of producing the error-detecting redundancy or that the specific method used is unimportant in the context of the documented features.

Although error detection and correction through redundant data is important for the success of all kinds of digital information, and commonly implemented in everything from computer memory chips to archive file formats, it is relatively uncommon for a programmer to need to think about checksums in detail. It is usually enough to know that messages may contain some error-checking information and that recipients may inform providers that a message arrived in a corrupted state.

  1. In order: correct, error, error, correct, correct, error, error 


Copyright © 2022 John Hott, portions Luther Tychonievich.
Released under the CC-BY-NC-SA 4.0 license.