Data on a computer

A few tools that can express almost anything.

Under the hood, computers access “memory.” Memory, in computing jargon, is a gargantuan list of bytes. (Computer jargon is mostly based on English, not Latin. “Memory” is computer jargon for a particular class of ways of creating a gargantuan list of bytes. When speaking with someone about computing in English I might say “my memory’s not what it once was; how large was memory, again?” and assume they can tell that the first “memory” in that phrase was English while the second was jargon. This can be confusing to newcomers to the field.) Very early in computer design, different hardware used bytes of different sizes, but today all of them use octets. An octet is a number between 0 and 255 (inclusive). Computer hardware has built-in commands to access several consecutive octets as if they were a multi-digit base-256 number: for example, [103][221] = 103 × 256 + 221 = 26,589. Hardware also has built-in ways to treat numbers as signed integers (e.g., treating a two-octet number as a value between −32,768 and 32,767 instead of as a value between 0 and 65,535) or as non-integer numbers (using something like scientific notation).
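As a small illustration of these readings, here is a sketch in Python (a language picked for this example; the text itself does not depend on it). A `bytes` object stands in for a few octets of memory, and the sketch assumes the most significant octet comes first, as in the [103][221] example; real hardware may store the octets in the opposite order.

```python
import struct

octets = bytes([103, 221])  # two consecutive octets

# Read both octets as one base-256 number, most significant octet first.
unsigned = int.from_bytes(octets, byteorder="big", signed=False)

# The same pair of octets can be read as unsigned or as signed.
high = bytes([255, 254])
as_unsigned = int.from_bytes(high, byteorder="big", signed=False)
as_signed = int.from_bytes(high, byteorder="big", signed=True)

# Four octets read as a non-integer (IEEE 754 single precision, an assumed format).
as_float = struct.unpack(">f", bytes([63, 192, 0, 0]))[0]

print(unsigned, as_unsigned, as_signed, as_float)
```

The octets never change between the unsigned and signed reads; only the instruction used to read them does.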

Note that we’ve already mentioned many possible meanings of a single octet. Maybe it’s a number in its own right; maybe it’s the first digit in a larger number, or the second digit in a larger number, and so on; and maybe that number should be left as is, or treated as a signed integer, or treated as a non-integer. How does the computer know which one to do? We tell it.

Computers follow instructions we give them, and there are different instructions for each of these interpretations. I can tell the computer “go to position 39,048,452 in the list of octets called ‘memory’ and read four octets as a non-integer number.” I can also tell it “go to position 39,048,454 in memory and read two octets as a non-negative integer.” There are two octets (those at positions 39,048,454 and 39,048,455) that both of these instructions will access, and each instruction will interpret them in a different way. It is hard to come up with a scenario where I’d want to read the same octets as representing two different kinds of things, but that’s on me, the programmer, to figure out: the computer just blindly does what I tell it to do.
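The two overlapping instructions can be sketched in Python with the standard `struct` module. The positions here are shrunk to 4 and 6 so the stand-in `memory` can be tiny, and the octet values and the big-endian, IEEE 754 single-precision formats are assumptions made for this example.

```python
import struct

# A stand-in for memory: 16 octets, where "position" is an index into it.
memory = bytearray(16)

# Put the four octets 64, 73, 15, 219 at positions 4 through 7.
memory[4:8] = bytes([64, 73, 15, 219])

# "Go to position 4 and read four octets as a non-integer number."
(as_float,) = struct.unpack_from(">f", memory, 4)

# "Go to position 6 and read two octets as a non-negative integer."
(as_uint,) = struct.unpack_from(">H", memory, 6)

print(as_float, as_uint)
```

The octets at positions 6 and 7 take part in both reads: once as the low half of a number close to pi, once as a small integer in their own right.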

In computing jargon, the way we read a particular sequence of octets is called their “type.” (Again, “type” is jargon. What if I need to talk about different types of types? The jargon for that is “kinds of types”, and then “sorts of kinds of types”, though needing to talk about sorts of kinds of types is rare enough that I’ve only used that jargon a few times in my life.) We often want more types than just numbers, and have established a small number of approaches to getting more types: indirection, enumeration, consecutiveness, and adjacency.

Indirection works by storing not the value itself but rather the position in memory where the value can be found.
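A minimal sketch of indirection in Python, assuming a 32-octet stand-in for memory and positions small enough to fit in a single octet (real pointers are usually several octets wide):

```python
memory = bytearray(32)

# Store the value 200 at position 20.
memory[20] = 200

# Store the *position* 20 at position 4; position 4 now holds a pointer.
memory[4] = 20

# Following the pointer takes two reads: first the position, then the value.
pointer = memory[4]
value = memory[pointer]

print(pointer, value)
```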

Enumeration works by defining a mapping between numbers and some other concept. One of the best known of these is ASCII, which maps single-octet numbers to things needed to run a digital typewriter. For example, 10 means “go down one line”; 13 means “go to the beginning of the current line”; 65 through 90 mean “type a capital letter” with A = 65, B = 66, and so on up to Z; 97 through 122 type lower-case letters; and so on.
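To make the mapping concrete, here is a sketch in Python that names the typewriter action for a few octets. The function `describe` and the sample octets are made up for this example, and the table is deliberately partial: it covers only the codes mentioned above.

```python
# Octets chosen for this example: "H", "i", then the codes 13 and 10.
octets = bytes([72, 105, 13, 10])

def describe(octet: int) -> str:
    """Name the typewriter action an ASCII octet stands for (partial table)."""
    if octet == 10:
        return "go down one line"
    if octet == 13:
        return "go to the beginning of the current line"
    if 65 <= octet <= 90 or 97 <= octet <= 122:
        return "type the letter " + chr(octet)
    return "some other action"

actions = [describe(o) for o in octets]
for action in actions:
    print(action)
```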

Consecutiveness lets us make lists of things by the simple expedient of placing the octets of each item in the list at consecutive locations in memory. We have two techniques for knowing how many things are in the list: we can have the length be a known value stored outside the list itself, or we can use a special “end of list” value. ASCII, for example, uses 0 to mean “end of list” so that we can encode the entire list of actions we want a typewriter to do, regardless of how long that list is, and execute it by repeating the following simple rule:

if octet in current position is 0 { stop } otherwise { tell the typewriter to do the thing this octet represents; move to the next position }
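The rule above can be sketched as a runnable Python loop. Here `run_typewriter` is a hypothetical name, and collecting characters into a string stands in for telling a real typewriter what to do.

```python
def run_typewriter(memory: bytes, position: int) -> str:
    """Apply the rule repeatedly, starting at `position`, until a 0 octet."""
    typed = []
    while memory[position] != 0:
        # Do the thing this octet represents (here: record the character).
        typed.append(chr(memory[position]))
        # Move to the next position.
        position += 1
    return "".join(typed)

# Two 0-terminated lists stored one after the other in the same memory.
memory = bytes([72, 101, 108, 108, 111, 0, 66, 121, 101, 0])
first = run_typewriter(memory, 0)
second = run_typewriter(memory, 6)
print(first, second)
```

Nothing but the starting position is passed in; the 0 octet alone tells the loop where each list ends.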

Combined with indirection, this means we can process arbitrarily long typeable content as a single number: the location in memory where we can find the start of the typeable content’s ASCII encoding. That, in turn, means I can make a list of typeable things by placing those starting locations (which in computer jargon are called “pointers”) consecutively. And I can operate on this rapidly: for example, if I use 4-octet numbers for each pointer and want to learn the 8th typing action of the 382nd typeable thing (counting both from zero), then I read a 4-octet number at a location 4 × 382 = 1528 octets after the start of my list, add 8 to that number, and use the result as a memory location to read a 1-octet value, which I then interpret as ASCII.
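The same lookup can be sketched in Python on a much smaller list. The three texts, the layout of `memory`, and the helper name `action_at` are all assumptions made for this example; items and actions are counted from zero, and pointers are 4-octet big-endian numbers.

```python
import struct

# Three 0-terminated ASCII encodings laid end to end, starting at position 0.
texts = b"apple\x00" b"banana\x00" b"cherry\x00"
starts = [0, 6, 13]  # the position where each encoding begins

# The list of pointers: 4 octets each, stored right after the texts.
list_start = len(texts)
memory = bytearray(texts) + b"".join(struct.pack(">I", s) for s in starts)

def action_at(item: int, action: int) -> str:
    """Return action number `action` of item number `item`, both from zero."""
    # Read the 4-octet pointer found 4 * item octets after the list's start.
    (pointer,) = struct.unpack_from(">I", memory, list_start + 4 * item)
    # Add the action number and read one octet there, interpreted as ASCII.
    return chr(memory[pointer + action])

print(action_at(1, 2), action_at(2, 0))
```

Each lookup costs one multiplication, two additions, and two memory reads, no matter how long the list or the texts are.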

Adjacency is like consecutiveness in that it stores several things next to one another in memory, but it lets them each be their own type. For example, I could store a list of numbers of any length using two numbers stored next to one another in memory: the length of the list and a pointer to the list’s first value. I could store a person in a simplified genetic family tree using a pointer to a list of characters representing their name, several numbers representing their birth date, and two more pointers, one to the person providing the egg and the other to the person providing the sperm that jointly became this person.
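Here is a sketch in Python of the first example, a length stored next to a pointer. The positions (8 and 40), the field widths, and the name `read_list` are choices made for this illustration only.

```python
import struct

memory = bytearray(64)

# A list of three 2-octet numbers stored consecutively at position 40.
struct.pack_into(">3H", memory, 40, 500, 600, 700)

# The adjacent pair at position 8: a 4-octet length, then a 4-octet pointer.
struct.pack_into(">II", memory, 8, 3, 40)

def read_list(record_position: int) -> list:
    """Read the (length, pointer) pair, then follow the pointer to the items."""
    length, pointer = struct.unpack_from(">II", memory, record_position)
    return [struct.unpack_from(">H", memory, pointer + 2 * i)[0]
            for i in range(length)]

numbers = read_list(8)
print(numbers)
```

The two fields sit side by side but are read differently: the first as a count, the second as a position to visit.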

That is it. Virtually all data in all computers everywhere is made up of numbers, some interpreted as pointers or enumerations, combined consecutively or adjacently. There’s no law limiting data to this, and every once in a while you’ll bump into an outlier type, like the bitvector or the heapified array, that uses some fifth technique, but the vast majority of data types, and by extension data itself, are just these few components carefully assembled.