Data in a file

Why files are not just copies of memory.

Last post I described how data is stored in memory, where memory is just a sequence of octets. Files on disk are also sequences of octets, and we also want to store data in files, so it might seem like we could simply copy memory into a file and save work in any program. Technically we can do that, but we almost never do.

The memory used by a computer program stores much more than just the data we wish to save. Some of this is visible: window size, view preferences, undo history, cursor position, and so on. Much of it is invisible: bookkeeping information used to execute the program and cached copies of computations to speed things up. Storing this information to the file would waste disk space, annoy users by changing their preferences incorrectly, and possibly leak private information.

So at a minimum, files need to be selective in what they store. That means removing some data, which means moving the rest of the data to not leave big gaps, which means changing most pointers, which always makes this a non-trivial process.
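As a rough sketch of that process, here is one way it might look in Python. The `Node` class, its `cached_degree` field, and the two helper functions are all made up for illustration; the point is only that bookkeeping is dropped and references between objects are rewritten as positions in a list.

```python
class Node:
    """A hypothetical in-memory graph node."""

    def __init__(self, label):
        self.label = label
        self.neighbors = []        # references (pointers) to other Node objects
        self.cached_degree = None  # bookkeeping we do not want in the file


def to_records(nodes):
    """Keep only the data worth saving and turn pointers into list indices."""
    position = {id(node): i for i, node in enumerate(nodes)}
    return [
        {"label": node.label,
         "neighbors": [position[id(other)] for other in node.neighbors]}
        for node in nodes
    ]


def from_records(records):
    """Rebuild the nodes, turning indices back into references."""
    nodes = [Node(record["label"]) for record in records]
    for node, record in zip(nodes, records):
        node.neighbors = [nodes[i] for i in record["neighbors"]]
    return nodes


# Two nodes that point at each other.
a, b = Node("a"), Node("b")
a.neighbors.append(b)
b.neighbors.append(a)
records = to_records([a, b])
```

The records contain no memory addresses, so they still make sense when loaded into a different process where every object lives somewhere else.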

Memory is also often missing some information that we want to store in our data. Some information, like the types of values or the lengths of lists, might have been implicit in the programmer’s mind and not stored anywhere in memory. Some might have been encoded inside the instructions rather than the data, so that it is technically in memory but not in a useful form. So in addition to removing data, we may need to add some too.
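A small sketch of adding that missing information, using Python’s `struct` module. The layout here (a one-octet type tag, a 4-octet count, then 4-octet little-endian integers) is invented for this example and is not any standard format.

```python
import struct


def pack_int_list(values):
    """Write a type tag and a length in front of the values themselves."""
    tag = b"I"                               # made-up tag meaning "list of 32-bit ints"
    count = struct.pack("<I", len(values))   # the length, now stored explicitly
    body = struct.pack(f"<{len(values)}i", *values)
    return tag + count + body


def unpack_int_list(blob):
    """Read the tag and length back before trusting the rest of the octets."""
    if blob[:1] != b"I":
        raise ValueError("not a list of 32-bit ints")
    (count,) = struct.unpack_from("<I", blob, 1)
    return list(struct.unpack_from(f"<{count}i", blob, 5))


blob = pack_int_list([1, 2, 3])
```

Twelve octets of values become seventeen octets in the file; the extra five are the type and length that the program previously just knew.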

Data is often organized in memory in a less-than-compact way in order to make operating on the data faster. Files are often organized to be small, even if that means they take longer to load from and save to. Removing indirection, blank space, and redundant information is commonly part of creating a file. Full time-consuming compression algorithms are often added on top of that. Sometimes new redundancy is added that is designed explicitly to detect if the file was loaded correctly.
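Both ideas, compression and deliberately added redundancy, can be sketched with Python’s `zlib` module. The layout below (a 4-octet CRC-32 of the original data followed by the compressed data) is an assumption made for this example, not a real file format.

```python
import struct
import zlib


def pack_payload(payload):
    """Prefix a checksum of the original data, then store it compressed."""
    checksum = zlib.crc32(payload)
    return struct.pack("<I", checksum) + zlib.compress(payload)


def unpack_payload(blob):
    """Decompress, then use the stored checksum to detect a bad load."""
    (expected,) = struct.unpack_from("<I", blob, 0)
    payload = zlib.decompress(blob[4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: data was not loaded correctly")
    return payload


original = b"abc" * 1000
stored = pack_payload(original)
```

Because the original is highly repetitive, `stored` is far shorter than `original`, and the four checksum octets are new redundancy that exists only to catch damage.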

The end result of this is that, while the underlying building blocks of data are the same for memory and files, they often look very different from one another.

Let’s look at a few examples:

Some file formats are designed to basically be programs that, if executed, result in the data. PostScript was a popular example, though its descendant PDF has since become more popular and removed most of PostScript’s programmatic parts. Pickle is another example, used by Python to save objects with all of the potential complexities of Python’s built-in data types.
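The program-like nature of Pickle can be seen with the standard `pickletools` module, which lists the instructions inside a pickle. This is only a quick look at one small value; the exact instructions vary with the protocol version and the data.

```python
import pickle
import pickletools

# Serialize a small list using pickle protocol 2.
blob = pickle.dumps([1, "two"], protocol=2)

# Each opcode is one instruction for the small stack machine that
# rebuilds the data when the pickle is loaded.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(blob)]
print(opcode_names)

# Loading the pickle runs those instructions to produce the list again.
restored = pickle.loads(blob)
```

Because loading a pickle means running its instructions, Python’s documentation warns against unpickling data from sources you do not trust.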

Some file formats present a puzzle that, when solved, results in the data. JPEG, for example, stores many things, but the bulk of them describe a set of trigonometric functions that, when solved, result in a grid of pixel colors. Zip is solved by various shuffling and duplicating actions, resulting in an entire directory tree of other files. Most compression techniques are like this: we pick a puzzle family that computers can solve and that tends to have smaller puzzles for desirable larger results.
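To give a feel for the “duplicating actions”, here is a toy puzzle solver loosely in the spirit of the back-references used by Zip’s compression. It is a deliberate simplification: the token format is invented for this sketch and is not what a real Zip file contains.

```python
def expand(tokens):
    """Solve the puzzle: tokens are literal bytes or (distance, length) copies."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, bytes):
            out += token                      # literal octets, used as-is
        else:
            distance, length = token
            if not 0 < distance <= len(out):
                raise ValueError("copy reaches before the start of the output")
            for _ in range(length):
                out.append(out[-distance])    # duplicate an octet we already produced
    return bytes(out)


puzzle = [b"ab", (2, 6), b"!"]
solved = expand(puzzle)
```

Here `solved` is `b"abababab!"`: a copy may overlap the octets it is producing, which is how a short puzzle can describe a long repetitive result.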

Some file formats are just the data you’d store in memory, stripped of excess stuff and put in a file. Txt, also called “plain text”, is a classic example: the entire file consists of just the octets you’d put in a list in memory, where the type of the values in that list is an enumeration type called a “character encoding”.
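A short illustration of why the character encoding matters, using two encodings that Python ships with. The word chosen is arbitrary.

```python
text = "naïve"

# The same characters become different octets under different encodings.
as_utf8 = text.encode("utf-8")      # the ï takes two octets
as_latin1 = text.encode("latin-1")  # the ï takes one octet

# A plain text file does not say which encoding it uses, so reading
# UTF-8 octets as if they were Latin-1 gives the wrong characters.
misread = as_utf8.decode("latin-1")
```

The file holds nothing but those octets, so the reader has to know or guess the encoding from somewhere else.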

Many files have some metadata along with the data. BMP, for example, is mostly the octets used by early screens to store small images, but there were several such formats so it starts with a few octets identifying itself as a BMP file and identifying which format its data is in. That’s true of JPEG and other image formats as well: there’s some identifying information and some data about the way the rest of the file is organized. Once format metadata is added, informative metadata is often added; all major image formats let you add in copyright statements, notes about what’s depicted in the image, and so on.
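The identifying octets at the start of a file are easy to check. The sketch below looks only at the well-known opening signatures of three image formats; the function name is made up, and a real reader would go on to parse the rest of the header to learn how the data is organized.

```python
# Opening octets that identify some common image formats.
SIGNATURES = [
    (b"BM", "BMP"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]


def guess_format(first_octets):
    """Return a format name based on how the file begins, or None if unknown."""
    for signature, name in SIGNATURES:
        if first_octets.startswith(signature):
            return name
    return None


print(guess_format(b"BM" + bytes(12)))
```

To use this on a real file, one could read its first few octets with `open(path, "rb").read(8)` and pass them in.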

On-file data is more varied in structure than in-memory data, in part because it is less constrained by how it is used. In-memory data aims to be efficient for computer hardware to use, and hardware has only a limited set of operations. It also aims to be easy to describe in general-purpose programming languages, which humans learn quickly and keep in their heads. File formats also have to be handled by computers and programmers, but because they’re not on the time-critical path those constraints matter less.