The Rise of Text-Based File Formats

How the rise of the universal editor changed file formats.

The previous two posts discussed data in memory and file formats, including why file formats vary so widely in form.

This post attempts to explain why many, many file formats don’t use that variety.

Since very early in the history of computing, one of the most complicated kinds of data, computer programs, has been represented as text: that is to say, in a file format that is simply a list of characters. The motivation for this is multifold, but a big part is that text input and output are readily available. A keyboard and a terminal are all that is needed, so when a new computer language comes along, no new tooling is needed: anyone can write programs in the language using the tools at hand.

The text-based nature of programming languages has done much to define the development of programming. On one hand, tool developers focus on how to add optional but useful extras on top of the basic text editor. On the other, programming language developers focus on how to better express in text the components of good programs. With a standard interface in between the two, both can develop independently. Both are also separated by and bound to this interface, which can be argued to be problematic, but that’s the topic for another day.

Many file formats started out as text, for the same reason: every computer has the tools to edit text. When trying to debug, refine, or test file formats it is very convenient to be able to open files in an editor, see what they look like, and adjust them. But text is not very compact and takes time to parse into in-memory data, so for many years the prototype text formats were later replaced with more compact and more efficient non-text formats.
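To make the trade-off concrete, here is a minimal sketch that stores the same list of numbers once as JSON text and once in a made-up binary layout (a 4-byte count followed by 8 bytes per value). The sample data and the layout are assumptions chosen for illustration, not any particular real format.

```python
import json
import math
import struct

# Hypothetical sample data: 1000 floating-point readings.
readings = [math.sqrt(i) for i in range(1000)]

# Text form: readable and editable in any text editor.
as_text = json.dumps(readings).encode("utf-8")

# Binary form (illustrative layout): little-endian count, then raw doubles.
as_binary = struct.pack("<I", len(readings)) + struct.pack(
    f"<{len(readings)}d", *readings
)

# Both forms decode back to the same in-memory list.
from_text = json.loads(as_text.decode("utf-8"))
count = struct.unpack_from("<I", as_binary, 0)[0]
from_binary = list(struct.unpack_from(f"<{count}d", as_binary, 4))

print(len(as_text), len(as_binary))
```

The binary form is a fixed 8004 bytes and can be decoded with simple arithmetic; the text form is larger and must be parsed character by character, but it can be inspected and fixed by hand.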

Two other purposes have used text-based files since early in computing. One is configuration files: each is small and edited rarely but there are many of them, making the benefit of using readily-available text editors far outweigh the inefficiencies of text. The other is markup for mostly-text data, from the 1964 RUNOFF to the ubiquitous current HTML.

To the best of my knowledge, the first text-based general-purpose data language was the one used by and documented with Personal Ancestral File 2.0 in 1985, which would later be refined and released as GEDCOM 3.0. (GEDCOM has no version before 3.0; those numbers were used for internal development and never released.) This system had a simple, text-based format that had direct representation of numbers, text, pointers, lists, and composite types. It was designed to be both directly editable in a text editor and a suitable format for simple software to operate on when only small parts of it could fit in memory. However, it was posed in the context of a genealogical application and, while still ubiquitous in that domain, was never adopted elsewhere.
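As a rough illustration, the sketch below parses a few lines written in the line-oriented style of later GEDCOM versions, where each line is a nesting level, an optional `@id@`, a tag, and an optional value. The sample records are invented, the `parse` helper is hypothetical, and this is not claimed to match the original Personal Ancestral File 2.0 format in detail.

```python
# Invented sample data in a GEDCOM-like style: "level [@id@] TAG [value]".
SAMPLE = """\
0 @I1@ INDI
1 NAME Ada /Example/
1 BIRT
2 DATE 1 JAN 1900
1 FAMS @F1@
0 @F1@ FAM
1 WIFE @I1@
"""


def parse(text):
    """Return (top-level records, index of records by @id@)."""
    roots, index, stack = [], {}, []
    for line in text.splitlines():
        if not line.strip():
            continue
        level_str, rest = line.split(" ", 1)
        level = int(level_str)
        xref = None
        if rest.startswith("@"):
            # The token after the level is this record's own identifier.
            xref, rest = rest.split(" ", 1)
        tag, _, value = rest.partition(" ")
        node = {"tag": tag, "value": value, "children": []}
        if xref is not None:
            index[xref] = node
        # Adjacency: the level number says which open node is the parent.
        del stack[level:]
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots, index


roots, index = parse(SAMPLE)
person = index["@I1@"]
birth = next(c for c in person["children"] if c["tag"] == "BIRT")
print(birth["children"][0]["value"])  # adjacency: prints "1 JAN 1900"
link = next(c for c in person["children"] if c["tag"] == "FAMS")
print(index[link["value"]]["tag"])    # indirection: prints "FAM"
```

Nested data sits directly under its parent by level number, while a value such as `@F1@` points at a record stored elsewhere in the file. Because each line carries its own level, a simple program can process the file one line at a time.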

The next text-based general-purpose data language is probably XML. XML descended from SGML, which descended from GML, which was a text markup language for adding bold and headings and enumerated lists in text, and that origin is evident in XML today: its basic structure is to tag regions of text with arbitrary metadata. Many formats use it with almost no text, only tags, which are flexible enough to encode text, numbers, lists, and composite types. XML has no native ability to handle indirection, a significant limitation that has half a dozen workarounds.
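Here is a small sketch of XML used as a data language with almost no free text, together with one common workaround for indirection: give an element an `id` attribute and refer to it by name from elsewhere. The element names, the `ref` attribute, and the sample data are all assumptions made up for this example; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Invented document: tags carry all the structure, text only holds values.
DOC = """\
<library>
  <author id="a1"><name>Ada Example</name></author>
  <book><title>First</title><pages>120</pages><by ref="a1"/></book>
  <book><title>Second</title><pages>95</pages><by ref="a1"/></book>
</library>
"""

root = ET.fromstring(DOC)

# XML itself does not follow the reference; the application builds a lookup.
authors = {a.get("id"): a for a in root.iter("author")}

books = []
for book in root.iter("book"):
    author = authors[book.find("by").get("ref")]
    books.append(
        {
            "title": book.findtext("title"),
            "pages": int(book.findtext("pages")),  # numbers arrive as text
            "author": author.findtext("name"),
        }
    )

print(books)
```

Note that both the number conversion and the reference lookup are the application's job; a different XML-based format could make either choice differently.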

It is hard to overstate the impact of XML. If you have a file format made in the past fifteen years, there’s a good chance it’s either XML itself or a zip with a few XML files inside it. Documents, spreadsheets, images, 3D models, citations, apps, configurations, window layouts, image metadata: XML is used everywhere. If the XML is too big, we simply zip it: it’s still text underneath.
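The "zip with a few XML files inside" pattern can be sketched with Python's standard `zipfile` module. The file names `content.xml` and `meta.xml` and their contents are hypothetical, chosen only for illustration, and do not follow any specific real document format.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Write a small archive to memory; a real format would write to a file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", "<doc><p>Hello</p><p>World</p></doc>")
    archive.writestr("meta.xml", "<meta><title>Example</title></meta>")

# Reading it back: unzip one member, then parse it as ordinary XML text.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    content = ET.fromstring(archive.read("content.xml"))

paragraphs = [p.text for p in content.iter("p")]
print(names, paragraphs)
```

Anyone with an unzip tool and a text editor can do the same by hand, which is the point: the compression is a wrapper, and the data underneath is still text.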

XML has its detractors. Unlike GEDCOM, it was not designed to be a general-purpose data language but instead an advanced markup language. Its repurposing as a data language leads to various quirks, some of which can trip up even careful programmers.

The currently most-popular text-based general-purpose data language is JSON. JSON is a subset of the way the programming language JavaScript expresses data. (JavaScript’s 1995 data notation was the same as Python’s 1991 notation; Python’s was almost the same as ABC’s 1987 notation. Somewhat similar notation can also be found in several other languages of that era.) As such, it can represent numbers, text, lists, and composites. However, like XML, it does not have pointers, so each user of JSON encodes indirection in its own way.
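One way a JSON user might encode indirection is sketched below: records are stored in a table keyed by identifier, and a pointer is written as a small object holding that identifier. The `ref` key, the `resolve` helper, and the sample data are conventions invented for this sketch; JSON itself attaches no meaning to them, and other formats pick different conventions.

```python
import json

# Invented sample: two records that point at each other.
TEXT = """
{
  "people": {
    "p1": {"name": "Ada", "spouse": {"ref": "p2"}},
    "p2": {"name": "Ben", "spouse": {"ref": "p1"}}
  }
}
"""

people = json.loads(TEXT)["people"]


def resolve(value):
    """Follow a {"ref": id} object to the record it names (our convention)."""
    if isinstance(value, dict) and set(value) == {"ref"}:
        return people[value["ref"]]
    return value


ada = people["p1"]
ben = resolve(ada["spouse"])
print(ben["name"])                    # prints "Ben"
print(resolve(ben["spouse"]) is ada)  # prints "True": the cycle closes
```

A generic JSON parser sees `{"ref": "p2"}` as just another object, so every program reading the file has to know the convention in order to follow the pointer.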

Text-based general-purpose data languages owe much of their success to the decreasing cost of disk space, the increasing effectiveness of compression, and the huge benefit of letting those who wish to do so look at and edit the guts of the files using text editors, tools everyone owns. JSON and XML are particularly successful because they were backed by major software vendors: there is value in using what other people use. GEDCOM, while older and technically superior to both in having built-in pointers, has not seen nearly as much buy-in. Many others have risen since, like SDLang, PYX, OGML, and Turtle. A few, like YAML and TOML, are starting to make inroads into the space currently dominated by XML and JSON.

A few text-based general-purpose data languages, like YAML, Turtle, and JSON-LD, have a native notion of pointer but do not distinguish between composites stored with adjacency and composites stored with indirection. There’s a reason for treating them the same: in memory, data is found using arithmetic, so indirection helps keep things easy to find; in text, data is found by looking for visually distinct patterns like indentation and reserved glyph sequences, so indirection is reserved for repeated use and footnote-like asides. To the best of my knowledge, GEDCOM is the only text-based general-purpose data language that handles both adjacency and indirection separately and completely.

Non-text file formats remain today. General-purpose non-text file formats are also starting to gain popularity, including SQLite, BSON, MessagePack and its controversial clone CBOR, Protocol Buffers, FlatBuffers, and FlexBuffers. I’m quite fond of some of these and use them when I need small, fast-to-parse file formats. But for most uses, text-based file formats win the day. The knowledge that I can open them up and see and change what’s inside if I need to is something I’ll gladly sacrifice a little disk space and a little file loading time to gain.