How to remove significant amounts of redundancy.
This post took longer to write than I expected. How do you explain a topic so far outside the normal experience of people? I didn’t end up with a polished post, but I hope it is followable.
There are roughly two kinds of compression. In one kind you make the expected data small; for example, in Morse code “e” is less than 25% the size of “q”. In the other kind you try to remove redundancy so that there is no wasted space, every bit matters completely.
For years I’ve been writing this second kind of compression following a simple rule that I developed bit-by-bit. I have no idea if this rule has been published somewhere; very likely it has a name and someone will be upset I didn’t cite their work. But since I don’t know who they are and don’t care to plow through the literature of a field foreign to my usual focus to find them, I’ll just post it as is.
There are four steps in my compression and decompression-building rule.
First, take whatever it is you are compressing and turn it into a sequence of tokens. Some sequences make the later steps much easier than others, but in this example I’m going to use text which has a natural representation as a sequence of letters or words.
Second, come up with a way to determine, for every sequence, the set of tokens that could be appended to it. For example, in one of my dictionaries if I have “alp” the next letter can be one of “a”, “e”, “h”, “i”, or “s”; or the word could end right there with “alp”. If I have “prandi” the only next token allowed by that dictionary is “a” and the word cannot stop at “prandi.”
Third, number the options at each step sequentially starting with 0. It is important that, if the sequence cannot stop where it is, then appending token 0 repeatedly will eventually result in a sequence that can stop. Thus, for the set of odd numbers token 0 should not be the digit “0” because appending “0”s to an even number never creates an odd number.
Fourth, I use the following two routines to actually do the compression and decompression. The result is an integer between 0 and the number of things that can be represented. In general, shorter sequences have smaller numbers.
As an example, consider Dana’s nth-great grandmothers. Valid descriptions begin with “Dana”, end with “’s mother”, and have an arbitrary number of “’s father” and “’s mother” in between. Since token 0 needs to move towards a terminating point, we’ll call “’s father” token 1 and and “’s mother” token 0.
To compress “Dana’s father’s mother’s mother”,
To decompress 5 I’d do
It should be noted that this method is not perfect. In particular, some numbers are not generated by any encoding if there are sequences that have no possible successors. Thus, my dictionary of 173,540 words compresses the word “compression” as 58,889,155. I expect I’ll continue to refine and shrink the process over the coming years. But for right now it’s the best I’ve got.