Missing Parsers
© 26 Jul 2016 Luther Tychonievich
Licensed under Creative Commons: CC BY-NC-SA 3.0
other posts

A “‍solved‍” problem that isn’t.


In computing, parsing refers to the process by which the underlying structure of text is understood by the computer. This sense is surprisingly close to the English verb “‍parse‍” which has meant “‍identify the parts of speech of a sentence‍” for five hundred years. Parsing was tackled early in computing and solved with unusual thoroughness. Not only do we have a well-defined hierarchy of parsing ease, we also have many efficient algorithms at each of those levels (except the hardest few) most with documented strengths and weaknesses compared to other algorithms.

Two levels of the hierarchy are particularly common in computing today. These are named “‍regular‍” and “‍context-free‍” (both poor-quality names in my opinion, but too widespread to change now). Each of these have a well-understood method of expressing parsing problems (regular expressions and context-free grammars). Each has several implementation approaches and each of those approaches has been implemented multiple times in large, well-tested, mature software packages. But there are two differences worth highlighting.

Usable Availability

Etymology trivia: “‍available‍” and “‍valiant‍” share the same root, “‍valere‍”, referring to strength, worth, or ability. Stories sometimes refer to valiant heroes, but in my experience available heroes sufficient and rare enough that valiance is not even on my checklist.

Every “‍Every‍” is a bold claim, since I have learned more than six dozen languages in my time, but I really couldn’t find an exception. Even Octave and Sage, languages oriented almost entirely around mathematics, have a built-in regular parsers. major programming language I have used has built in to it a robust implementation of a regular parser. Often it is part of the “‍standard library‍”; other times it is wired into the very syntax of the language. Either way, if I need to parse, regular expressions are almost always available and easy to use.

The easy-to-use part is important. Regular expressions as commonly used today let me specify not just what kinds of things I want to parse but also how I want the information to be returned to me after parsing. This lets me concisely express ideas such as “‍replace any occurrence of ‘‍the‍’ which is its own word and is followed by a capitalized word with ‘‍The‍’ instead‍”: more concisely than I can in English, in fact: it’s just “‍replace \bthe(\s+[A-Z]) with The\1‍”

On the other hand, context free parsers are almost never part of the default installation of a language. You can usually find a third-party package for helping with context-free parsing, but often they have extra quirks such as only handling a subset of context-free languages, having a different way of expressing grammars than other parsers do, or requiring the entire grammar to be available at compilation time.

More significantly, most context-free parsers are painful to use. They tend to hand back more data than you want and fail to hand back some data you wish you had. Customizing their functionality often requires writing source code inside the grammar itself (generally with various restrictions on what you may write) or writing nontrivial code to take whatever the parser returned and transform it into what you wanted. Even a simple context-free task like “‍replace doubly-parenthesized expressions with singly-parenthesized forms‍” would be tricky to do in less than 50 lines of code.


Etymology trivia: “‍express‍” comes from “‍ex-‍” (out) and “‍press‍” (squeeze, push). I like the image this creates of shoving a lump of raw thought into a pasta press and squeezing out an expression of that thought: clean and presentable, but hardly the full original idea.

Humans have created a lot of context-free languages in the past fifty years. Almost every programming language and many data and markup formats are context free. Mathematical notation, as used for hundreds of years, is also context free once you get around the fact it is not linear in presentation. Human languages like English are not context-free, but then they also don’t follow any rules completely.

By contrast, very little is described by regular languages. You can express the idea of “‍a sequence of numbers separated by operators‍” using a regular expression

[0-9]+( *[-+] *[0-9]+)*

but add in parentheses and it is no longer regular, though it is context-free This example is neither the most concise nor most verbose variant of context-free grammars. The lack of standard tools has likely contributed to the lack of standard syntax.

e ::= n | n s /[-+]/ s e s ::= / */ n ::= /[0-9]+/ | "(" s e s ")"

Overgeneralizing, regular languages have the notion of sequence (number after number) but not containment It’s actually arbitrarily deep nested containment that isn’t usually regular, but in my experience humans tend to assume containment means nesting so the distinction rarely matters. (numbers inside parentheses).


Etymology trivia: “‍kludge‍” was coined for comedic effect by Jackson Granholm in 1962; it caught on in programming but not in most other fields.

Parsing is now in a state where many problems ought to be solved using tools that are hard to find or use effectively, but where other, not-quite-good-enough tools are readily available. This is the perfect recipe for tool misuse and has resulted in a lot of kludged-together code made up of complicated regular expressions and obtuse special-case-handlers, as well as a lot of written-by-hand custom parsers for context-free languages with all the attendant fragility and bug prevalence that such one-off implementations of non-trivial algorithms tend to create.

Thus my conclusion: parsing is not a solved problem. There are probably advances that can be made algorithmically I enjoy reading about latest advances in LL(k) and SGLR and Marpa and so on more than most, and have written several parsing algorithms of my own. That new algorithms are not needed does not mean they are not interesting. , but my main concern is that after decades of relatively stable grammars and algorithms, access to usable context-free parsing libraries remains rare.

As with so many other things, technical advances without good wrapping are of minimal benefit.

Looking for comments…

Loading user comment form…