A notation for describing the languages of Finite Automata.

Last week I discussed state machines; in particular I presented Deterministic Finite Automata, the simplest of the machines recognizing languages that form the subject matter of computability theory. In this post I discuss a common notation for describing the languages that these automata recognize, though I won’t show that that is what I am doing until a later post.

Strings, Languages, Expressions

Recall that an alphabet is a set of symbols, like {a, b, #}; a string is an ordered list of symbols from that alphabet like aab#ba##, and a language is a set of those strings, like {baa, aba, ##a#b}.

Our goal today is to come up with a succinct notation for describing languages that might contain infinitely many strings, such as “‍strings with a b after every a‍” or “‍strings with more #s than bs.‍” We won’t succeed in creating a notation for “‍more #s than bs‍” in this post… We’ll call representations of this notation expressions and display them thus.

Regular Expressions

Let’s start simple. Let’s let a sequence of symbols from our alphabet represent that sequence of symbols: a describes the language {a}, baa describes the language {baa}, etc. We don’t have any way to describe languages with more than a single string in them, but we have to start somewhere.

Let’s add a symbol not in our language—say, ?—that means something else. Let’s have it mean “‍You can include the previous symbol or not, both are ok.‍” Thus ab? describes the language {ab, a}, b?a?a? describes the language {baa, , ba, a, b, aa}, etc. Our expressions are more succinct now, but still can’t express {abb, a} Try it; you’ll end up with {abb, ab, a} or {abb, ab}. .

So let’s add more symbols, ( and ), and let them group things together. a(bb) describes the language {a}, abb? describes the language {ab, abb}, a(bb)? describes the language {a, abb}, etc. This adds to what we can express, but we’re still missing things like {a, b, #}.

Let’s add another symbols, |. This will mean “‍either the thing on the left, or the thing on the right, but not both‍”. aba|bb describes the language {aba, bb}, a(b|a) describes the language {ab, aa}, a(b|a)? describes the language {a, ab, aa}, etc. Now we can express any finite language, but we still don’t have infinite languages like {a, aa, aaa, …}.

There’s one more symbol we need to make our expressions “‍regular‍”: *. A * is very similar to a ?, but where a? means zero or one as, a* means any number of as, zero or more. aa* describes the language {a, aa, aaa, …}, (b|a)* describes the language consisting of any number of as and bs in any order, etc. Some of these languages are hard to describe any other way: for example, a(bab|aaa?)*b describes the set of strings where each string starts with an a, ends with a b, and in between has any number of bab, aa, and aaa in any order.

There are a variety of shorthands for regular expressions: for example, many use the symbol + to mean “‍one or more‍” instead of *’s “‍zero or more‍”; but all of these are just for convenience, not expressive power.

Why should I care?

Regular expressions exist “‍in the wild‍”: in most word processors and text editors, for example, when you search you can choose to search for regular expressions instead of just simple strings. But if I were going for that I would have taught you about some of the more convenient extras they use such as . and [ and ]. You are welcome to read up on practical regular expressions elsewhere: there are plenty of good tutorials accessible through any search engine.

This is a post about theory. One of the basic theoretic questions we will ask is given two ways of representing a language (in our case regular expressions and DFAs) can we tell if the two are inter-changeable? Another, perhaps more interesting, question is are there languages that cannot be expressed using these techniques? These and others will be investigated in future posts.