Regular Expressions
© 9 Nov 2011 Luther Tychonievich
Licensed under Creative Commons: CC BY-NC-ND 3.0
other posts


A notation for describing the languages of Finite Automata.


Last week I discussed state machines; in particular I presented Deterministic Finite Automata, the simplest of the machines recognizing languages that form the subject matter of computability theory. In this post I discuss a common notation for describing the languages that these automata recognize, though I won’t show that that is what I am doing until a later post.

Strings, Languages, Expressions

Recall that an alphabet is a set of symbols, like {a, b#}; a string is an ordered list of symbols from that alphabet like aab#ba##, and a language is a set of those strings, like {baa, aba, ##a#b}.

Our goal today is to come up with a succinct notation for describing languages that might contain infinitely many strings, such as “‍strings with a b after every a‍” or “‍strings with more #s than bs.‍” We won’t succeed in creating a notation for “‍more #s than bs‍” in this post… We’ll call representations of this notation expressions and display them thus.

Regular Expressions

Let’s start simple. Let’s let a sequence of symbols from our alphabet represent that sequence of symbols: a describes the language {a}, baa describes the language {baa}, etc. We don’t have any way to describe languages with more than a single string in them, but we have to start somewhere.

Let’s add a symbol not in our language—say, ?—that means something else. Let’s have it mean “‍You can include the previous symbol or not, both are ok.‍” Thus ab? describes the language {ab, a}, b?a?a? describes the language {baa, , ba, a, b, aa}, etc. Our expressions are more succinct now, but still can’t express {abb, a} Try it; you’ll end up with {abb, ab, a} or {abb, ab}. .

So let’s add more symbols, ( and ), and let them group things together. a(bb) describes the language {a}, abb? describes the language {ab, abb}, a(bb)? describes the language {a, abb}, etc. This adds to what we can express, but we’re still missing things like {a, b, #}.

Let’s add another symbols, |. This will mean “‍either the thing on the left, or the thing on the right, but not both‍”. aba|bb describes the language {aba, bb}, a(b|a) describes the language {ab, aa}, a(b|a)? describes the language {a, ab, aa}, etc. Now we can express any finite language, but we still don’t have infinite languages like {a, aa, aaa, …}.

There’s one more symbol we need to make our expressions “‍regular‍”: *. A * is very similar to a ?, but where a? means zero or one as, a* means any number of as, zero or more. aa* describes the language {a, aa, aaa, …}, (b|a)* describes the language consisting of any number of as and bs in any order, etc. Some of these languages are hard to describe any other way: for example, a(bab|aaa?)*b describes the set of strings where each string starts with an a, ends with a b, and in between has any number of bab, aa, and aaa in any order.

There are a variety of shorthands for regular expressions: for example, many use the symbol + to mean “‍one or more‍” instead of *’s “‍zero or more‍”; but all of these are just for convenience, not expressive power.

Why should I care?

Regular expressions exist “‍in the wild‍”: in most word processors and text editors, for example, when you search you can choose to search for regular expressions instead of just simple strings. But if I were going for that I would have taught you about some of the more convenient extras they use such as . and [ and ]. You are welcome to read up on practical regular expressions elsewhere: there are plenty of good tutorials accessible through any search engine.

This is a post about theory. One of the basic theoretic questions we will ask is given two ways of representing a language (in our case regular expressions and DFAs) can we tell if the two are inter-changeable? Another, perhaps more interesting, question is are there languages that cannot be expressed using these techniques? These and others will be investigated in future posts.

Looking for comments…

Loading user comment form…