Regular Expressions
other posts

theory

A notation for describing the languages of Finite Automata.

Last week I discussed state machines; in particular I presented Deterministic Finite Automata, the simplest of the machines recognizing languages that form the subject matter of computability theory. In this post I discuss a common notation for describing the languages that these automata recognize, though I won’t show that that is what I am doing until a later post.

# Strings, Languages, Expressions

Recall that an alphabet is a set of symbols, like {`a`, `b``#`}; a string is an ordered list of symbols from that alphabet like `aab#ba##`, and a language is a set of those strings, like {`baa`, `aba`, `##a#b`}.

Our goal today is to come up with a succinct notation for describing languages that might contain infinitely many strings, such as “‍strings with a `b` after every `a`‍” or “‍strings with more `#`s than `b`s.‍” We won’t succeed in creating a notation for “‍more `#`s than `b`s‍” in this post… We’ll call representations of this notation expressions and display them `thus`.

# Regular Expressions

Let’s start simple. Let’s let a sequence of symbols from our alphabet represent that sequence of symbols: `a` describes the language {`a`}, `baa` describes the language {`baa`}, etc. We don’t have any way to describe languages with more than a single string in them, but we have to start somewhere.

Let’s add a symbol not in our language—say, `?`—that means something else. Let’s have it mean “‍You can include the previous symbol or not, both are ok.‍” Thus `ab?` describes the language {`ab`, `a`}, `b?a?a?` describes the language {`baa`, , `ba`, `a`, `b`, `aa`}, etc. Our expressions are more succinct now, but still can’t express {`abb`, `a`} Try it; you’ll end up with {`abb`, `ab`, `a`} or {`abb`, `ab`}. .

So let’s add more symbols, `(` and `)`, and let them group things together. `a(bb)` describes the language {`a`}, `abb?` describes the language {`ab`, `abb`}, `a(bb)?` describes the language {`a`, `abb`}, etc. This adds to what we can express, but we’re still missing things like {`a`, `b`, `#`}.

Let’s add another symbols, `|`. This will mean “‍either the thing on the left, or the thing on the right, but not both‍”. `aba|bb` describes the language {`aba`, `bb`}, `a(b|a)` describes the language {`ab`, `aa`}, `a(b|a)?` describes the language {`a`, `ab`, `aa`}, etc. Now we can express any finite language, but we still don’t have infinite languages like {`a`, `aa`, `aaa`, …}.

There’s one more symbol we need to make our expressions “‍regular‍”: `*`. A `*` is very similar to a `?`, but where `a?` means zero or one `a`s, `a*` means any number of `a`s, zero or more. `aa*` describes the language {`a`, `aa`, `aaa`, …}, `(b|a)*` describes the language consisting of any number of `a`s and `b`s in any order, etc. Some of these languages are hard to describe any other way: for example, `a(bab|aaa?)*b` describes the set of strings where each string starts with an `a`, ends with a `b`, and in between has any number of `bab`, `aa`, and `aaa` in any order.

There are a variety of shorthands for regular expressions: for example, many use the symbol `+` to mean “‍one or more‍” instead of `*`’s “‍zero or more‍”; but all of these are just for convenience, not expressive power.

# Why should I care?

Regular expressions exist “‍in the wild‍”: in most word processors and text editors, for example, when you search you can choose to search for regular expressions instead of just simple strings. But if I were going for that I would have taught you about some of the more convenient extras they use such as `.` and `[` and `]`. You are welcome to read up on practical regular expressions elsewhere: there are plenty of good tutorials accessible through any search engine.

This is a post about theory. One of the basic theoretic questions we will ask is given two ways of representing a language (in our case regular expressions and DFAs) can we tell if the two are inter-changeable? Another, perhaps more interesting, question is are there languages that cannot be expressed using these techniques? These and others will be investigated in future posts.