Regular Expressions

theory# Strings, Languages, Expressions

# Regular Expressions

# Why should I care?

A notation for describing the languages of Finite Automata.

Last week I discussed state machines; in particular I presented Deterministic Finite Automata, the simplest of the machines recognizing languages that form the subject matter of computability theory. In this post I discuss a common notation for describing the languages that these automata recognize, though I won’t show that that is what I am doing until a later post.

Recall that an alphabet
is a set of symbols, like {`a`

, `b`

, `#`

};
a string is an ordered list of symbols from that alphabet
like `aab#ba##`

,
and a language is a set of those strings,
like {`baa`

, `aba`

, `##a#b`

}.

Our goal today is to come up with a succinct notation
for describing languages that might contain infinitely many strings,
such as “strings with a `b`

after every `a`

”
or “strings with more `#`

s than `b`

s.”
We won’t succeed in creating a notation for “more `#`

s than `b`

s”
in this post…
We’ll call representations of this notation expressions
and display them `thus`

.

Let’s start simple.
Let’s let a sequence of symbols from our alphabet represent that sequence of symbols:
`a`

describes the language {`a`

},
`baa`

describes the language {`baa`

}, etc.
We don’t have any way to describe languages with more than a single string in them,
but we have to start somewhere.

Let’s add a symbol not in our language—say, `?`

—that means something else.
Let’s have it mean “You can include the previous symbol or not, both are ok.”
Thus
`ab?`

describes the language {`ab`

, `a`

},
`b?a?a?`

describes the language {`baa`

, `, `

`ba`

, `a`

, `b`

, `aa`

}, etc.
Our expressions are more succinct now, but still can’t express {`abb`

, `a`

}
Try it; you’ll end up with {`abb`

, `ab`

, `a`

} or {`abb`

, `ab`

}.
.

So let’s add more symbols, `(`

and `)`

,
and let them group things together.
`a(bb)`

describes the language {`a`

},
`abb?`

describes the language {`ab`

, `abb`

},
`a(bb)?`

describes the language {`a`

, `abb`

},
etc.
This adds to what we can express, but we’re still missing things like {`a`

, `b`

, `#`

}.

Let’s add another symbols, `|`

.
This will mean “either the thing on the left, or the thing on the right, but not both”.
`aba|bb`

describes the language {`aba`

, `bb`

},
`a(b|a)`

describes the language {`ab`

, `aa`

},
`a(b|a)?`

describes the language {`a`

, `ab`

, `aa`

},
etc.
Now we can express any finite language,
but we still don’t have infinite languages like {`a`

, `aa`

, `aaa`

, …}.

There’s one more symbol we need to make our expressions “regular”: `*`

.
A `*`

is very similar to a `?`

,
but where `a?`

means zero or one `a`

s,
`a*`

means any number of `a`

s, zero or more.
`aa*`

describes the language {`a`

, `aa`

, `aaa`

, …},
`(b|a)*`

describes the language consisting of any number of `a`

s and `b`

s in any order,
etc.
Some of these languages are hard to describe any other way:
for example, `a(bab|aaa?)*b`

describes the set of strings where each string starts with an `a`

, ends with a `b`

,
and in between has any number of `bab`

, `aa`

, and `aaa`

in any order.

There are a variety of shorthands for regular expressions:
for example, many use the symbol `+`

to mean “one or more” instead of `*`

’s “zero or more”;
but all of these are just for convenience, not expressive power.

Regular expressions exist “in the wild”:
in most word processors and text editors, for example,
when you search you can choose to search for regular expressions
instead of just simple strings.
But if I were going for that I would have taught you about
some of the more convenient extras they use
such as `.`

and `[`

and `]`

.
You are welcome to read up on practical regular expressions elsewhere:
there are plenty of good tutorials accessible through any search engine.

This is a post about theory. One of the basic theoretic questions we will ask is given two ways of representing a language (in our case regular expressions and DFAs) can we tell if the two are inter-changeable? Another, perhaps more interesting, question is are there languages that cannot be expressed using these techniques? These and others will be investigated in future posts.

Loading user comment form…

Looking for comments…