A notation for describing the languages of Finite Automata.
Last week I discussed state machines; in particular I presented Deterministic Finite Automata, the simplest of the machines recognizing languages that form the subject matter of computability theory. In this post I discuss a common notation for describing the languages that these automata recognize, though I won’t show that that is what I am doing until a later post.
Recall that an alphabet
is a set of symbols, like {a, b, #};
a string is an ordered list of symbols from that alphabet
like aab#ba##,
and a language is a set of those strings,
like {baa, aba, ##a#b}.
Our goal today is to come up with a succinct notation
for describing languages that might contain infinitely many strings,
such as “strings with a b after every a”
or “strings with more #s than bs.”
We won’t succeed in creating a notation for “more #s than bs”
in this post…
We’ll call representations of this notation expressions
and display them thus.
Let’s start simple.
Let’s let a sequence of symbols from our alphabet represent that sequence of symbols:
a describes the language {a},
baa describes the language {baa}, etc.
We don’t have any way to describe languages with more than a single string in them,
but we have to start somewhere.
Let’s add a symbol not in our language—say, ?—that means something else.
Let’s have it mean “You can include the previous symbol or not, both are ok.”
Thus
ab? describes the language {ab, a},
b?a?a? describes the language {baa, , ba, a, b, aa}, etc.
Our expressions are more succinct now, but still can’t express {abb, a}
Try it; you’ll end up with {abb, ab, a} or {abb, ab}.
.
So let’s add more symbols, ( and ),
and let them group things together.
a(bb) describes the language {a},
abb? describes the language {ab, abb},
a(bb)? describes the language {a, abb},
etc.
This adds to what we can express, but we’re still missing things like {a, b, #}.
Let’s add another symbols, |.
This will mean “either the thing on the left, or the thing on the right, but not both”.
aba|bb describes the language {aba, bb},
a(b|a) describes the language {ab, aa},
a(b|a)? describes the language {a, ab, aa},
etc.
Now we can express any finite language,
but we still don’t have infinite languages like {a, aa, aaa, …}.
There’s one more symbol we need to make our expressions “regular”: *.
A * is very similar to a ?,
but where a? means zero or one as,
a* means any number of as, zero or more.
aa* describes the language {a, aa, aaa, …},
(b|a)* describes the language consisting of any number of as and bs in any order,
etc.
Some of these languages are hard to describe any other way:
for example, a(bab|aaa?)*b
describes the set of strings where each string starts with an a, ends with a b,
and in between has any number of bab, aa, and aaa in any order.
There are a variety of shorthands for regular expressions:
for example, many use the symbol +
to mean “one or more” instead of *’s “zero or more”;
but all of these are just for convenience, not expressive power.
Regular expressions exist “in the wild”:
in most word processors and text editors, for example,
when you search you can choose to search for regular expressions
instead of just simple strings.
But if I were going for that I would have taught you about
some of the more convenient extras they use
such as . and [ and ].
You are welcome to read up on practical regular expressions elsewhere:
there are plenty of good tutorials accessible through any search engine.
This is a post about theory. One of the basic theoretic questions we will ask is given two ways of representing a language (in our case regular expressions and DFAs) can we tell if the two are inter-changeable? Another, perhaps more interesting, question is are there languages that cannot be expressed using these techniques? These and others will be investigated in future posts.
Looking for comments…