Perspective.
In the 1948 landmark paper
A Mathematical Theory of Communication,
Claude Shannon founded the field of information theory and
revolutionized the telecommunications industry, laying the groundwork for today's
Information Age.
In this paper, Shannon proposed using a Markov chain to create a statistical
model of the sequences of letters in a piece of English text.
Markov chains are now widely used in
speech recognition, handwriting recognition, information retrieval,
data compression, and spam filtering.
They also have many scientific computing applications including:
the genemark algorithm for gene prediction, the Metropolis
algorithm for measuring thermodynamical properties,
and Google's PageRank algorithm for Web search.
For this assignment, you will consider a whimsical variant: generating
stylized pseudo-random text.
Markov model of natural language.
Shannon approximated the statistical structure of a piece of text using
a simple mathematical model known as a Markov model.
A Markov model of order 0 predicts that each letter in the
alphabet occurs with a fixed probability.
We can fit a Markov model of order 0 to a specific piece of text by
counting the number of occurrences of each letter in that text,
and using these counts as probabilities.
For example, if the input text is "agggcagcgggcg", then
the Markov model of order 0 predicts that each letter is
'a' with probability 2/13,
'c' with probability 3/13,
and
'g' with probability 8/13. The following sequence of letters
is a typical example generated from this model.
a g g c g a g g g a g c g g c a g g g g ...
An order 0 model assumes that each letter is chosen independently.
This does not coincide with the statistical properties of English text
since there is a high correlation among successive letters in an
English word or sentence.
For example,
the letters 'h' and 'r' are much more likely to
follow 't' than either 'c' or 'x'.
We obtain a more refined model by allowing the probability of
choosing each successive letter to depend on the preceding letter or letters.
A Markov model of order k predicts that each letter occurs with
a fixed probability, but that probability can depend on the previous
k consecutive letters (k-gram).
For example, if the text has 100 occurrences of "th",
with 60 occurrences of "the",
25 occurrences of "thi",
10 occurrences of "tha",
and 5 occurrences of "tho", the Markov model of order 2 predicts
that the next letter following the 2-gram "th" is
'e' with probability 3/5,
'i' with probability 1/4,
'a' with probability 1/10, and
'o' with probability 1/20.
A brute force solution.
Claude Shannon proposed a brute force scheme to generate text according to
a Markov model of order 1.
To construct [an order 1 model] for example, one opens a book at random and
selects a letter at random on the page.
This letter is recorded. The book is then opened to another page and
one reads until this letter is encountered. The succeeding letter is then
recorded. Turning to another page this second letter is
searched for and the succeeding letter recorded, etc. It would be
interesting if further approximations could be constructed, but the labor
involved becomes enormous at the next stage.
Your task is write an efficient Java program to automate this laborious task.
Shannon's brute force approach
yields a statistically correct result, but, as Shannon observed, it is
prohibitively slow when the size of the input text N is large.
An alternate approach is to create a "Markov chain" and simulate a
trajectory through it. This approach is described below.
Do not be afraid.
Help is here.
Markov chain.
For the purpose of this assignment, a Markov chain is comprised of
a set of states, one distinguished state called the start state,
and a set of transitions from one state to another.
The figure below illustrates a Markov chain with 5 states and 14 transitions.

Simulating a Markov chain.
To simulate a trajectory through the Markov chain,
begin at the start state (the one with an incoming arrow from nowhere).
At each step select one of the leaving arcs uniformly at random, and move
to the neighboring state.
For example, if the Markov chain is in state bab, then it will
transition to state abb with probability 3/4 and to state
aba with probability 1/4.
The following are the first ten steps of a possible trajectory beginning
from bbb:
bbb -> bbb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> abb
Markov chain data type.
Create a data type MarkovChain to represent a Markov chain of strings.
The data type must have the following three public methods:
- void addTransition(String v, String w): add a transition from state
v to state w.
- String next(String v): pick a transition leaving state v
uniformly at random, and return the resulting state.
- String toString(): return a string representation of this Markov
chain.
Your implementation of MarkovChain will be similar to
Graph.java, except
that (i) the Markov chain is directed, (ii) it allows parallel arcs (more than
one edge between two vertices), and (iii) it's fine to assume the states
are strings (instead of generic items).
Accordingly, consider using a symbol table of random queues
(key = string, value = random queue of strings) instead of a
symbol table of sets (key = vertex, value = set of neighboring vertices).
Your RandomQueue data type must store items of a generic type
and have the following three public methods:
- void add(Item item): add the given item to the list.
- Item sample(): return a random item from the list.
- String toString(): return a string representation of
this queue.
Confused? Help is here.
Text generation.
Implement a client program TextGenerator.java that takes
two command line inputs k and M,
reads the text from standard input, builds the Markov chain associated
with the order k Markov model, and prints out M
pseudo-random characters by simulating a trajectory.
- Building the Markov chain.
Include a state for each of the distinct
k-grams that occur in the text.
Include exactly N transitions: one from each
k-gram in the text to the next overlapping
k-gram.
Treat the text as a cyclic string to handle the boundary cases.
For example, if k = 3
and the text consists of
bbbabbabbbbaba, then the first two transitions
are from bbb to bba and from bba to bab,
and the last (cyclic) one is from abb to bbb.
The full Markov chain for k = 3 is illustrated in
the figure from the previous section.
- Simulating a trajectory.
Let s be the string corresponding to the first k
characters of the original text.
Print out s and start the Markov chain at state s.
Now, simulate the Markov chain for M - k steps, following
a random transition, and printing out the last letter
of each resulting state.
For example, if k = 3 and M = 12, the following is a possible
trajectory leading to the output bbbbabbbbabb.
bbb -> bbb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> abb
bbb b a b b b b a b b
Amusement.
Once you get the program working, you will find that the
random text generated from low-order models starts to sound more and more
like the original text as you increase the order, as illustrated in
the examples below. There are many more apparent pearls of wisdom in
the full texts on the Web.
As you can see, there are limitless opportunities for amusement here.
Try your model on some of your own text, or find something interesting
on the net. You need not limit yourself to the English language.
Don't know how to start? Help is here.
Submission.
Submit MarkovChain.java, RandomQueue.java, TextGenerator.java, and readme.txt.
This assignment was developed by Bob Sedgewick and Kevin Wayne,
based on the classic idea of Claude Shannon.
Copyright © 2004.