ulrail.gif

Homework 6 Checklist

  ur-2c.gif urrail.gif
blrail-nonavbar.gif

Home | Resources | Assignments | Exams
Lectures | Course Staff | Submit

br-2a.gif   br-2c.gif brrail.gif
spacer.gif

spacer.gif

Frequently Asked Questions

What are the main goals of this assignment? Use a graph-like data structure, learn about natural language processing, and learn about Markov chains.

What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.

How do I read in characters from standard input? Use StdIn.readAll().

Given a string s, is there an efficient way to get a substring? Use s.substring(i, i+k) to get the k-character substring starting at i. In Java, this operation takes constant time, regardless of the size of k. This efficient construction is possible because strings are immutable, which enables the library to reuse the relevant part of the original string's character array. See Section 3.1 for more information on strings.

Why treat the text as a cyclic string? This prevents the Markov chain from ever getting "stuck", i.e., entering a state with no outgoing transitions.

How do I emulate the behavior of a cyclic string? There are a number of approaches. The easiest might be to concatenate the first k characters to the end of the text.

Should my program generate a different output each time I run it? Yes.

My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected.

For which values of k should my program work? It's fine to assume k > 0. Naturally, as k gets larger, your program will use more memory and take longer. Also, you can assume k <= M.

I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the 64MB default.

java -Xmx100m TextGenerator 7 1000 < input.txt
The 100m means 100MB, and you should adjust this number depending on the size of the input text.

Why not store the probability of each transition instead of using parallel edges? It's easier to implement this way. Storing the probabilities would be more efficient and more general, but it's not necessary on this assignment.

Submission

readme.txt. Use the following readme file template.

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download the directory markov (available also as a .zip file). It contains an implemention of Graph.java, ST.java, SET.java, and Queue.java which you can use as a starting point.

  2. Rename Queue.java to a file RandomQueue.java and modify it to model a random queue.
    • Rename enqueue() to add().
    • Delete dequeue().
    • Include a method sample() that returns a random item of the queue (or null if the queue is empty).
    • Test out your routine on inputs of size 0, 1, and 2. Make sure that it returns each element with equal probability. Use TestRandomQueue.java to help you test.

  3. Rename Graph to MarkovChain and modify it to model a Markov chain.
    • Replace SET with RandomQueue to allow duplicates.
    • Replace addEdge() with addTransition(). Modify it so that it models a directed edge (instead of an undirected one).
    • Replace addVertex() with addState().
    • Delete the method adjacentTo() and add the method next().
    • Be sure to thoroughly test your modifications. The last thing you want is to have an undiscovered bug in MarkovChain when you're writing the client. Use TestMarkovChain.java to help you test. It implements the example in the assignment description.

  4. Write the client program TextGenerator.java in small steps.
    • Read in the data and print it back out.
    • Next, add a command line parameter k, and print out all substrings of length k.
    • Next print out all pairs of adjacent k-grams instead, and make it work with cyclic wrap-around.
    • Insert these pairs into the Markov chain and print out the Markov chain.
    • Remove all of the debugging print statements and simulate a random trajectory through the Markov chain.
    • Thoroughly test and debug each piece as soon as you write it.

Enrichment

  • Here's a website that generates pseduo-random computer science papers. It uses something called a context free grammar instead of a Markov chain, but otherwise is similar in spirit to what you are doing on this assignment.

  • What can I do with my random text? CNN reports that one former Princeton student read his output at the Frist filibuster.
spacer.gif
spacer.gif footer-middle.gif spacer.gif