Problem Set 2: Using Abstract Datatypes

Procedural Abstraction and Using Abstract Datatypes

Due: Thursday, 9 September (beginning of class)

Collaboration Policy – Read Carefully:
This problem set is partitioned into two parts. For the first part (questions 1-5), you should work alone and write up your own answers. You may discuss the questions with anyone (and we recommend discussing them with your partner before starting the second part). For the second part, you are expected to work with as assigned partner. See the post on the course blog for the partner assignments. You and your partner should turn in your answers to the first part separately, but should turn in one submission for the second part with both of your names on it. Both partners must understand everything you turn in and should participate fully in the assignment. As always, you should follow everything described in the course pledge you signed. In particular, you should not consult assignments from previous cs2220/cs205 courses.

Purpose

  • Learn to reason about and develop good procedural specifications.
  • Learn to use an abstract data type by reading its specification.

  • Think about how to test a procedure.

Reading: read Chapters 3 and 4, Chapter 5 through Section 5.2, Chapter 9, and Chapter 10 through Section 10.2.

This assignment is significantly longer than PS1 and you have over a week to complete it. Please don’t wait to get started. You are strongly encouraged to take advantage of the scheduled office and help hours for this assignment.

Specifying Procedures


It is necessary for technical reasons that these warheads be stored upside down, that is, with the top at the bottom and the bottom at the top. In order that there be no doubt as to which is the bottom and which is the top, for storage purposes, it will be seen that the bottom of each warhead has been labeled ‘TOP’.

Instructions accompanying a shipment of ballistic missiles from British Admiralty (reported in The Humus Report)

For the next two questions, consider these two specifications for the
public static void sort(int[ ] a) procedure:

Specification A. From the Java SE 6 Platform API documentation (java.util.Arrays):

Sorts the specified array of ints into ascending numerical order. The sorting algorithm is a tuned quicksort, adapted from Jon L. Bentley and M. Douglas McIlroy’s “Engineering a Sort Function”, Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November 1993). This algorithm offers n*log(n) performance on many data sets that cause other quicksorts to degrade to quadratic performance.

Parameters:
   a – the array to be sorted.

Specification B. From the textbook (p.46):

MODIFIES: a
EFFECTS: Rearranges the elements of a into ascending order.
   e.g., if a = [3, 1, 6, 1], a_post = [1, 1, 3, 6]

1. Describe at least three advantages of specification B over specification A.

2. Describe one scenario where specification A is more useful.

Bonus. Specification A includes this sentence that attempts to describe the running time of the algorithm:

This algorithm offers n*log(n) performance on many data sets…

Explain why this sentence makes no sense, and suggest a better alternate wording.

Consider the histogram prodecure defined (but not specified!) below:

public static int [] histogram (int [] a)
// unspecified
{
   int maxval = 0;
   for (int i = 0; i < a.length; i++) {
      if (a[i] > maxval) {
         maxval = a[i];
      }
   }
   int histo [] = new int [maxval + 1];
   for (int i = 0; i < a.length; i++) {
      histo[a[i]]++;
   }
   return histo;
}

For example,

   int [] test = { 1, 2, 3, 4, 6, 4, 3, 3, 0 } ;
   int [] hist = histogram (test);

the value of hist will be [1, 1, 1, 3, 2, 0, 1].

3. Write a complete, declarative specification of histogram. Your specification should be enough for a client to safely use the implementation provided above.
4. Write an alternative specification for the histogram procedure that places less burden on the client by using exceptions. What are the advantages and disadvantages of this specification compared to your answer to question 3?
5. Write an alternative specification for the histogram procedure that is total. A client should be able to pass any array of integers into this procedure and obtain a meaningful return value. What are the advantages and disadvantages of this specification compared to your answers to questions 3 and 4?
This is the end of Part I of this assignment. You should write up and submit your answers to Part I individually, but are encouraged to discuss them with your partner before starting Part II. For the rest of this assignment, you are expected to work with your assigned partner and submit a single assignment with both of your names on it. Both partners should participate fully and equally, and should understand everything you turn in.

Part II: Document Similarity Analysis

The second part of this assignment is quite challenging and leaves it up to you to figure out how to solve a problem. I want you to think about this yourselves, but once you do I am happy to provide hints if you are stuck.

For the second part of this problem set, you will build a program for finding similarities between documents. This might be used to try to determine authorship of an anonymous writing, detect instances of plagiarism, or identify unexpected influences. The basic approach we will use is to count the number of common short sequences of words in pairs of documents. Documents with more common sequences are likely to be related.

Problem 6. Write a program that takes as input a list of file names and outputs a list of pairs of files sorted by the number of 3-length sequences they have in common.

As an example, if you extract the three documents in this zip file and run your program with command line arguments declaration.txt constitution.txt universityact.txt, your program should output something like:

[constitution.txt - universityact.txt: 82, declaration.txt - constitution.txt: 33, declaration.txt - universityact.txt: 16]

(indicating that the US Constitution and Act founding the University of Virginia are more similar than the Declaration of Independence and the Act). Obviously, there are some flaws with this approach, in particular, not normalizing for the relative lengths of the documents. Ambitious students will attempt to improve the analysis to produce better results, but simple sequence counts are enough for full credit on this assignment.

We have provided several datatypes that you may find helpful in solving this problem. Their specifications are available here:

Specifications

Download ps2.jar and install it to incorporate these types in your project.

The first is the WordWindow datatype for representing a fixed-length sequence of n words. WordWindow provides methods for inserting a word, and for retrieving the current sequence of the last n words as a single String.

The TallyTable datatype provides an abstraction that maps a String to an integer value. It provides the void tally(String) method that increases the tally associated with its input string, as well as methods for getting the tally associated with a string and a list of all the strings in the TallyTable.

The Document datatype provides an abstraction that represents a document and its associated word sequence tally. It provides a constructor that takes a String representing a file name as its input, and constructs a Document object that represents the document in that file. It uses the WordWindow and TallyTable datatypes to do this:

	/**
	 * Creates a new document from the file identified by fname, using window size window.
	 * @param fname
	 * @param window
	 * @throws FileNotFoundException
	 */
	public Document(String fname, int window) throws FileNotFoundException {
		name = fname;
		Scanner s = new Scanner(new File(fname)).useDelimiter("[^A-Za-z]");
		WordWindow w = new WordWindow(window);
		tt = new TallyTable();

		while (s.hasNext()) {
			String word = s.next().toLowerCase();
			w.insert(word);
			tt.tally(w.currentWindow());
		}

		for (int i = 0; i < window; i++) {
			w.insert(null);
			tt.tally(w.currentWindow());
		}
	}

If you use the provided Document class, you should not need to use TallyTable or WordWindow directly in your implementation.

The LabeledGraph datatype provides an abstraction for a labeled, undirected graph. It provides methods for adding nodes and labeled edges to a graph, and ArrayList getSortedEdges() for getting a list of the edges in the graph sorted by their labels. The EdgeRecord datatype is a record type for describing an edge in a graph (two nodes and a label).

You should feel free to use (or not use) these datatypes as you see fit.

Problem 7. Are the provided datatypes good data abstractions? Explain what deficiencies they have, and suggest better alternatives.
This is the end of this assignment. Remember to write up and submit your answers to Part I individually, but you and your partner should submit a single document with your answers to problems 6 and 7 with both of your names on it. Include all the code you wrote for Problem 6. (You can answer Problem 7 as a clearly marked comment in your code if you wish.)

4 Responses to Problem Set 2: Using Abstract Datatypes

  1. evans says:

    An enterprising student asked:

    I have a question about what part 6 is asking us to do.

    Suppose file1 contains: This is the text.
    And file2 contains: This isn't the text.

    Both seem to match the following substrings: “Thi”, “his”, “is “, “s i”, ” is”, ” th”, “the”, “he “, “e t”, “te”, “tex”, “ext”, “xt.”

    However in the description, it says that the outputs should (or at least could) be in the double digits. In my example there are already 13 matches, so this seems far too low. Is my reasoning about the problem correct?

    Comparing texts at the character granularity would be a sensible thing to do (in some cases), but here we are breaking the text into “words”, where a word is a sequence of characters separated by a punctuation or whitespace marker.

    The good news is that we have provided the Document constructor to do the tallying for you, so you do not need to worry about this. You can use the Document constructor like this:

    Document d;
    try {
            d = new Document(file, window);
    } catch (FileNotFoundException fnfe) {
    	System.err.println("Error: cannot open file: " + file + " [" + fnfe + "]");
    }
    

    where file is a String that contains the name of the file to analyze and window is an integer representing the window size (e.g., 3).

    Here’s how I implemented the Document constructor:

    public Document(String fname, int window) throws FileNotFoundException {
    	name = fname;
    	Scanner s = new Scanner(new File(fname)).useDelimiter("[^A-Za-z]");
    	WordWindow w = new WordWindow(window);
    	tt = new TallyTable();
    
    	while (s.hasNext()) {
    		String word = s.next().toLowerCase();
    		w.insert(word);
    		tt.tally(w.currentWindow());
    	}
    
    	for (int i = 0; i < window; i++) {
    		w.insert(null);
    		tt.tally(w.currentWindow());
    	}
    }
    

    Because of the useDelimiter("[^A-Za-z]") it is break words where there is any non-alphabetic character.

  2. nmd3ey says:

    Professor,

    We are working on part II of ps2 and we believe we have the basic program down. When we’ve attempted to debug the program (manually), we found that the program cannot find the documents (constitution.txt, etc).

    Our first question is where should we save these files and as what format (should we keep the folder zipped or extract it)?
    We have presently saved the files in the same folder location as the .java class.

    Our second question is how should we tell the program to search for the documents?
    We are currently trying to run the location of file in the (x)=arguments of the configuration.

    Thank you for the help.

    • David Evans says:

      The Java runtime will look for the documents in the current working directory. In Eclipse, this is determined by the Working Directory setting. You can edit it by opening the Run Configurations dialog (select Run | Run Configurations), and then select the “Arguments” tab. The default value is probably set to the bin directory in your workspace. If you select Other you can select any directory you want as the working directory. This should be the directory where you extracted the test files.

      For the arguments, just list the filenames, separated by spaces.

  3. David Evans says:

    A student asks,

    Once we downloaded ps2.jar and installed it via the “add external JARs” method we used for PS1, how do we import all the classes?

    To use the provided classes, you need to import them into your program by adding the line,

    import ps2.*;
    

    to the beginning of your file.

    If you use the API types, you will need imports for those also:

    import java.io.FileNotFoundException;
    import java.util.ArrayList;
    import java.util.Set;