CS201J: Problem Set 4: Clandestine Cladistics

University of Virginia, Department of Computer Science
CS201J: Engineering Software, Fall 2002

Problem Set 4: Tree of Life (Designing with ADTs) Comments

Honor Code Reminder: If you are currently taking CS201J, it is a violation of the course pledge to look at Problem Set answers and discussions from previous years. Please don't do it.

General Comments

Design. The groups with simple designs were generally able to implement their design successfully. This was a complicated enough problem that groups that didn't start with a clear design were not able to get their code working, no matter how clever they were.
Dependencies. Devising an implementation and testing strategy that allowed you to build individual parts and know they work was very important. A few groups attempted to test their complex produces for calculating all possible trees before testing the set datatypes it depends on. This may seem like a good way to save time, but its really risky. A small problem in your set implementation can make debugging all possible trees procedures a nightmare.
Representations. Almost every group presented rep invariants and abstraction functions without first defining your rep. It doesn't make any sense to include a rep invariant or abstraction function in a specification. Its part of the implementation — it only makes sense to place constraints on the rep, or map it to an abstract notation, once you have decided what the rep is.
Rep Exposure. Many groups implemented a Tree type like this:
public class Tree {
   // OVERVIEW: A Tree is a mutable type that represents a tree.  
   //    A typical Tree is < root, left, right > where root is
   //    a species and left and right are trees or null.
   
   /*@non_null@*/ private Species root;
   private Tree left;
   private Tree right;

   public Tree (/*@non_null@*/ Species s, Tree l, Tree r) {
      root = s; left = l; right = r;
   }

   public Tree getLeftChild () {
      return left;
   }
}
This exposes the rep. The client can call getLeftChild to get access to a mutable part of the rep. A client who constructs a tree using Tree (s, l, r) has access to s, l and r in the rep. The best solution to this is to make Tree an immutable type. This means it is okay for the client to have access to the Tree components of the rep, since they cannot be modified. Further, we need to make Species immutable also.
ESC/Java. Many groups assigned one person to doing the ESC/Java checking, or left this until the end. This isn't a good strategy. If you are using ESC/Java effectively, it will make it easier to develop your code. You should be annotating your code as you write it, not waiting until you are done to do this. Thinking about and documenting the assumptions you are making while you write the code should help you avoid lots of wasted debugging time. Annotating your rep invariants should help you check you understand what they are, and implement the code in a way that preserves the rep invariant.
Modular Dependency Diagram

My code for PS4 is here [zip file]:

Species.java - immutable record type for representing a species name and genome
Genome.java - immutable type for representing a sequence of base pairs (or features)
Includes the mutationDistance method
SpeciesTree.java - immutable type for representing a tree of Species
SpeciesSet.java - mutable type for representing a set of Species
SpeciesSet.java - mutable type for representing a set of Species objects
SpeciesTreeSet.java - mutable type for representing a set of SpeciesTree objects
AllTrees.java - non-ADT class that provides the allTreesRoot method
Phylogeny.java - defines main
The trickiest part is AllTrees, which is shown below:
import java.util.*; import java.math.BigInteger; class Split { // OVERVIEW: Record type for returning a split. public /*@non_null@*/ SpeciesSet left; public /*@non_null@*/ SpeciesSet right; public Split (/*@non_null@*/ SpeciesSet left, /*@non_null@*/ SpeciesSet right) { this.left = left; this.right = right; } public String toString () { return "< " + left.toString () + ", " + right.toString () + ">"; } } public class AllTrees { // OVERVIEW: AllTrees provides the allTreesRoot method that returns the set of all possible // trees that can be produced from s with a given root. private static Split [] allSplits (SpeciesSet s) //@ensures \nonnullelements (\result) // EFFECTS: Returns an array of splits showing all possible ways of splitting the s into two groups. // e.g., allSplits ({a, b, c}) = [<{a, b, c}, {}>, <{a, b}, {c}>, <{a}, {b, c}>, <{a, c}, {b}>] { if (s == null) { return null; } // There are 2^n groups, we can enumerate them using the bits BigInteger numgroups = BigInteger.valueOf (2).pow (s.size ()); //@nowarn Null ; //@assume numgroups != null; // ESC/Java spec for BigInteger is missing, need to assume. int numiters = numgroups.intValue () / 2; //@assume numiters >= 0; Split [] result = new Split [numiters]; Species [] elements = s.toArray (); BigInteger b = BigInteger.ZERO; //@assume b != null; for (int i = 0; i < numiters; i++) { SpeciesSet left = new SpeciesSet (); SpeciesSet right = new SpeciesSet (); for (int bitno = 0; bitno < numgroups.bitLength () - 1; bitno++) { //@assume bitno < elements.length ; if (b.bitLength () >= bitno && b.testBit (bitno)) { right.insert (elements [bitno]); } else { left.insert (elements [bitno]); } } result[i] = new Split (left, right); //@nowarn IndexNegative, IndexTooBig b = b.add (BigInteger.ONE); //@assume b != null } return result; } //@nowarn Post // can't establish no null elements postcondition public static /*@non_null@*/ SpeciesTreeSet allTreesRoot (SpeciesSet s, /*@non_null@*/ Species r) { // EFFECTS: Returns the set of all trees with root r, and leaves s. SpeciesTreeSet res = new SpeciesTreeSet (); if (s == null || s.isEmpty ()) { res.insert (new SpeciesTree (r, null, null)); return res; } Split [] splits = allSplits (s); if (splits == null) { System.err.println ("BUG: no splits (shouldn't happen when s is non-empty)"); System.exit (1); } for (int i = 0; i < splits.length; i++) { Split current = splits[i]; SpeciesTreeSet ltrees = allTrees (current.left); SpeciesTreeSet rtrees = allTrees (current.right); Enumeration liter = ltrees.elements (); if (!ltrees.isEmpty ()) { while (liter.hasMoreElements ()) { SpeciesTree ltree = (SpeciesTree) liter.nextElement (); if (!rtrees.isEmpty ()) { Enumeration riter = rtrees.elements (); while (riter.hasMoreElements ()) { SpeciesTree rtree = (SpeciesTree) riter.nextElement (); res.insert (new SpeciesTree (r, ltree, rtree)); } } else { res.insert (new SpeciesTree (r, ltree, null)); } } } else { System.err.println ("BUG: left trees cannot be empty!"); System.exit (1); } } return res; } private static /*@non_null@*/ SpeciesTreeSet allTrees (SpeciesSet s) { // EFFECTS: Returns the set of all possible trees that can be produced from s. if (s == null || s.isEmpty ()) { return new SpeciesTreeSet (); } SpeciesTreeSet res = new SpeciesTreeSet (); Species [] elements = s.toArray (); if (elements.length == 1) { res.insert (new SpeciesTree (elements[0])); } else { for (int i = 0; i < elements.length; i++) { // Create a new set with all elements except this one. SpeciesSet rest = new SpeciesSet (s); rest.remove (elements[i]); res.union (allTreesRoot (rest, elements[i])); } } return res; } }

Testing Strategy
A good testing strategy would describe both unit testing and integration testing.
For unit testing, you should take advantage of your desing to implement in stages that can be tested independently. For example, with my design, you might first implement Species and Genome without the mutationCount methods. Someone else would implement the basic set datatypes and make SpeciesSet. You could then test SpeciesSet. The AllTrees modules can be broken into AllPossibleDivisions and AllPossibleTrees. We can test AllPossibleDivisions independently from the other classes once we have any set datatype. A good test driver would test it by checking the size of the set returned, and checking if some particular element is in that set.
The SpeciesTree can be unit tested with some simple cases (no chlidren, one child, two children), and printing should be tested for a few trees. The AllPossibleTrees module will be tested on some simple cases by hand, but mostly tested using our integration tests.
The integration tests should consider all the boundary cases for running the program. In particular, we should consider bad input (wrong number of arguments, bad files) as well as good input. It is hard to tell for an arbitrary input set if the tree produced is the most parsimonious. Hence, we should construct some test cases where it is easy to tell if the result is correct. For example, an input where each genome differs by one from the one above it, and an input where each genome is identical.
The tests I ran to evaluate your programs were:
java Phylogeny cat.spc
java Phylogeny aphid.spc
java Phylogeny
java Phylogeny empty.spc
java Phylogeny badformat.spc
java Phylogeny duplicates.spc
java Phylogeny identical.spc
java Phylogeny chain.spc
java Phylogeny binary.spc
Grading

The problem set was graded with 25 points for design (average 22), 25 points for implementation (average 20), 20 points for testing strategy (average 14), and 30 points for the test results (average 14). This assignment was hard enough that you should be too worried if you were not able to get things to work, and as a result got a low score on the test results. The main point of this assignment was to learn about design. Don't get so focused on making the code work that you forget to document your testing strategy.
Four of the groups were able to get almost all the tests to work (which I think is very impressive!). No group got absolutely all of them; both groups that came closest could have found the problems revealed by the tests using ESC/Java.

University of Virginia
Department of Computer Science
CS 201J: Engineering Software

Sponsored by the
National Science Foundation

cs201j-staff@cs.virginia.edu