University of Virginia, Department of Computer Science CS201J: Engineering Software, Fall 2002

 Honor Code Reminder: If you are currently taking CS201J, it is a violation of the course pledge to look at Problem Set answers and discussions from previous years. Please don't do it.

 Problem Set 6: Phylogeny Phrees - Comments

### Phylogeny Revisited

Average: 54.1 out of 70. To get a Gold Star you needed 63 or higher.

1. (Average: 5 out of 5) Compile and run the C version of the program with several test inputs of different sizes. What is the largest number of species the program can handle? What happens if the program is given too many species?

Results varied depending on how much memory the machine you tried this on had, and what other programs were running at the time. For most of the ITC machines, it would work with 7 species, but run out of memory with 8 species. When it runs out of memory, you would get a error message like Unable to create new SpeciesSet. Out of memory in SpeciesSet_new after one of the calls to malloc returns NULL. Note that if we hadn't put the checks in the code after calls to malloc, the program would probably just crash without explanation. You may have also got a Windows pop-up warning about "Virtual Memory Minimum Too Low" as the process came close to running out of memory.

2. (7.6 / 10) Estimate how much memory the program would need to compute a tree of size 8. You will need to make approximations and simplifying assumptions in order to do this; state your assumptions.

• Each SpeciesTree_rep node takes 12 bytes of memory: there are three pointers, each requiring 4 bytes. Note that the size of the species names and genomes doesn't matter much. Only one copy of each name and genome is stored, and each tree node will point to the same Species object.
• We assume that a typical SpeciesTree has eight nodes, consuming 8 * 12 = 96 bytes of memory.
• From problem set 4, there are 168,210 possible eight-node trees. If each of these allocates a tree, 168,210 * 8 = 1,345,680 SpeciesTree_rep nodes will be created. This is a conservative estimate. Although there are only 168,210 different trees when we ignore equivalent but differently-ordered trees, the algorithm is inefficient and actually examines more trees than this because it does not consider symmetry. At each level of the search, it examines twice as many trees as necessary. Because a typical tree is log(8) = 3 levels deep, this means that the algorithm examines more trees than needed by a factor of 23 = 8. This would require create 1,345,680 * 8 = 10,765,400 SpeciesTree_rep nodes. Each of these nodes occupies 12 bytes, requiring 129,185,280 bytes of memory total.
• Even this is a conservative estimate, because it considers the memory occupied by SpeciesTree nodes only. The algorithm also constructs three SpeciesSet objects for each tree node, and these also occupy memory. Many of these SpeciesSets will be empty, because they occur in leaf nodes, but the others may contain up to eight s_SpeciesSetNode structures. We'll assume the average SpeciesSet contains one node, requiring 4 bytes for the s_SpeciesSet_rep and 8 bytes for the s_SpeciesSetNode. The SpeciesSets would then occupy an additional 10,765,400 * 3 * 12 = 387,554,400 bytes of memory.
All together, the program would require roughly 516 Mb of memory just to store the SpeciesTree and SpeciesSet objects created by the algorithm. The C library will use additional memory to keep track of what parts of memory are occupied, and the operating system and the program's code itself require additional memory.

3. (9.4 / 10) Run Splint to find code in Phylogeny.c that violates the Species data abstraction. For each kind of violation found, explain why it violates the abstraction, and change the code to fix the abstraction violation. When you are done, Splint should report no abstraction violations in the program. (Splint will still report some other problems, which you will deal with in the next part. If you run splint -nullpass -mustfree -branchstate Phylogeny.c after fixing the abstraction violations, no warnings should be reported.)

In several places, the original program accesses members of the struct that represents a Species object directly:
```score = parsimonyValue (tree) +
mutationDistance (root->genome, grandparent->genome);
```
Splint reports these abstraction violations as arrow accesses of non-pointers:
```Phylogeny.c: (in function chooseRoot)
Phylogeny.c:146:29: Arrow access of non-pointer (Species): root->genome
Phylogeny.c:146:50: Arrow access of non-pointer (Species): grandparent->genome
```
This is an abstraction violation because the client should not access the representation directly or assume anything about its structure. We change the code to access the object through the abstract type's observer methods instead:
```score = parsimonyValue (tree) +
mutationDistance (Species_getGenome(root),
Species_getGenome(grandparent));
```

4. (8.4 / 10) The SpeciesSet and SpeciesTree datatype should also be abstract. Add /*@abstract@*/ annotations to their type definitions (in the SpeciesSet.rh and SpeciesTree.rh files). Use Splint to find code in Phylogeny.c that violates these data abstractions. For each kind of violation found, change the code to fix the abstraction violation. When you are done, Splint should report no abstraction violations in the program. (Splint will still report some other problems, which you will deal with in the next part. If you run splint -nullpass -mustfree -branchstate Phylogeny.c after fixing the abstraction violations, no warnings should be reported.)

The original program also accessed the representation of the SpeciesTree type directly:
``` 45 SpeciesTree left = tree->left;
```
```Phylogeny.c: (in function parsimonyValue)
Phylogeny.c:45:26: Arrow access of non-pointer (SpeciesTree): tree->left
Types are incompatible. (Use -type to inhibit warning)
```
The correct way to use the abstraction is to use an observer method:
```    SpeciesTree left = SpeciesTree_getLeft(tree);
```

Another kind of violation found later on in parsimonyValue compares a SpeciesTree pointer to NULL directly:

``` 48 if (left != NULL) {
...
```
```Phylogeny.c:48:7: Left operand of != is abstract type (SpeciesTree):
left != NULL
An abstraction barrier is broken. If necessary, use /*@access @*/ to
```
This assumes that an empty SpeciesTree is represented by a NULL pointer, which might not be the case. A correct program would not assume this and would use an accessor method instead:
```    if (!SpeciesTree_isNull (left)) {
...
```

5. (8.8 / 10) Identify a possible NULL pointer dereference in Phylogeny.c. (If you run Splint with the -mustfree -branchstate flags, the other warnings will not be reported.) Run the program in a way that reveals the problem. Fix the code, so the program exits gracefully with an appropriate error message instead.

In the main procedure, which processes the input file, there is a potential NULL pointer dereference:
```165 FILE *f;
...
173 f = fopen(argv[1], "rt");
174
175 while (fgets (line, MAX_LINE_LENGTH, f)) {
...
```
```Phylogeny.c:175:31: Possibly null storage f passed as non-null param:
fgets (..., f)
A possibly null pointer is passed as a parameter corresponding to a formal
parameter with no /*@null@*/ annotation.  If NULL may be used for this
parameter, add a /*@null@*/ annotation to the function parameter declaration.
(Use -nullpass to inhibit warning)
Phylogeny.c:173:9: Storage f may become null
```
The problem here is that if fopen is unable to open the specified file, it will return a NULL pointer to indicate the failure. If this pointer is then used by fgets to read data from the file, a NULL pointer dereference will occur. Splint detects this because its specification for fopen indicates that a NULL pointer could be returned, and its specification for fgets indicates that it expects a non-null parameter:
```/*@null@*/ /*@dependent@*/ FILE *fopen (char *filename, char *mode)
/*@modifies fileSystem@*/ ;

/*@null@*/ char *
fgets (/*@returned@*/ /*@out@*/ char *s, int n, FILE *stream)
/*@modifies fileSystem, *s, *stream, errno@*/
/*@requires maxSet(s) >= (n -1); @*/
;
```

The solution is to check to see if the pointer is NULL, and if it is, print an appropriate error message and exit:

```    f = fopen (argv[1],"rt");
if (f == NULL) {
fprintf (stderr, "Cannot open input file for reading: %s\n",argv[1]);
exit (EXIT_FAILURE);
}

while (fgets (line, MAX_LINE_LENGTH, f)) {
...
```

6. (12.5 / 20) Use Splint to find and correct memory leaks in Phylogeny.c. You will need to insert calls to SpeciesSet_free and SpeciesTree_free in appropriate places and add a few annotations to do this, but you will not need to modify any file other than phylogeny.c. You are not done until Splint reports no remaining memory leaks in the program.

There were several memory leaks in the program. Some were difficult to find, even using Splint. The corrected version of Phylogeny.c fixes all of the memory leaks.

One pair of memory leaks that was particularly difficult to find was in chooseRoot:

```101 if (score < best_score) {
102     best_tree = tree;
103     best_score = score;
104 }
```
This code fragment checks to see if the candidate tree is the best tree seen so far, and if it is, replaces the best_tree reference with a reference to the candidate tree.

There are two problems with this fragment:

1. If the branch is taken, the reference to the old value of best_tree is lost when the assignment is executed. Because there is no other reference to the best tree, and it is not released first, a memory leak occurs.
2. If the branch is not taken, the reference to the candidate tree is no longer needed. Because the reference is not assigned to best_tree or otherwise saved, the last reference to it will go out of scope without being released, causing a memory leak.
The solution to this is to release best_tree before the assignment if the branch is taken, and to release the candidate tree if the branch is not taken:
```    if (score < best_score) {
/* We must release the old best_tree before we assign
* something else to it (and lose the reference */
SpeciesTree_free(best_tree);
best_score = score;
best_tree = tree;
tree = SpeciesTree_NULL;
} else {
/* Otherwise, we must free the temporary tree */
SpeciesTree_free(tree);
tree = SpeciesTree_NULL;
}
```

A similar code fragment appears in findBestTreeRoot, and the same problem occurs there.

Another tricky one to solve occurs in main where the program attempts to print each Species as it reads it:

```244 Species snew = Species_new (name, genome);
245 fprintf (stderr, "Species: %s\n", Species_toString(snew));
246 Species_markOwned (snew);
247 SpeciesSet_insert (set, snew);
```
```Phylogeny.c:245:37: New fresh storage (type char *) passed as implicitly temp
(not released): Species_toString(snew)
```
The problem is that Species_toString returns an /*@only@*/ pointer, carrying with it the responsibility to release it, but the result is passed to fprintf which will not do so. The solution requires creating a temporary variable to hold the return value of Species_toString long enough to print it and release it:
```    Species snew = Species_new (name, genome);
char *str = Species_toString(snew);
fprintf (stderr, "Species: %s\n", str);
free(str);
Species_markOwned (snew);
SpeciesSet_insert (set, snew);
```

There were several other memory leaks that generally involved temporary storage that was allocated but not released. Most of these problems could be solved by adding a call to the appropriate version of free at the point where the pointer in question goes out of scope. See Phylogeny.c for all the changes we made.

7. (4.4 / 5) Test the program again on the same instances you used for Question 1. What is the largest instance the program can solve now? (It may take up to half an hour to run an instance with 9 species on a fast computer, but the program should not run out of memory if all of the leaks are removed.)

On a 128Mb machine, the corrected program will successfully complete an instance with 9 species without running out of memory. Beyond that, the main issue is time: the brute-force algorithm used does not scale well to large trees, and would take prohibitively long to execute.

On a 450MHz machine with 128Mb of memory that we tested the program on:

Instance SizeTime to Complete
60.25 seconds
74 seconds
81.5 minutes
929 minutes

For no instance did the corrected program use more than 500kb of memory.