University of Virginia, Department of Computer Science
CS201J: Engineering Software, Fall 2002

Problem Set 6: Phylogeny Phrees
Out: 31 October 2002
Due: 12 November 2002

Purpose

In the first part of this assignment, you will do some exercises that develop your understanding of how memory is allocated and used in C programs. In the second part, you will use a lightweight analysis tool, Splint, to analyze an existing C program and add explicit memory management to it.

Collaboration Policy (same as PS2)

For this problem set, you may either work alone and turn in a problem set with just your name on it, or work with one other student in the class of your choice. If you work with a partner, you and your partner should turn in one assignment with both of your names on it.

Regardless of whether you work alone or with a partner, you are encouraged to discuss this assignment with other students in the class and ask and provide help in useful ways. You may consult any outside resources you wish including books, papers, web sites and people. If you use resources other than the class materials, indicate what you used along with your answer.

Phylogeny Revisited

Download: ps6.zip

Create a cs201j sub-directory in your home directory, and a ps6 subdirectory in that directory. Unzip ps6.zip in that subdirectory by executing unzip ps6.zip in a command shell.

If you are using the ITC lab machines, Splint is already installed in G:\apps\win32\java\cs201j\splint-3.0.1.6. If you are working from home, you will need to download and install Splint from http://www.splint.org.

A batch file that runs Splint from the command line is included in ps6.zip. To run it from within Visual Studio, select Tools | Customize, and add a new command to the command list named "splint". Enter the location of splint.bat from the downloaded zip file as the command to run, and select the Use Output Window box. You can then run Splint on your files directly from the Tools menu. If you choose to run Splint and Visual Studio on your own PC instead of in the lab, you may need to edit the batch file to include the correct directories for Splint and Visual Studio.

In Problem Set 4 you designed and wrote a program in Java that takes a list of species and genomes and produces a phylogeny tree that minimizes the number of mutations. This program used several abstractions and dynamic data structures.

In this assignment, you will analyze and improve upon a C implementation of this program. The C version uses data types that are similar to the ones used in the Java implementation provided in the solution to Problem Set 4.

The C program uses three abstract data types: Species, SpeciesSet, and SpeciesTree.

Each of these types has a set of C procedures that are used to create, access, manipulate, and destroy the data structure. Each data type has a corresponding header file (e.g., Species.h that defines the external interface and specifies its operations) and implementation file (e.g., Species.c that implements the datatype).

The program finds the most parsimonious phylogeny tree for a given SpeciesSet using a recursive algorithm based on the one used in Problem Set 4. The algorithm used in this assignment differs in one important respect: rather than creating a set of all possible trees and then selecting the best one, it evaluates the trees as it generates them and keeps only the best tree. This way it is not necessary to build a list of all possible trees, and no SpeciesTreeSet data structure is needed. The algorithm is implemented by the functions chooseRoot and findBestTreeRoot in Phylogeny.c.

Notice that in the process of determining which arrangement of Species results in the best phylogeny tree, the program builds many temporary SpeciesSet and SpeciesTree structures. However, each recursive call returns only one SpeciesTree object; the rest are no longer needed and could be discarded. The program as written does not discard them.
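The keep-only-the-best pattern described above, combined with discarding the candidates that lose, might be sketched as follows. This is a minimal illustration, not the code in Phylogeny.c: Candidate, Candidate_new, and find_best are hypothetical names, and a candidate here is just a heap-allocated cost rather than a real SpeciesTree.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a candidate tree: only its cost is stored. */
typedef struct {
  int cost;
} Candidate;

/* Allocates a candidate with the given cost; returns NULL on failure. */
static Candidate *Candidate_new (int cost)
{
  Candidate *c = (Candidate *) malloc (sizeof (*c));
  if (c != NULL) {
    c->cost = cost;
  }
  return c;
}

/* Builds one candidate per entry of costs, keeps the cheapest one, and
   frees every candidate that loses. The caller owns the result and must
   free it. Returns NULL if n is 0 or if an allocation fails. */
Candidate *find_best (const int *costs, size_t n)
{
  Candidate *best = NULL;
  size_t i;

  assert (costs != NULL || n == 0);

  for (i = 0; i < n; i++) {
    Candidate *c = Candidate_new (costs[i]);
    if (c == NULL) {
      free (best);            /* free (NULL) is a no-op */
      return NULL;
    }
    if (best == NULL || c->cost < best->cost) {
      free (best);            /* the previous best is no longer needed */
      best = c;
    } else {
      free (c);               /* this candidate lost; discard it now */
    }
  }
  return best;
}
```

At most two candidates are live at any moment in this sketch, regardless of how many are generated.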

1. Compile and run the C version of the program with several test inputs of different sizes. What is the largest number of species the program can handle? What happens if the program is given too many species?

2. Estimate how much memory the program would need to compute a tree of size 8. You will need to make approximations and simplifying assumptions in order to do this; state your assumptions.

Hints:

Data Abstraction in C

C, unlike Java, does not have explicit features for writing object-oriented programs or for enforcing abstraction: in particular, the language does not provide a way to specify that a datatype is abstract and limit where its representation can be manipulated. Despite this, if we design our programs well, we can get most of the benefits of data abstraction in C. Recall that our primary goal with data abstraction is to isolate the parts of the program that depend on how a datatype is implemented, and allow clients to manipulate that type only through abstract operations.

We can implement an abstract data type in C by separating the interface to the abstraction from the concrete representation of it. Client modules should access the data type only through defined accessor and mutator methods, and should not attempt to access or manipulate the data type's representation directly.

Splint is a lightweight analysis tool that can be used to detect abstraction violations in C programs. Splint allows abstractions to be specified using annotations, much like the way ESC/Java allows additional restrictions to be specified for Java programs.

The /*@abstract@*/ annotation, when included in a data type declaration (a C typedef), indicates that a particular data type should be treated as an abstract data type. This means that Splint will check that client modules do not access the representation of the type directly. For example, we declare the abstract datatype Species in Species.rh (we put this in a separate file instead of Species.h, since the client should not need to see it) using:

    struct Species_rep {
       /*@only@*/ const char *name;
       /*@only@*/ const char *genome;
    } ;

    typedef /*@abstract@*/ struct Species_rep *Species;

Functions in the Species implementation module are allowed to access the members in the concrete representation Species_rep directly, but outside functions may only refer to the abstract Species type and access it through functions defined in the Species module. If any outside module accesses the representation directly, Splint will notice this and print a warning.
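To make the distinction concrete, here is a single-file sketch of what an implementation module for a type like Species could look like. The function names Species_new, Species_getName, Species_getGenome, Species_free, and copy_string are assumptions made for illustration; the operations actually declared in Species.h may be named and specified differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Representation (in the problem set this lives in Species.rh). */
struct Species_rep {
  /*@only@*/ const char *name;
  /*@only@*/ const char *genome;
};

typedef /*@abstract@*/ struct Species_rep *Species;

/* Hypothetical helper: returns a heap copy of s, or NULL on failure. */
static char *copy_string (const char *s)
{
  char *copy = (char *) malloc (strlen (s) + 1);
  if (copy != NULL) {
    strcpy (copy, s);
  }
  return copy;
}

/* Hypothetical constructor: returns NULL if memory cannot be allocated. */
Species Species_new (const char *name, const char *genome)
{
  Species s;
  char *n;
  char *g;

  assert (name != NULL && genome != NULL);

  s = (Species) malloc (sizeof (*s));
  n = copy_string (name);
  g = copy_string (genome);
  if (s == NULL || n == NULL || g == NULL) {
    free (s);
    free (n);
    free (g);
    return NULL;
  }
  s->name = n;      /* direct field access is allowed inside the module */
  s->genome = g;
  return s;
}

/* Hypothetical accessors: the only way clients should read the fields. */
const char *Species_getName (Species s)
{
  assert (s != NULL);
  return s->name;
}

const char *Species_getGenome (Species s)
{
  assert (s != NULL);
  return s->genome;
}

/* Hypothetical destructor: releases the strings and the record itself. */
void Species_free (Species s)
{
  if (s != NULL) {
    free ((void *) s->name);
    free ((void *) s->genome);
    free (s);
  }
}
```

A client that writes s->name outside this module is touching the representation, which is the kind of access Splint reports. Calling an accessor such as Species_getName (s) instead keeps the client independent of how the type is stored.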

3. Run Splint to find code in Phylogeny.c that violates the Species data abstraction. For each kind of violation found, explain why it violates the abstraction, and change the code to fix the abstraction violation. When you are done, Splint should report no abstraction violations in the program. (Splint will still report some other problems, which you will deal with in the next part. If you run splint -nullpass -mustfree -branchstate Phylogeny.c after fixing the abstraction violations, no warnings should be reported.)

4. The SpeciesSet and SpeciesTree datatypes should also be abstract. Add /*@abstract@*/ annotations to their type definitions (in the SpeciesSet.rh and SpeciesTree.rh files). Use Splint to find code in Phylogeny.c that violates these data abstractions. For each kind of violation found, change the code to fix the abstraction violation. When you are done, Splint should report no abstraction violations in the program. (Splint will still report some other problems, which you will deal with in the next part. If you run splint -nullpass -mustfree -branchstate Phylogeny.c after fixing the abstraction violations, no warnings should be reported.)

Null Dereferences

In Java, if a program attempts to use a reference whose value is null, a NullPointerException is generated, and if this is not handled, the program will stop. C does not automatically check to make sure that NULL pointers are not dereferenced. If a program attempts to do this, it may crash, or it may corrupt data in some other part of the program's memory, resulting in a difficult-to-debug problem.

Splint can be used to detect and prevent possible NULL pointer dereferences in C programs. The /*@null@*/ and /*@notnull@*/ annotations are used to indicate whether or not it is possible for a particular variable or function to have a value of NULL. For example, the standard library defines a function fopen that opens and returns a file. If the file cannot be opened, fopen returns NULL. (Since C does not have exceptions, it cannot throw an exception as would be done in Java.) The Splint library annotates fopen as:

    /*@null@*/ /*@dependent@*/ FILE *fopen (char *filename, char *mode)
       /*@modifies fileSystem@*/ ;

The methods implementing the Species, SpeciesSet, and SpeciesTree data types have already been annotated, but Phylogeny.c contains at least one error.
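Because fopen is annotated with /*@null@*/, a caller is expected to test the result before using it. The following sketch shows one way to do that; count_lines is a hypothetical helper invented for this example and is not part of the problem set code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts the newline characters in a file.
   Returns -1, after printing a message, if the file cannot be opened. */
long count_lines (const char *filename)
{
  FILE *f;
  long lines = 0;
  int c;

  assert (filename != NULL);

  f = fopen (filename, "r");
  if (f == NULL) {
    /* Without this check, fgetc below would dereference NULL. */
    fprintf (stderr, "Cannot open file: %s\n", filename);
    return -1;
  }

  while ((c = fgetc (f)) != EOF) {
    if (c == '\n') {
      lines++;
    }
  }
  (void) fclose (f);
  return lines;
}
```

The important point is the shape of the code: every path that uses f comes after the comparison with NULL, so the failure case is reported instead of causing a crash.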

5. Identify a possible NULL pointer dereference in Phylogeny.c. (If you run Splint with the -mustfree -branchstate flags, the other warnings will not be reported.) Run the program in a way that reveals the problem. Fix the code, so the program exits gracefully with an appropriate error message instead.

Explicit Memory Management and Memory Leaks

In Java, all objects are created explicitly using the new operator. It is not necessary to explicitly destroy objects in Java when they are no longer needed because the Java Virtual Machine employs a garbage collector that automatically finds unused objects in memory and destroys them, allowing the memory they occupied to be reused.

In C, memory must be allocated and reclaimed explicitly. It is up to the programmer to make sure to correctly allocate and reclaim memory. Memory can be allocated using the malloc library function, and released using free:

/*@null@*/ /*@out@*/ /*@only@*/ void *malloc (size_t size) /*@*/ ;

void free (/*@null@*/ /*@out@*/ /*@only@*/ void *p) /*@modifies p@*/ ;
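As a small illustration of how these two declarations pair up, the sketch below allocates a string in one function and leaves the obligation to release it with the caller. The function repeat_char is a hypothetical example, and the annotations on it are written by analogy with the malloc declaration above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: returns a newly allocated string holding n
   copies of c, or NULL if memory cannot be allocated. The result is
   the only reference to the storage, so the caller must free it. */
/*@null@*/ /*@only@*/ char *repeat_char (char c, size_t n)
{
  char *s;

  assert (n < (size_t) -1);   /* n + 1 must not overflow */

  s = (char *) malloc (n + 1);
  if (s == NULL) {
    return NULL;              /* malloc may fail; report it to the caller */
  }
  memset (s, c, n);
  s[n] = '\0';
  return s;
}
```

A caller would check the result against NULL, use the string, and then pass it to free exactly once.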

When a C program fails to deallocate a data structure that is no longer needed, this is called a memory leak. Even though the data structure is no longer needed and will not be referred to again, the memory occupied by that data structure is marked as used, and the computer will not re-use it unless it is specifically told to reclaim it. A program that leaks memory will eventually run out of memory and fail.

The C version of the phylogeny program does not free any of the data structures it allocates, even when the data structures are no longer needed and will not be referred to again. This causes the program to run out of memory. The program could handle larger trees if it re-used this memory.

Using Splint to Find Memory Leaks

Splint can be used to find memory leaks in C programs. When it is run on a C program, it analyzes the flow of a program, and notices when a dynamically allocated data structure goes out of scope without being reclaimed. Because these situations indicate likely memory leaks, Splint prints a warning.

Consider the following (simplistic) program with a memory leak:

#include <stdio.h>
#include <stdlib.h>
int main (void) {
   int i = 0;

   while (1) {
     char *s = (char *) malloc (sizeof (char) * 1000);
     i++;
     printf ("Iteration: %d\n", i);
   }
}

When we run Splint on this program, we get the following warning:

Splint 3.0.1.6 --- 11 Jun 2002

leak.c: (in function main)
leak.c:10:5: Fresh storage s not released before scope exit
  A memory leak has been detected. Storage allocated locally is not released
  before the last reference to it is lost. (Use -mustfreefresh to inhibit
  warning)
   leak.c:7:55: Fresh storage s allocated

This tells us that the memory for s, allocated on line 7, is not released when the scope exits (line 10).

The storage returned by malloc is annotated with /*@only@*/. Hence, the call site becomes the only reference to this storage. At the call site, the returned storage is assigned to s. This transfers the ownership obligation of the storage to the local variable, s. When the scope exits, there is no way to use s anymore. This is a memory leak, since the obligation to transfer ownership of the storage was not satisfied.

We can fix the leak by inserting a call to free. The parameter of free is annotated with only, so the reference passed to free is now owned by the called function (which returns the storage to the system).
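A version of the loop with the call to free inserted might look like the sketch below. So that it terminates and can be tested, the loop is wrapped in a function with an iteration count; the name run_iterations and its count parameter are assumptions added for this example and do not appear in the original program.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Bounded variant of the leaking loop. Returns the number of
   iterations completed. */
int run_iterations (int count)
{
  int i = 0;

  assert (count >= 0);

  while (i < count) {
    char *s = (char *) malloc (sizeof (char) * 1000);
    if (s == NULL) {
      fprintf (stderr, "Out of memory\n");
      return i;
    }
    i++;
    printf ("Iteration: %d\n", i);
    free (s);   /* ownership of the storage passes to free here */
  }
  return i;
}
```

Each pass through the loop now releases its storage before s goes out of scope, so the amount of memory in use stays constant no matter how many iterations run.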

The same technique can be used to find and correct the memory leaks in the Phylogeny program. More details on how Splint can be used to analyze a program's memory management can be found in Chapter 5 of the Splint Manual.

6. Use Splint to find and correct memory leaks in Phylogeny.c. You will need to insert calls to SpeciesSet_free and SpeciesTree_free in appropriate places and add a few annotations to do this, but you will not need to modify any file other than Phylogeny.c. You are not done until Splint reports no remaining memory leaks in the program.

7. Test the program again on the same instances you used for Question 1. What is the largest instance the program can solve now? (It may take up to half an hour to run an instance with 9 species on a fast computer, but the program should not run out of memory if all of the leaks are removed.)

Credits: This problem set was developed by Joel Winstead and David Evans for UVA CS 201J Fall 2002.


Sponsored by the
National Science Foundation
cs201j-staff@cs.virginia.edu