Problem Set 3 Comments

1. Define a method, equal in the Tree class that takes a tree as its parameter, and evaluates to true if and only if the input tree is equal to self. Two trees are considered equal if they have the same branching structure and the values of every node are the same (== comparison) in both trees. Your definition should be recursive (it cannot use any looping control structure such as for or while).

Because we have to deal with None, the code is fairly awkward and complex:

def equal(self, t):
   if not self.__value == t.__value:
       return False

   if self.getLeft () == None:
       if not t.getLeft () == None:
           return False
   else:
       if t.getLeft () == None:
           return False
       if not self.getLeft ().equal (t.getLeft ()):
           return False

   if self.getRight () == None:
       if not t.getRight () == None:
           return False
   else:
       if t.getRight () == None:
           return False
       if not self.getRight ().equal (t.getRight ()):
           return False

   return True
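
The None handling above can be pushed into a small helper so that equal itself stays short. The following is only a sketch: the Tree class here is a minimal hypothetical stand-in (the course's actual class is not shown), and subtreesEqual is a helper name introduced for illustration.

```python
class Tree(object):
    # Hypothetical minimal stand-in for the course's Tree class.
    def __init__(self, value, left=None, right=None):
        self.__value = value
        self.__left = left
        self.__right = right

    def getValue(self):
        return self.__value

    def getLeft(self):
        return self.__left

    def getRight(self):
        return self.__right

    def equal(self, t):
        # Compare this node, then delegate both child comparisons to the helper.
        return (self.__value == t.getValue()
                and subtreesEqual(self.__left, t.getLeft())
                and subtreesEqual(self.__right, t.getRight()))


def subtreesEqual(a, b):
    # Hypothetical helper: two missing children match; one missing child does not.
    if a is None or b is None:
        return a is None and b is None
    return a.equal(b)


same = Tree(1, Tree(2), Tree(3)).equal(Tree(1, Tree(2), Tree(3)))
shapeDiffers = Tree(1, Tree(2)).equal(Tree(1, None, Tree(2)))
valueDiffers = Tree(1, Tree(2)).equal(Tree(1, Tree(5)))
```

The definition is still recursive (equal and subtreesEqual call each other) and uses no loops, so it should satisfy the same constraints as the longer version.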

2. For each of the subquestions, express your answer as an asymptotically tight (Θ) bound and briefly justify your answer. Use N to represent the number of nodes in the input tree. Assume the Python interpreter implements procedure calls in a straightforward way (that is, it does not do any transformations to optimize tail recursive calls).

What is the worst case running time of your equal method?
Θ(N) — the worst case is when the trees are equal (or the only difference is in the rightmost leaf) and every node must be compared. This will involve N calls to the equal method. The work for each call is constant. It involves lots of comparisons, but no work that scales with the tree size.
What is the best case running time of your equal method?
Θ(1): if the root nodes are unequal, only one comparison is needed (to reach the first return False), so the running time is constant and does not scale with the size of the tree.
What is the worst case space usage of your equal method?
The space for each call is constant (no local variables are used), so the space scales with the number of calls that might be on the stack. The worst case occurs when the tree is completely unbalanced, so its height is N. In this case, we could have N recursive calls active as we walk down the tree, so the worst case space usage is Θ(N).
What is the worst case space usage of your equal method if the input trees are both well-balanced?
If the trees are well balanced, the height of each tree is Θ(log N), so the maximum recursive depth is Θ(log N), and the worst case space usage (for balanced trees) is Θ(log N).
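
The two space bounds can be checked on small inputs by tracking how deep the recursion goes. This is a rough sketch under stated assumptions: Node, chain, balanced, and equalDepth are hypothetical names (not part of the course code), and equalDepth is a function over plain nodes rather than a Tree method.

```python
class Node(object):
    # Hypothetical minimal tree node (not the course's Tree class).
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right


def chain(n):
    # Completely unbalanced tree: n nodes, each with only a right child.
    t = None
    for i in range(n):
        t = Node(i, None, t)
    return t


def balanced(lo, hi):
    # Balanced tree holding the values lo .. hi-1.
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(mid, balanced(lo, mid), balanced(mid + 1, hi))


def equalDepth(a, b, depth=1):
    # Same comparison as equal, but returns (result, deepest node depth reached).
    if a is None or b is None:
        return (a is None and b is None, depth - 1)
    if a.value != b.value:
        return (False, depth)
    leftOk, leftDepth = equalDepth(a.left, b.left, depth + 1)
    if not leftOk:
        return (False, max(depth, leftDepth))
    rightOk, rightDepth = equalDepth(a.right, b.right, depth + 1)
    return (rightOk, max(depth, leftDepth, rightDepth))


chainResult = equalDepth(chain(127), chain(127))
balancedResult = equalDepth(balanced(0, 127), balanced(0, 127))
```

With 127 nodes, the chain reaches a depth of 127 (linear in N), while the balanced tree reaches a depth of 7 (logarithmic in N).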

3. Define a method isomorphic in the Tree class that takes a tree as its parameter, and evaluates to true if and only if the input tree is isomorphic to self. Two trees are considered isomorphic if their root nodes are equal, and each node in the tree either (1) has a left child that is isomorphic to the left child of the corresponding node in the self tree and has a right child that is isomorphic to the right child of the corresponding node in the self tree; or (2) has a left child that is isomorphic to the right child of the corresponding node in the self tree and has a right child that is isomorphic to the left child of the corresponding node in the self tree. (The intuition behind our definition is the two trees would be equal if you could swap left and right children.)

There are lots of possibilities here. One is to modify our equal definition to add the isomorphic cases. This would be pretty complex however, especially since we have to deal specially with None children.

So, instead we implement a somewhat simpler approach:

    def numChildren(self):
       num = 0
       if not self.__left == None:
           num += 1
       if not self.__right == None:
           num += 1
       return num
       
    def isomorphic(self, t):
       if not self.__value == t.__value:
           return False
       
       if not self.numChildren () == t.numChildren ():
           return False
       
       if self.numChildren () == 0:
           return True
       elif self.numChildren () == 1:
           schild = self.getLeft ()
           if schild == None:
               schild = self.getRight ()
           tchild = t.getLeft ()
           if tchild == None:
               tchild = t.getRight ()
           return schild.isomorphic (tchild)
       else:
           return (self.getLeft ().isomorphic (t.getLeft ()) \
                   and self.getRight ().isomorphic (t.getRight ())) \
                   or \
                  (self.getLeft ().isomorphic (t.getRight ()) \
                   and self.getRight ().isomorphic (t.getLeft ()))

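Another option would be to use equal and swap children in our comparisons, but repair them after. This is risky: we need to know there is no other code running concurrently that might observe the tree in its altered state. It does make the code simpler, however. Note that, as written, it swaps only the root's children, so it recognizes a swap at the root but not swaps that are needed only at deeper nodes: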

def isomorphic(self, t):
   if self.equal(t):
       return True
   else:
       (self.__left, self.__right) = (self.__right, self.__left)
       res = self.equal(t)
       (self.__right, self.__left) = (self.__left, self.__right)
       return res
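
A small check of how the two approaches treat a swap below the root. This is a sketch only: Node and the three functions are hypothetical stand-ins written over plain nodes, not the course's Tree methods, and swapRootIsomorphic mirrors the swap-and-repair idea with a try/finally so the tree is restored even if the comparison raises.

```python
class Node(object):
    # Hypothetical minimal tree node (not the course's Tree class).
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right


def equal(a, b):
    if a is None or b is None:
        return a is None and b is None
    return (a.value == b.value
            and equal(a.left, b.left)
            and equal(a.right, b.right))


def isomorphic(a, b):
    # Children may be swapped at every node, as in the problem's definition.
    if a is None or b is None:
        return a is None and b is None
    if a.value != b.value:
        return False
    return ((isomorphic(a.left, b.left) and isomorphic(a.right, b.right))
            or (isomorphic(a.left, b.right) and isomorphic(a.right, b.left)))


def swapRootIsomorphic(a, b):
    # Swap-and-repair: only the root's children are ever swapped.
    if equal(a, b):
        return True
    a.left, a.right = a.right, a.left
    try:
        return equal(a, b)
    finally:
        a.left, a.right = a.right, a.left  # always restore the original order


# The two trees differ only by a swap one level below the root.
s = Node(1, Node(2, Node(4), Node(5)), Node(3))
t = Node(1, Node(2, Node(5), Node(4)), Node(3))

fullResult = isomorphic(s, t)
rootOnlyResult = swapRootIsomorphic(s, t)
```

Here fullResult is True while rootOnlyResult is False, which suggests the swap-based version would need to apply the swap recursively to match the definition in the question.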

4. Define a method, iterEqual in the Tree class that takes a tree as its parameter, and evaluates to true if and only if the input tree is equal to self (with the same behavior as the equal method in question 1). Your definition should not be recursive (it cannot use any recursive calls, but may use looping control structures such as for or while).

This is pretty tricky. We need to find a way to keep track of the state of the comparison. With the recursive definition, Python's runtime stack does this for us. If we can't use recursion, though, we need to keep track of this ourselves. Our strategy is to maintain a list of pairs of nodes that remain to be checked.

    def iterEqual(self, t):
        print("iterEqual: " + str(self) + " / " + str(t))
        nodes = [[self, t]]
        
        while not len(nodes) == 0:
            nnodes = []
            for pair in nodes:
                print("Checking pair: " + str(pair[0]) + " / " + str(pair[1]))
                if pair[0] == None:
                    if not pair[1] == None:
                        return False
                elif pair[1] == None:
                    return False
                else:
                    if pair[0].getValue () != pair[1].getValue ():
                       return False
                    nnodes.append ([pair[0].getLeft (), pair[1].getLeft ()])
                    nnodes.append ([pair[0].getRight (), pair[1].getRight ()])
            nodes = nnodes
            
        return True

Note that the code is actually simpler than our recursive code because we don't need as much special code for handling the None cases.

5. For each of the subquestions, express your answer as an asymptotically tight (Θ) bound and briefly justify your answer. Use N to represent the number of nodes in the input tree. Assume the Python interpreter implements procedure calls in a straightforward way (that is, it does not do any transformations to optimize tail recursive calls).

What is the worst case running time of your iterEqual method?
Θ(N): the worst case is when all the nodes are equal, so each node value must be compared once. The while loop runs once per level of the tree, and across all its iterations the inner for loop processes at most 2N + 1 pairs (one for the roots plus two for the children of each node). The running time of each operation is constant. This assumes the list append and access operations are all O(1).
What is the best case running time of your iterEqual method?
As in 2b, Θ(1).
What is the worst case space usage of your iterEqual method?
Since there are no recursive calls, the stack depth is constant. But, iterEqual uses memory to store the nnodes list. The space needed to store a list scales linearly in the number of elements in the list. So, we need to figure out the longest list it could be.
The nnodes list holds the pairs of nodes at a given depth of the tree, so its maximum length is proportional to the maximum number of nodes at any tree depth. This is maximized for a well-balanced tree, where all the leaves are at the same depth. The maximum number of leaves in a tree of N nodes is about N/2. So, the memory use is in Θ(N).
What is the worst case space usage of your iterEqual method if the input trees are both well-balanced?
That is the worst case, Θ(N) as explained above.
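
The list-length argument can be checked by recording the longest pair list the loop ever builds. This is a sketch with hypothetical names (Node, chain, balanced, maxFrontier), written over plain nodes rather than as a Tree method. Because the list also holds a pair for each missing child, the count for a perfectly balanced tree comes out as N + 1 rather than about N/2, which is still in Θ(N).

```python
class Node(object):
    # Hypothetical minimal tree node (not the course's Tree class).
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right


def chain(n):
    # Completely unbalanced tree: n nodes, each with only a right child.
    t = None
    for i in range(n):
        t = Node(i, None, t)
    return t


def balanced(lo, hi):
    # Balanced tree holding the values lo .. hi-1.
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(mid, balanced(lo, mid), balanced(mid + 1, hi))


def maxFrontier(a, b):
    # Level-by-level comparison; returns (result, longest pair list seen).
    nodes = [(a, b)]
    longest = 1
    while nodes:
        nnodes = []
        for (x, y) in nodes:
            if x is None or y is None:
                if not (x is None and y is None):
                    return (False, longest)
                continue
            if x.value != y.value:
                return (False, longest)
            nnodes.append((x.left, y.left))
            nnodes.append((x.right, y.right))
        longest = max(longest, len(nnodes))
        nodes = nnodes
    return (True, longest)


balancedFrontier = maxFrontier(balanced(0, 127), balanced(0, 127))
chainFrontier = maxFrontier(chain(127), chain(127))
```

For 127 nodes, the balanced tree produces a longest list of 128 pairs, while the completely unbalanced chain never holds more than 2 pairs. So for this method, unlike the recursive one, the balanced tree is the expensive case for space.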

6. The provided insert method has expected running time in Θ(N) where N is the number of entries in the table. (We are optimistically assuming the Python slicing and access operations are in O(1).) Define an insert method that has expected running time in Θ(log N).

We use the same search strategy as in lookup to find the correct insertion position:

    def insert(self, key, value):
        def insertposition(low, high):
            if (low >= high):
                return low
            middle = (low + high) // 2
            if key < self.items[middle].key:
                return insertposition (low, middle)
            elif key > self.items[middle].key:
                return insertposition (middle + 1, high)
            else:
                print("ERROR! Duplicate key")
                assert (False)

        pos = insertposition (0, len(self.items))
        self.items.insert (pos, Record (key,value))

This has expected running time in Θ(log N) since each recursive call to insertposition halves the number of locations that are under consideration (again optimistically assuming the list insert operation is in O(1)).
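
The same position search is available in Python's standard bisect module. The sketch below is an illustration under assumptions, not the course's ContinuousTable: Record and SortedTable are hypothetical stand-ins, and the table keeps a parallel list of keys so that bisect can search it directly.

```python
import bisect


class Record(object):
    # Hypothetical stand-in for the course's Record class.
    def __init__(self, key, value):
        self.key = key
        self.value = value


class SortedTable(object):
    # Hypothetical table that keeps keys and records in matching sorted order.
    def __init__(self):
        self.keys = []
        self.items = []

    def insert(self, key, value):
        # Binary search for the insertion point: O(log N) comparisons.
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            raise KeyError("duplicate key: %r" % (key,))
        # list.insert shifts later elements, which is O(N) in CPython.
        self.keys.insert(pos, key)
        self.items.insert(pos, Record(key, value))


table = SortedTable()
for k in [5, 1, 3, 4, 2]:
    table.insert(k, str(k))
```

Only the search is logarithmic here. The list insertion itself shifts elements, so the overall Θ(log N) bound holds only under the optimistic assumption stated in the question.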

7. Ari Tern suggests replacing the implementation of lookup with this implementation (tlookup in ContinuousTable.py):

    def tlookup(self, key):
        def lookuprange(items):
            if len(items) == 0: return None
            if len(items) == 1:
                if items[0].key == key:
                    return items[0].value
                else:
                    return None
            split1 = len(items) // 3
            split2 = 2 * len(items) // 3
            
            if key < items[split1].key:
                return lookuprange (items[:split1])
            elif key < items[split2].key:
                return lookuprange (items[split1:split2])
            else:
                return lookuprange (items[split2:])
                
        return lookuprange(self.items)

Is this a good idea? (A good answer will consider the effect of Ari's change on both the asymptotic and absolute properties of the procedure.)

The tlookup implementation requires fewer recursive calls to lookuprange than was required with lookup since each call eliminates two thirds of the items from consideration, instead of just one half. This means the number of expected calls is log₃ N instead of log₂ N. Within our order notation, this doesn't matter, though, since changing the base of a log only alters the value by a constant factor. So, the asymptotic running time is still in Θ(log N).
The actual running time, however, will be affected. We argued in the previous paragraph that the number of calls is reduced from log₂N to log₃N. In Lecture 5, we saw
log_b x = log_a x / log_a b
So, this reduces the number of calls by a factor of 1 / log₂ 3 ≈ 0.63. The cost is an increase in the size (and complexity) of the code, and an increase in the running time of each call.
We can estimate the running time increase by the number of expected comparisons. In the original code, one comparison is always needed (key < items[middle].key). (We ignore the end cases where the length is 0 or 1 since these are only encountered once.) In the modified code, this is more complex. We always make the first comparison (key < items[split1].key). If it is true, we are done. Otherwise, we need to make the second comparison. Assuming the calls to lookup are evenly distributed over the list, we expect the first comparison to be true only 1/3 of the time. Hence, the expected number of comparisons is 1 + 2/3.
If our assumption that comparisons dominate the running time holds, then the expected running time is 0.63 * (1 + 2/3) ≈ 1.05 times the running time of lookup. So, we would expect it to be slightly slower, but after accounting for the overhead of the calls and the other work, this would be reduced. Hence, the change is a bad idea. There is no likely performance improvement (and a possible reduction), and the size of the code has increased.
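
The arithmetic in this estimate can be reproduced directly. The variable names below are introduced only for illustration, and the model (comparisons dominate, lookups evenly distributed) is the one assumed in the text.

```python
import math

# Ratio of call counts: log_3 N / log_2 N = 1 / log_2 3, independent of N.
callRatio = math.log(2) / math.log(3)

# One comparison always, plus a second one 2/3 of the time.
expectedComparisons = 1 + 2.0 / 3

# Estimated running time of tlookup relative to lookup under this model.
relativeCost = callRatio * expectedComparisons

print("call ratio: %.3f" % callRatio)
print("relative cost: %.3f" % relativeCost)
```

The call ratio comes out at about 0.631 and the relative cost at about 1.052, in line with the 0.63 and 1.05 used above.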

8. Construct a simple example where the greedy algorithm does not find the best phylogeny. Explain why the greedy algorithm does not find the best possible phylogeny for your example.

We need to find an example where the best possible phylogeny does not match the one found by the greedy algorithm. Any case where the best phylogeny does not directly connect the two elements with the highest goodness score would satisfy this, since we know the greedy algorithm would connect those elements.

9. What is the asymptotic running time of the greedy phylogeny algorithm? Explain your reasoning clearly and any assumptions you make.

The greedy algorithm could be implemented with a running time in Θ(n²) where n is the number of species in the input set.
We need to first compute the goodness matrix. This involves computing the best alignment of each pair of sequences. There are n² cells to fill. If we use the Needleman-Wunsch algorithm (Lecture 4), each one requires work in Θ(|U||V|). If we assume the lengths of the input genomes do not scale (that is, we are concerned with n scaling, but the genome lengths are bounded), this is constant time with respect to the input size (which is measured in the number of species in the input set).
Then, we execute the greedy algorithm. Finding the best initial pair requires running time in O(n²) assuming we can access each cell in the matrix in constant time. We just need to look at all the cells to find the best goodness score.
Adding each element requires considering all remaining elements (there are O(n) of them). For each one, we need to consider all possible tree locations where it could be added. This scales with the number of nodes in the current tree — each node can have at most two children to consider. The number of nodes in the tree is up to n. This is O(n²). For each, we need to compute the total goodness score. If we use the result from the previous tree, though, we can compute this by just adding the new goodness score to the old score, so this can be done in constant time.
Hence, the total running time is in O(n²).
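
As an illustration of the first step only (finding the best initial pair), the scan over the goodness matrix might look like the sketch below. The matrix layout (a symmetric list of lists indexed by species number), the sample scores, and the name bestPair are all assumptions made for this example; the problem set's actual data structures are not shown here.

```python
def bestPair(goodness):
    # goodness: hypothetical symmetric n-by-n list of lists of scores.
    # Scans each unordered pair once, so the work is in O(n^2).
    best = None
    n = len(goodness)
    for i in range(n):
        for j in range(i + 1, n):
            if best is None or goodness[i][j] > goodness[best[0]][best[1]]:
                best = (i, j)
    return best


# Made-up scores for three species, purely for illustration.
scores = [[0, 3, 7],
          [3, 0, 5],
          [7, 5, 0]]
pair = bestPair(scores)
```

For this made-up matrix the highest score is 7, between species 0 and 2, so pair is (0, 2).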

CS216: Program and Data Representation
University of Virginia

David Evans
evans@cs.virginia.edu