Problem Set 3: Phaster Phylogeny
Out: 9 February
Due: 15/17 February (see below)
Collaboration Policy - Read Carefully
For this assignment, you may work on your own or with any one other person of your choice. If you work with a partner, you should turn in one assignment with both of your names on it. Keep in mind that the main purpose of this assignment is to help you prepare for Exam 1. So, you should decide to work alone or with a partner based on which approach you believe will be most helpful to you in learning the material it covers.
You may consult any outside resources including books, papers, web sites and people you wish. You are also encouraged to discuss these problems with students in the class.
You are strongly encouraged to take advantage of the staffed lab hours posted on the CS216 web site.
Purpose
def tlookup(self, key):
    def lookuprange(items):
        if len(items) == 0: return None
        if len(items) == 1:
            if items[0].key == key:
                return items[0].value
            else:
                return None
        # Integer division keeps the split points usable as list indices.
        split1 = len(items) // 3
        split2 = 2 * len(items) // 3
        if key < items[split1].key:
            return lookuprange(items[:split1])
        elif key < items[split2].key:
            return lookuprange(items[split1:split2])
        else:
            return lookuprange(items[split2:])
    return lookuprange(self.items)
Is this a good idea? (A good answer will consider the effect of Ari's
change on both the asymptotic and absolute properties of the procedure.)
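To experiment with the procedure before answering, a small harness along the following lines may help. The enclosing class is not shown here, so `Record` and `Table` are hypothetical stand-ins: the sketch assumes only that `self.items` is a list sorted by key whose entries have `key` and `value` fields. It uses `//` so the split points stay integers under Python 3.

```python
from collections import namedtuple

# Hypothetical record type: the procedure only needs .key and .value.
Record = namedtuple("Record", ["key", "value"])

class Table:
    """Hypothetical container; assumes items are kept sorted by key."""

    def __init__(self, pairs):
        self.items = sorted((Record(k, v) for k, v in pairs),
                            key=lambda r: r.key)

    def tlookup(self, key):
        def lookuprange(items):
            if len(items) == 0:
                return None
            if len(items) == 1:
                return items[0].value if items[0].key == key else None
            split1 = len(items) // 3
            split2 = 2 * len(items) // 3
            if key < items[split1].key:
                return lookuprange(items[:split1])
            elif key < items[split2].key:
                return lookuprange(items[split1:split2])
            else:
                return lookuprange(items[split2:])
        return lookuprange(self.items)

table = Table([(k, k * k) for k in range(0, 40, 2)])
print(table.tlookup(6))   # 36
print(table.tlookup(7))   # None (7 is not a key)
```

Note that each recursive call passes a list slice, which copies the elements in that range; this is worth keeping in mind when thinking about the absolute cost of the procedure.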
The problem of finding the best phylogeny for a set of sequences is known to be NP-Complete (don't worry if you don't know what this means yet; we will cover it later). This means that it is unlikely that any solution asymptotically better than trying all possible trees exists. (If a faster approach were found, it would mean that many other problems believed to be hard could also be solved quickly.) So, to solve phylogeny construction problems of a non-trivial size, we need to make compromises. We trade off the guarantee of finding the best phylogeny for the practicality of quickly finding a phylogeny that is likely to be reasonably good.
The approach we will use is an example of a greedy algorithm. A greedy algorithm makes the locally optimal choice at the first step and at each successive step. This strategy is fast, since it only involves considering each immediate possibility, instead of considering all possibilities for the entire solution. However, it is not guaranteed to lead to a globally optimal solution (in this case, it might not find the best possible phylogeny).
The algorithm we will use is a simple version of the UPGMA (unweighted pair group method with arithmetic mean) algorithm (which is a bit simpler than the most popular current phylogeny construction algorithms).
The idea behind UPGMA is to build the tree greedily, connecting the most similar sequences at every step. We start by computing a table of the goodness scores of all pairs of sequences. Then, we find the two elements with the highest goodness score and connect them (one element will be the parent and the other its left child). Then, we add the other elements to the tree greedily: at each step we find the addition that gives the highest goodness score possible (without altering the existing tree). Each iteration considers all remaining elements in the set, and all possible positions in the tree where they could be added: as a new root (with the existing root as its left child), or as a child of any node that does not already have two children. We continue in this manner until all elements are added to the tree.
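The steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the reference solution: the names `greedy_phylogeny`, `total_goodness`, and the dictionary-of-children tree representation are made up for this sketch, and ties are broken by taking the first candidate found. Because the existing tree is never altered, picking the addition with the best overall score is the same as picking the new edge with the highest goodness.

```python
from itertools import combinations

# Pairwise goodness scores for four example species (symmetric).
SCORES = {
    ("Cat", "Dog"): 0, ("Cat", "Feline"): 20, ("Cat", "Tiger"): 36,
    ("Dog", "Feline"): 0, ("Dog", "Tiger"): 0, ("Feline", "Tiger"): 30,
}

def goodness(a, b):
    return SCORES[tuple(sorted((a, b)))]

def greedy_phylogeny(species, goodness):
    """Return (root, children); children maps a name to its 0-2 child names."""
    # Step 1: link the highest-scoring pair as parent and left child.
    parent, child = max(combinations(species, 2), key=lambda p: goodness(*p))
    root = parent
    children = {parent: [child], child: []}
    remaining = [s for s in species if s not in children]

    # Step 2: repeatedly make the single best addition.
    while remaining:
        best = None  # (edge score, new element, attach point or None for new root)
        for s in remaining:
            # Option A: s becomes the new root above the current root.
            options = [(goodness(s, root), s, None)]
            # Option B: s becomes a child of any node with a free slot.
            for node, kids in children.items():
                if len(kids) < 2:
                    options.append((goodness(s, node), s, node))
            for option in options:
                if best is None or option[0] > best[0]:
                    best = option
        _, s, node = best
        if node is None:
            children[s] = [root]
            root = s
        else:
            children[node].append(s)
            children[s] = []
        remaining.remove(s)
    return root, children

def total_goodness(children, goodness):
    """Sum the goodness of every parent-child edge in the tree."""
    return sum(goodness(p, c) for p, kids in children.items() for c in kids)

root, children = greedy_phylogeny(["Tiger", "Cat", "Feline", "Dog"], goodness)
print(root, total_goodness(children, goodness))  # Dog 66
```

Each iteration examines every remaining element against every open position, so the work per step is small compared with enumerating all possible trees.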
For example, consider the species from PS2, with this goodness matrix:
| Species | Cat | Dog | Feline | Tiger |
|---|---|---|---|---|
| Cat | - | 0 | 20 | 36 |
| Dog | 0 | - | 0 | 0 |
| Feline | 20 | 0 | - | 30 |
| Tiger | 36 | 0 | 30 | - |
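One possible way to hold this matrix in Python is a dictionary keyed by unordered pairs, which stores each symmetric score once. The names `GOODNESS`, `goodness`, and `best_pair` are assumptions made for this sketch.

```python
# Goodness scores from the table above, stored once per unordered pair.
GOODNESS = {
    frozenset(["Cat", "Dog"]): 0,
    frozenset(["Cat", "Feline"]): 20,
    frozenset(["Cat", "Tiger"]): 36,
    frozenset(["Dog", "Feline"]): 0,
    frozenset(["Dog", "Tiger"]): 0,
    frozenset(["Feline", "Tiger"]): 30,
}

def goodness(a, b):
    """Look up the symmetric goodness score of two species."""
    return GOODNESS[frozenset([a, b])]

def best_pair(scores):
    """Return (pair in alphabetical order, score) for the highest-scoring pair."""
    pair = max(scores, key=scores.get)
    return tuple(sorted(pair)), scores[pair]

print(goodness("Tiger", "Feline"))  # 30
print(best_pair(GOODNESS))          # (('Cat', 'Tiger'), 36)
```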
Our greedy algorithm will start by linking the two elements with the highest goodness score:

Tiger
    Cat

Of course, the goodness score is symmetric, so we could also do:

Cat
    Tiger

Next, we will add another element to the tree. First, we consider adding Feline. There are 3 possibilities:
Feline as the new root:
Feline
    Tiger
        Cat
goodness = 30 + 36 = 66

Feline as a child of Cat:
Tiger
    Cat
        Feline
goodness = 36 + 20 = 56

Feline as the second child of Tiger:
Tiger
    Cat
    Feline
goodness = 36 + 30 = 66

We also consider adding Dog:
Dog as the new root:
Dog
    Tiger
        Cat
goodness = 0 + 36 = 36

Dog as a child of Cat:
Tiger
    Cat
        Dog
goodness = 36 + 0 = 36

Dog as the second child of Tiger:
Tiger
    Cat
    Dog
goodness = 36 + 0 = 36

Of the six trees we consider, the best are the 1st and 3rd (with equally
good scores of 66). So, we greedily pick one of them (say the 1st) and
continue.
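The sums above can be checked with a short script. It assumes, as the worked sums suggest, that a tree's goodness is the total of the goodness scores along its parent-child edges; the tuple representation `(name, left, right)` and the helper names are made up for this sketch.

```python
# Pairwise goodness scores from the example matrix (symmetric).
SCORES = {
    ("Cat", "Dog"): 0, ("Cat", "Feline"): 20, ("Cat", "Tiger"): 36,
    ("Dog", "Feline"): 0, ("Dog", "Tiger"): 0, ("Feline", "Tiger"): 30,
}

def goodness(a, b):
    return SCORES[tuple(sorted((a, b)))]

def leaf(name):
    return (name, None, None)

def tree_goodness(tree):
    """Sum goodness over every parent-child edge of (name, left, right)."""
    if tree is None:
        return 0
    name, left, right = tree
    total = 0
    for child in (left, right):
        if child is not None:
            total += goodness(name, child[0]) + tree_goodness(child)
    return total

def candidates(new):
    """The three ways to add `new` to the tree Tiger -> Cat."""
    return [
        (new, ("Tiger", leaf("Cat"), None), None),            # new root
        ("Tiger", ("Cat", leaf(new), None), None),            # child of Cat
        ("Tiger", leaf("Cat"), leaf(new)),                    # second child of Tiger
    ]

print([tree_goodness(t) for t in candidates("Feline")])  # [66, 56, 66]
print([tree_goodness(t) for t in candidates("Dog")])     # [36, 36, 36]
```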
Now, we have one element left to add. We consider all possibilities of adding Dog to the tree:
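Dog as the new root:
Dog
    Feline
        Tiger
            Cat

Dog as a child of Cat:
Feline
    Tiger
        Cat
            Dog

Dog as the second child of Tiger:
Feline
    Tiger
        Cat
        Dog

Dog as the second child of Feline:
Feline
    Tiger
        Cat
    Dog
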
All of them are equally good (since the goodness score of Dog with any
other element is 0).
In this case we were lucky — the greedy algorithm found the best possible phylogeny with far less work than the brute force algorithm. However, the greedy algorithm is not guaranteed to always find the best phylogeny.
CS216: Program and Data Representation, University of Virginia
David Evans, evans@cs.virginia.edu