Problem Set 2: Phylogeny
Out: 30 January; Due: 8 February (11am)
Collaboration Policy - Read Carefully
For this assignment, you should do the first two parts (questions 1-5) on your own, and then meet with your assigned partner. Assigned partners will be emailed to the course list on Monday, January 30.
When you meet with your partner, you should first discuss your answers to the first two parts to arrive at a consensus best answer for each question. The consensus answer is the only answer you will turn in. Then, you should work as a team on the final part (questions 6-10). When you are working as a team, both partners should be actively involved all the time and you should take turns driving (the driver is whoever is typing at the keyboard).
You may consult any outside resources you wish, including books, papers, web sites, and people, for information on Python programming. Unlike Problem Set 1, you should feel free to conduct web searches or look at reference material on sequence alignment, phylogeny, and related problems. You are also encouraged to discuss these problems with students in the class, including (but not limited to) your assigned partner.
You are strongly encouraged to take advantage of the staffed lab hours (which will be posted on the CS216 web site).
Purpose
| Operation | LinkedList.py Running Time | LinkedList.py Memory | ContinuousList.py Running Time | ContinuousList.py Memory |
|---|---|---|---|---|
| length(self) | Θ(n) | Θ(1) | Θ(1) | Θ(1) |
| __init__(self) | | | | |
| access(self,index) | | | | |
| append(self,value) | | | | |
| __str__(self) | | | | |
We can define a mutable abstract datatype with operations as follows (note that the first three operations are identical to those for the immutable list datatype from Lecture 3):
Consider our alignment code from PS1, excerpted below:
    def bestAlignment (U, V, c, g):
        if len(U) == 0 or len(V) == 0:
            ...
        else:
            # try three possibilities:
            (U0, V0) = bestAlignment (U[1:], V[1:], c, g)
            ...
            (U1, V1) = bestAlignment (U, V[1:], c, g)
            ...
            (U2, V2) = bestAlignment (U[1:], V, c, g)
            ...
            # pick the best one
Although this is a clear way of finding the best alignment, as discussed in Lecture 4 it is very inefficient because the same subproblems are recomputed many times. It is so inefficient that we cannot find alignments for non-trivial strings.
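To get a feel for how quickly the work grows, the sketch below counts the calls the unmemoized recursion would make for inputs of lengths m and n. It is written in Python 3, and the name `naive_calls` is hypothetical (it is not part of the PS1 code). The counter itself is cached only so that the count can be computed quickly.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def naive_calls(m, n):
    """Count the calls unmemoized bestAlignment makes on strings of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # base case: a single call with no further recursion
    # This call, plus the three recursive branches tried in the else clause.
    return (1 + naive_calls(m - 1, n - 1)
              + naive_calls(m, n - 1)
              + naive_calls(m - 1, n))

for size in (2, 4, 8, 16):
    print(size, naive_calls(size, size))
```

There are only (m + 1) × (n + 1) distinct pairs of suffixes, so nearly all of these calls repeat work that has already been done.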
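The modified code is found in DynAlign.py and shown below. The key changes are the memo dictionary, which is checked at the start of each call and updated before each return, and the inner memoBestAlignment function that carries out the recursion: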
    def bestAlignment (U, V, c, g):
        def memoBestAlignment (U, V, c, g):
            def makeKey (U, V):
                return U + "%" + V

            if memo.has_key(makeKey (U,V)):
                res = memo[makeKey (U,V)]
                return res[0], res[1]

            if len(U) == 0 or len(V) == 0:
                while len(U) < len(V): U = U + GAP
                while len(V) < len(U): V = V + GAP
                resU = U
                resV = V
            else:
                # try with no gap
                (U0, V0) = memoBestAlignment (U[1:], V[1:], c, g)
                scoreNoGap = goodnessScore (U0, V0, c, g)
                if U[0] == V[0]: scoreNoGap += c

                # try inserting a gap in U (no match for V[0])
                (U1, V1) = memoBestAlignment (U, V[1:], c, g)
                scoreGapU = goodnessScore (U1, V1, c, g) - g

                # try inserting a gap in V (no match for U[0])
                (U2, V2) = memoBestAlignment (U[1:], V, c, g)
                scoreGapV = goodnessScore (U2, V2, c, g) - g

                if scoreNoGap >= scoreGapU and scoreNoGap >= scoreGapV:
                    resU = U[0] + U0
                    resV = V[0] + V0
                elif scoreGapU >= scoreGapV:
                    resU = GAP + U1
                    resV = V[0] + V1
                else:
                    resU = U[0] + U2
                    resV = GAP + V2

            memo[makeKey(U,V)] = [resU, resV]
            return resU, resV

        memo = {}
        return memoBestAlignment (U, V, c, g)
The Tree of Life project is developing a phylogeny for organisms on Earth. If you are unsure of your place in the universe, try starting from Life on Earth and walking down the tree to find Homo sapiens.
The way biologists (or linguists) determine evolutionary relationships is to look for similarities and differences between species (or languages). This is done by identifying a set of features that describe properties of a species or language. For species, the features might be phenotypic properties (e.g., do organisms have wings or gills?) or genotypic properties (the DNA sequence). Genotypic properties are likely to produce more accurate results, since small changes in genomes can produce large phenotypic changes. Note that this is a historical study. It can rarely provide definitive proof of a particular relationship, but a preponderance of evidence can make one explanation appear to be the most likely.
If two species have similar genomes, it is likely they evolved from a relatively recent common ancestor. Biologists measure the similarity of genomes based on the number and likelihood of different kinds of mutations: base pairs may be inserted, deleted, duplicated, moved, or substituted. The number of mutations necessary to match two genomes gives an indication of the likelihood that the species evolved from a common ancestor. For this assignment we will assume a very simple model: the only mutation is a substitution of a single base pair, and all substitutions are equally likely.
One measure of which tree is the most likely to represent the actual evolution of a set of species is parsimony. The parsimony principle is that if there are two possible explanations for an observed phenomenon, the simpler explanation is most likely to be correct. In producing phylogenetic trees, parsimony means we should look for the tree that requires the fewest possible total number of mutations. The goodness scores of the best possible alignments of two nucleotide sequences are one way of measuring how related they are. So, our goal is to construct a tree that maximizes the total goodness score of all connected pairs.
For example, consider the set of species described by the genomes below (of course, these are not their real genomes!):
| Species | Sequence |
|---|---|
| Cat | catcat |
| Dog | gggggg |
| Feline | cccccc |
| Tiger | cccaat |
The goodness scores of the possible pairs (using the c=10, g=2 goodness metric from PS1) are:
| Species | Cat | Dog | Feline | Tiger |
|---|---|---|---|---|
| Cat | - | 0 | 20 | 36 |
| Dog | 0 | - | 0 | 0 |
| Feline | 20 | 0 | - | 30 |
| Tiger | 36 | 0 | 30 | - |
Note that our goodness score metric is symmetric (that is, goodness(a,b) = goodness(b,a)).
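The table can be reproduced with a short sketch. It assumes the PS1 metric awards c for each matching column and subtracts g for each gap character, which is consistent with the scores above but is not spelled out in this handout. The name `best_goodness` is hypothetical, and the code is Python 3 rather than the Python 2 used in DynAlign.py. It returns only the best score, not the aligned strings.

```python
from functools import lru_cache

def best_goodness(u, v, c=10, g=2):
    """Best alignment score, assuming score = c * matches - g * gap characters."""
    @lru_cache(maxsize=None)
    def best(i, j):
        # Score of the best alignment of the suffixes u[i:] and v[j:].
        if i == len(u) or j == len(v):
            # One suffix is empty: the rest of the other is padded with gaps.
            return -g * ((len(u) - i) + (len(v) - j))
        no_gap = best(i + 1, j + 1) + (c if u[i] == v[j] else 0)
        gap_in_u = best(i, j + 1) - g   # v[j] is paired with a gap
        gap_in_v = best(i + 1, j) - g   # u[i] is paired with a gap
        return max(no_gap, gap_in_u, gap_in_v)
    return best(0, 0)

genomes = {"Cat": "catcat", "Dog": "gggggg", "Feline": "cccccc", "Tiger": "cccaat"}
for a in sorted(genomes):
    print(a, [best_goodness(genomes[a], genomes[b]) for b in sorted(genomes) if b != a])
```

Keying the cache on the index pair (i, j) rather than on the suffix strings is a design choice of this sketch; it plays the same role as makeKey in the memoized code.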
Our goal is to find likely evolutionary relationships among the species by maximizing the sum of the goodness scores of all direct relationships. For example, consider the tree:
          Cat
         /   \
        /     \
    Tiger     Dog
              /
             /
        Feline
The total goodness score is goodness(Cat, Tiger) + goodness(Cat, Dog) + goodness(Dog, Feline) = 36 + 0 + 0 = 36.
This is a less likely phylogeny than:
        Feline
        /    \
       /      \
     Dog     Tiger
              /
             /
          Cat
which has a total goodness score of 66. Other trees have the same score,
but no tree has a higher score.
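The two totals can be checked mechanically. The sketch below (Python 3) represents each tree as a list of parent-child edges and sums the pairwise scores copied from the table; the names `GOODNESS` and `tree_score` are hypothetical and are not part of the provided code.

```python
# Pairwise goodness scores copied from the table in the text.
# frozenset keys make the lookup symmetric: (a, b) and (b, a) agree.
GOODNESS = {
    frozenset(["Cat", "Dog"]): 0,
    frozenset(["Cat", "Feline"]): 20,
    frozenset(["Cat", "Tiger"]): 36,
    frozenset(["Dog", "Feline"]): 0,
    frozenset(["Dog", "Tiger"]): 0,
    frozenset(["Feline", "Tiger"]): 30,
}

def tree_score(edges):
    """Sum the goodness scores over all direct (parent, child) relationships."""
    return sum(GOODNESS[frozenset(edge)] for edge in edges)

first_tree = [("Cat", "Tiger"), ("Cat", "Dog"), ("Dog", "Feline")]
second_tree = [("Feline", "Dog"), ("Feline", "Tiger"), ("Tiger", "Cat")]

print(tree_score(first_tree), tree_score(second_tree))
```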
For the remaining questions on this assignment, and most of Problem Set 3, you will explore algorithms and data structures in the context of finding phylogenetic trees. Note that we have greatly simplified the actual problem of determining biological evolutionary relationships. In fact, many species evolved from common ancestors which are now extinct. So, a more realistic phylogeny program would need to insert additional nodes to find a likely tree.
[Figure: two tree diagrams, one containing Cat, Tiger, Dog, and Feline, and one containing Feline, Dog, Tiger, and Cat]
Note that we do not need to distinguish between the left and right child
when a tree has only one child.
    def children(self):
        if self.__left is not None:
            yield self.__left
        if self.__right is not None:
            yield self.__right
It will yield the left child (if there is one) the first iteration
through the loop, and the right child (if there is one) the second
iteration. When the generator exits, there are no more values to yield
and the calling loop terminates. A client uses it like this,
    childsum = 0
    for child in tree.children():
        childsum += child.getValue ()
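To try the generator and the client loop together, here is a minimal self-contained sketch in Python 3. The `Tree` class, its constructor arguments, and the sample values are assumptions made for illustration; the course's actual tree class may differ.

```python
class Tree:
    """Hypothetical binary tree node with optional left and right children."""
    def __init__(self, value, left=None, right=None):
        self.__value = value
        self.__left = left
        self.__right = right

    def getValue(self):
        return self.__value

    def children(self):
        # Yields zero, one, or two children, skipping any that are missing.
        if self.__left is not None:
            yield self.__left
        if self.__right is not None:
            yield self.__right

tree = Tree(1, Tree(2), Tree(3))
childsum = 0
for child in tree.children():
    childsum += child.getValue()
print(childsum)

# A node with a single child yields just that child.
only_left = Tree(1, Tree(5))
print([child.getValue() for child in only_left.children()])
```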
The generator defined below yields all possible two-part partitions of
the input list:
    def allPossiblePartitions (items):
        if len(items) == 1:
            yield [items[0]], []
            yield [], [items[0]]
        else:
            for left, right in allPossiblePartitions (items[1:]):
                lplus = left[:]
                lplus.insert (0, items[0])
                yield lplus, right
                rplus = right[:]
                rplus.insert (0, items[0])
                yield left, rplus
    for p1, p2 in allPossiblePartitions (s):
        print p1, p2

Use n to represent the number of elements in s. You may assume print is O(1).
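To see what the generator produces on a small input, the sketch below is a Python 3 transcription of it (list concatenation replaces the copy-and-insert steps, and `print` becomes a function call). It is meant only for experimenting with small lists.

```python
def allPossiblePartitions(items):
    """Yield every way to split a non-empty list into a (left, right) pair."""
    if len(items) == 1:
        yield [items[0]], []
        yield [], [items[0]]
    else:
        for left, right in allPossiblePartitions(items[1:]):
            # The first item goes either into the left part or the right part.
            yield [items[0]] + left, right
            yield left, [items[0]] + right

partitions = list(allPossiblePartitions([1, 2]))
for p1, p2 in partitions:
    print(p1, p2)
```

Like the original, this version assumes the input list has at least one element.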
    findTree ({'feline':'cccccc', 'cat':'catcat', \
               'tiger':'cccaat', 'dog':'gggggg'})
should produce all trees with the maximal goodness score (66), including
the tree above. (The number of trees is 60 if we count isomorphic trees
separately, that is, trees that would be identical if the left and
right children were swapped. A better solution would remove these
isomorphically equivalent trees (leaving 11 distinct trees), since there
is no different meaning associated with the left and right children. It
is acceptable for a "green star" level solution to this question to
include isomorphically equivalent trees in its output.)
CS216: Program and Data Representation, University of Virginia
David Evans, evans@cs.virginia.edu