Problem Set 2Phylogeny |
Out: 30 January Due: 8 February (11am) |

**Collaboration Policy - Read Carefully**

For this assignment, you should do the first two parts (questions 1-5) on your own, and then meet with your assigned partner. Assigned partners will be emailed to the course list on Monday, January 30.

When you meet with your partner, you should first discuss your answers to the first two parts to arrive at a consensus best answer for each question. The consensus answer is the only answer you will turn in. Then, you should work as a team on the final part (questions 6-10). When you are working as a team, both partners should be actively involved all the time and you should take turns driving (who is typing at the keyboard).

You may consult any outside resources including books, papers, web sites and people, you wish for information on Python programming. Unlike Problem Set 1, you should feel free to conduct web searches or look at reference material on Sequence Alignment, Phylogeny, and related problems as you wish. You are also encouraged to discuss these problems with students in the class, including (but not limited to) your assigned partner.

You are **strongly encouraged** to take advantage of the staffed lab
hours (which will be posted on the CS216 web site).

**Purpose**

- Learn to experimentally analyze the asymptotic efficiency of a data structure
- Understand two basic list datatype implementations
- Get experience understanding and using a tree data abstraction
- Develop and analyze a brute-force phylogeny algorithm

LinkedList.py | ContinuousList.py | |||

Operation | Running Time | Memory | Running Time | Memory |
---|---|---|---|---|

length(self) | Θ(n) | Θ(1) | Θ(1) | Θ(1) |

__init__(self) | ||||

access(self,index) | ||||

append(self,value) | ||||

__str__(self) |

We can define a mutable abstract datatype with operations as follows (note that the first three operations are identical to those for the immutable list datatype from Lecture 3:

- Access (
*L*,*i*) (corresponds to the`access(`method): returns*self*,*index*)*L*[*i*] - Length (
*L*) (corresponds to the`length(`method): returns |*self*)*L*| - MakeEmptyList() (corresponds to the
`__init__(`method): returns <>*self*) - Append (
*L*,*e*) (corresponds to the`append(`method): modifies the value of*self*,*value*)*L*. The value of*L*_{post}= <*l*_{0},*l*_{1}, ...,*l*_{|L|-1},*e*> where <*l*_{0},*l*_{1}, ...,*l*_{|L|-1}> are the elements of*L*before the call.

self.__items.append(value)

self.__len += 1

Consider our alignment code from PS1, excerpted below:

def bestAlignment (U, V, c, g): if len(U) == 0 or len(V) == 0: ... else: # try three possibilities: (U0, V0) = bestAlignment (U[1:], V[1:], c, g) ... (U1, V1) = bestAlignment (U, V[1:], c, g) ... (U2, V2) = bestAlignment (U[1:], V, c, g) ... # pick the best oneAlthough this is a clear way of finding the best alignment, as discussed in Lecture 4 it is very inefficient. So inefficient, that we cannot find alignments for non-trivial strings.

The modified code is found in `DynAlign.py` and shown below. The
key changes are bolded:

def bestAlignment (U, V, c, g): def memoBestAlignment (U, V, c, g): def makeKey (U, V): return U + "%" + Vif memo.has_key(makeKey (U,V)): res = memo[makeKey (U,V)] return res[0], res[1]if len(U) == 0 or len(V) == 0: while len(U) < len(V): U = U + GAP while len(V) < len(U): V = V + GAP resU = U resV = V else: # try with no gap (U0, V0) = memoBestAlignment (U[1:], V[1:], c, g) scoreNoGap = goodnessScore (U0, V0, c, g) if U[0] == V[0]: scoreNoGap += c # try inserting a gap in U (no match for V[0]) (U1, V1) = memoBestAlignment (U, V[1:], c, g) scoreGapU = goodnessScore (U1, V1, c, g) - g # try inserting a gap in V (no match for U[0]) (U2, V2) = memoBestAlignment (U[1:], V, c, g) scoreGapV = goodnessScore (U2, V2, c, g) - g if scoreNoGap >= scoreGapU and scoreNoGap >= scoreGapV: resU = U[0] + U0 resV = V[0] + V0 elif scoreGapU >= scoreGapV: resU = GAP + U1 resV = V[0] + V1 else: resU = U[0] + U2 resV = GAP + V2memo[makeKey(U,V)] = [resU, resV]return resU, resV memo = {} return memoBestAlignment (U, V, c, g)

The Tree of Life
project is developing a phylogeny for organisms on Earth. If you
are unsure of your place in the univere, try staring from Life on
Earth and walking down the tree to find *Homo
sapiens*.

The way biologists (or linguists) determine evolutionary relationships
is to look for similarities and differences between species (or
languages). This is done by identifying a set of features that describe
properties of a species or language. For species, the features might be
phenotypic properties (e.g., do organisms have wings or gills?) or
genotypic properties (the DNA sequence). Genotypic properties are
likely to produce more accurate results, since small changes in genomes
can produce large phenotypic changes. Note that this is a
*historical* study. It can rarely provide definitive proof of a
particular relationship, but a preponderance of evidence can make one
explanation appear to be the most likely.

If two species have similar genomes, it is likely they evolved from a relatively recent comon ancestor. Biologists measure the similarity of genomes based on the number and likelihood of different kinds of mutations — base pairs may be inserted, deleted, duplicated, moved, or substituted. The number of mutations necessary to match two genomes gives an indication of the likelihood that the species evolved from a common ancestor. For this assignment we will assume a very simple model: the only mutation is a substitution or a single base pair and all substitutions are equally likely.

One measure of which tree is the most likely to represent the actual
evolution of a set of species is *parsimony*. The parsimony
principle is that if there are two possible explanations for an observed
phenomenon, the simpler explanation is most likely to be correct. In
producing phylogenetic trees, parsimony means we should look for the
tree that requires the fewest possible total number of mutations. The
goodness scores of the best possible alignments of two nucleotide
sequences are one way of measuring how related they are. So, our goal
is to construct a tree that maximizes the total goodness score of all
connected pairs.

For example, consider the set of species described by the genomes below (of course, these are not their real genomes!):

Species | Sequence |
---|---|

Cat | catcat |

Dog | gggggg |

Feline | cccccc |

Tiger | cccaat |

The goodness scores of the possible pairs (using the *c*=10,
*g*=2 goodness metric from PS1) :

Species | Cat | Dog | Feline | Tiger |
---|---|---|---|---|

Cat | - | 0 | 20 | 36 |

Dog | 0 | - | 0 | 0 |

Feline | 20 | 0 | - | 30 |

Tiger | 36 | 0 | 30 | - |

Note that our goodness score metric is symmetric (that is goodness(*a*,*b*) =
goodness(*b*,*a*)).

Our goal is to find likely evolutionary relationships among the species by maximizing the sum of the goodness scores of all direct relationships. For example, consider the tree:

Cat / \ / \ Tiger Dog / / FelineThe total goodness score is goodness(Cat, Tiger) + goodness (Cat, Dog) + goodness (Dog, Feline) = 36.

This is a less likely phylogeny than,

Feline / \ / \ Dog Tiger / / Catwhich has a total goodness score of 66. Other trees have the same score, but no tree has a higher score.

For the remaining questions on this assignment, and most of Problem Set 3, you will explore algorithms and data structures in the context of finding phylogenetic trees. Note that we have greatly simplified the actual problem of determining biological evolutionary relationships. In fact, many species evolved from common ancestors which are now extinct. So, a more realistic phlogeny program would need to insert additional nodes to find a likely tree.

Cat Tiger Dog Felineand

Feline Dog Tiger CatNote that we do not need to distinquish between the left and right child when a tree has only one child.

def children(self): if not self.__left == None: yield self.__left if not self.__right == None: yield self.__rightIt will yield the left child (if there is one) the first iteration through the loop, and the right child (if there is one) the second iteration. When the generator exits, there are no more values to yield and the calling loop terminates. A client uses it like this,

childsum = 0 for child in tree.children(): childsum += child.getValue ()The generator defined below yields all possible two-part partitions of the input list:

def allPossiblePartitions (items): if len(items) == 1: yield [items[0]], [] yield [], [items[0]] else: for left, right in allPossiblePartitions (items[1:]): lplus = left[:] lplus.insert (0, items[0]) yield lplus, right rplus = right[:] rplus.insert (0, items[0]) yield left, rplus

for p1, p2 in allPossiblePartitions (s): print p1, p2Use

- What is its asymptotic running time?
- What is its memory usage?

findTree ({'feline':'cccccc', 'cat':'catcat', \ 'tiger':'cccaat', 'dog':'gggggg'})should produce the all trees with maximal goodness score (66), including the tree above. (The number of trees is 60, if we count isomorphic trees where the trees would be identical if the left and right children are swapped. A better solution would remove these isomorphically equivalent trees (leaving 11 distinct trees), since there is no different meaning associated with the left and right children. It is acceptable for a "green star" level solution to this question to include trees that are isomorphically equivalent in your output.)

CS216: Program and Data RepresentationUniversity of Virginia |
David Evansevans@cs.virginia.eduUsing these Materials |