Same as Problem Set 1, except with different partners: For this assignment you should work with your assigned partner (partner assignments are posted on the course blog post that announces this assignment). Before you start working with your partner, read through the assignment yourself and think about the questions. Both partners must fully understand everything you turn in. Even if you believe you have successfully completed all the questions, you are not finished until your partner also understands everything in your submission.
Remember to follow the pledge you read and signed at the beginning of the semester. For this assignment, you may consult any outside resources, including books, papers, web sites and people, you wish except for materials from previous cs1120, cs150, and cs200 courses. You may consult an outside person (e.g., another friend who is a CS major but is not in this class) who is not a member of the course staff, but that person cannot type anything in for you and all work must remain your own. That is, you can ask general questions such as “can you explain definitions to me?” or “how do lists work in Scheme?”, but outside sources should never give you specific answers to problem set questions. If you use resources other than the class materials, lectures and course staff, explain what you used in your turn-in.
You are strongly encouraged to take advantage of the scheduled office hours for this course. See the course site for the schedule.
Your answers do not have to be correct to receive full credit, but you must clearly explain your reasoning. Assume that all the procedures from Chapter 5 are defined.
For the remainder of this problem set, you should work with your partner. First, compare and discuss your answers on Question 1. If you and your partner have different answers, try to decide which is correct. Then, download the PS2 code so you can check your answers in the interpreter.
For this problem set, you will develop procedures to solve some problems in genomics. The main problem we want to solve is known as the sequence alignment problem. It is used to compare genomes to understand how they might be related, both in terms of their function and how they might have evolved from common ancestors.
Most information in biology is encoded in DNA, a nucleic acid with two very useful properties for storing and processing information. The first is that DNA is very stable, so it preserves information for a long time. This is why it was possible to sequence DNA from a woolly mammoth, even though it was over 300,000 years old.
The second property is that DNA can be reliably replicated. DNA is composed of two chains of nucleotides that are intertwined in a double helix. The chains are connected by chemical bonds. Each nucleotide is one of four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Because of their chemical structure, only certain pairs of the bases can form bonds — Adenine bonds with Thymine, and Cytosine bonds with Guanine. We call bases that bond with each other complementary, so the complement of A is T, the complement of T is A, the complement of C is G, and the complement of G is C. These bonds connect the two strands of the double helix, so the two chains are redundant. The second chain contains no extra information; it can be completely determined from the first chain because of the chemical bonds. For example, anywhere we know there is an A in the first chain, we know there is a T in the other chain. This means that DNA can be copied by unraveling the strands to produce two separate strands. The nucleotides in each strand will then form chemical bonds with their complementary bases, thus forming a new double-stranded DNA that is equivalent to the first.
We will use the symbols 'A, 'C, 'G, and 'T to represent the four bases. A symbol is just a single quote followed by a string of one or more characters. You can’t do much with symbols, but you can test whether two symbols are the same using eq?, which is all we need. For example,
(eq? 'A 'A)
evaluates to #t, and
(eq? 'C 'G)
evaluates to #f.
You should get these interactions:
> (nucleotide-complement 'A)
T
> (nucleotide-complement 'G)
C
> (nucleotide-complement 'C)
G
> (nucleotide-complement 'T)
A
To make it easier to work with DNA sequences, we have provided a procedure make-nucleotide-sequence in genomes.rkt that takes as input a string (e.g., "ACAT") and outputs a list of base symbols corresponding to the input string (for the example, (A C A T), which is a list of four symbols, 'A, 'C, 'A, 'T, but when DrRacket prints out symbols it does not include the '). You do not need to modify make-nucleotide-sequence, but understanding it will help you with the next question.
To define make-nucleotide-sequence, we first define a procedure that takes a single character (#\A is how Scheme represents a single A character) as its input, and outputs the corresponding base symbol:
(define (base-from-char c)
  (if (eq? c #\A) 'A
      (if (eq? c #\T) 'T
          (if (eq? c #\G) 'G
              (if (eq? c #\C) 'C
                  (error "Bad char: " c))))))
(Note: the actual definition in genome.rkt is a bit more complex to also handle lower-case characters.)
Then, to define make-nucleotide-sequence to take as input a string and output the corresponding list of symbols, we use the list-map procedure (defined in Chapter 5) to apply base-from-char to every character in the input string. To convert the string to a list of characters, we use the built-in procedure string->list.
(define (make-nucleotide-sequence s)
  (list-map base-from-char (string->list s)))
This is already defined in the provided genome.rkt. After you run ps2.rkt (which loads genome.rkt), you should be able to evaluate (make-nucleotide-sequence "ACAT") and get the result (A C A T).
> (sequence-complement (list 'A 'C 'A 'T))
(T G T A)
> (sequence-complement null)
()
> (sequence-complement (make-nucleotide-sequence "GATTACA"))
(C T A A T G T)
(Hint: the body of my procedure for this is only 34 characters long.)
The DNA copying mechanism is quite reliable, making about 1 error for every 100,000 nucleotides copied. The human genome (that is, all of a human’s genetic information) is about 6 billion base pairs long, though, and since both strands serve as templates, copying it means synthesizing roughly 12 billion new nucleotides. So every time a human cell (which contains two versions of the genome, one inherited from each parent) divides, there would be about 120,000 errors. Fortunately, biology also has mechanisms for error-correction that fix most of the copying mistakes. After the error correction, the remaining number of mistakes is about one per 100 million bases copied. (For more on this, see DNA Replication and Causes of Mutation, Nature Education, 2008.)
The mutations that remain after the error-correction are mostly inconsequential, but some of them change the proteins the cell produces in ways that change the organism. Most changes are disastrous, and leave a cell that cannot produce a normal organism. Very occasionally, the copying mistakes turn out to be useful and lead to an organism that is more likely to survive and reproduce than organisms whose DNA does not have the mutations. As a result of natural selection, the new, mutated, version of the DNA will become the dominant version.
A very rough measure of the distance between two nucleotide sequences is to just count the number of positions where the bases are different. This is known as the Hamming distance. (The Hamming distance is very useful for error-correcting codes and cryptography, but, as we will see soon, not actually very useful in genomics.)
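As an illustrative sketch (this is not one of the problem set questions, and hamming-distance is a made-up name, not a procedure provided in genomes.rkt), Hamming distance over two equal-length lists of base symbols could be computed like this:

```scheme
;; Sketch: Hamming distance between two equal-length lists of base
;; symbols - count the positions where the bases differ.
;; (hamming-distance is a made-up name, not part of the provided code.)
(define (hamming-distance s t)
  (if (null? s)
      0
      (+ (if (eq? (car s) (car t)) 0 1)
         (hamming-distance (cdr s) (cdr t)))))
```

For example, (hamming-distance (make-nucleotide-sequence "ACAT") (make-nucleotide-sequence "ACGT")) evaluates to 1, since the sequences differ only in the third position.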
Not all of the copying mistakes are point mutations, however. It is also (relatively) common for there to be a copying error where one or more bases are skipped (a deletion) or added (an insertion) in the (mis-)copy. This makes it tougher to measure how closely related two sequences are. For example, a single base insertion at the beginning of a sequence makes every subsequent base a mismatch, so the Hamming distance would be (nearly) the length of the rest of the sequence even though there was only a single copying mistake.
To account for this we need a more complex way of measuring distance, known as Levenshtein distance, or more simply, edit distance. The edit distance accounts for the three different types of copying errors that might occur: a base could be skipped in the copy (deletion), a base could be inserted between two of the original bases (insertion), or a base could be replaced with a different base (mutation). To understand how closely related two DNA sequences are, a biologist wants to know the number of copying errors that separate them. (In reality, if both sequences started from a common ancestor, the errors happened along the paths to both of the sequences. But, an insertion in one sequence is similar to a deletion in the other sequence, so it is still useful to just look for the number of copying errors between the two comparison sequences.)
The goal of sequence alignment is to find the minimum number of edits necessary to make two (or more) sequences identical. Biologists usually think of this as inserting gaps in the sequences to make them line up better (hence the name, alignment). For example, if the two sequences are catcatcgta and catagcatgt, if we line them up directly:
catcatcgta
|||xxxxxxx
catagcatgt
there are only three matching bases and 7 mismatches (for an edit score of 7). If we insert some gaps, more of the bases will match:
cat--catcgta
|||xx|||x||x
catagcat-gt-
Now there are 8 matches and 4 gaps (two in the top sequence and two in the bottom sequence), so the total number of edits is 4.
To find the best alignment, we want to select the gaps in a way that minimizes the total edit distance. For example, if the two sequences are "ACAT" and "AATC" there are many (infinitely many) possible sequences of copying errors that start from "ACAT" and lead to "AATC". For example, we could use four deletions followed by four insertions, or three mutations (replacing the last three bases), or one deletion (of the C) followed by one insertion (of the C at the end of the second sequence). The minimum number of copying errors between the two sequences is known as their edit distance. For the example, the edit distance is 2 since there is no way to transform "ACAT" into "AATC" with fewer than two copying errors, but there is a way to do it with two. (We are simplifying things to assume all types of copying errors are equally likely. In biology, this is not the case, and it is more likely for a long sequence of bases to be inserted (or deleted) at once than for the same number of point mutations to occur. So, biological algorithms for measuring evolutionary distance such as the Smith-Waterman algorithm are a bit more complicated than this.)
You do not need to turn in your answer for this question, but should work out the answers by hand. This will be useful for making sure you understand how edit distance works, and for testing your procedure later.
An algorithm for finding the edit distance between two sequences, s and t, is to find the minimum of all possibilities. Consider the sequences as lists of elements: s is (s0 s1 s2 s3 ... sn) and t is (t0 t1 t2 ... tm) (note that we cannot assume the number of elements in each sequence is the same). So, to start we have three options for the first element:
We want to pick the option that leads to the minimum total edit distance. To know which is best, though, we need to know the edit distance of the sequences that remain after the choice. This is a recursive definition!
We want to find the value of the minimum of the three options:
The built-in min procedure will be helpful here. It takes any number of inputs, and evaluates to the value of the minimum input.
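Assuming sequences are represented as lists of base symbols (as produced by make-nucleotide-sequence), one way to sketch the three-options recursion is shown below. This is just one reasonable shape for the procedure, not necessarily the definition you should turn in:

```scheme
;; Sketch of the three-options recursion over lists of base symbols.
;; When either sequence is empty, the only possible edits are the
;; insertions or deletions needed to account for the remaining bases.
(define (edit-distance s t)
  (cond ((null? s) (length t))                ; only insertions remain
        ((null? t) (length s))                ; only deletions remain
        (else
         (min (+ 1 (edit-distance (cdr s) t))    ; delete first of s
              (+ 1 (edit-distance s (cdr t)))    ; insert first of t
              (+ (if (eq? (car s) (car t)) 0 1)  ; match (0) or mutate (1)
                 (edit-distance (cdr s) (cdr t)))))))
```

For example, (edit-distance (make-nucleotide-sequence "ACAT") (make-nucleotide-sequence "AATC")) evaluates to 2, matching the hand analysis above.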
Try your procedure on the first four examples above. You can try it on the longer example also, but unless you were extraordinarily clever, it will probably not finish executing. If the execution is taking too long, you can use the Stop button at the top right of the DrRacket window to kill it.
We have a serious problem if our edit-distance procedure takes too long to execute on strings like "AAAAAAAAAAAAAAAAAA" (18 bases), since proteins contain hundreds of amino acids (each of which requires 3 bases to encode). If we want to solve any interesting problems in biology we need an edit-distance procedure that is much faster than this one.
Why is the edit-distance procedure you defined for Question 7 so slow? (Think about this yourself first before reading on.)
The reason the recursive definition of edit distance as described is so slow is that it does so much redundant work. For example, suppose we are computing (edit-distance "ACT" "ACT"). Using the three options, we want to find
(min (+ 1 (edit-distance "CT" "ACT"))
     (+ 1 (edit-distance "ACT" "CT"))
     (edit-distance "CT" "CT"))
To evaluate this application expression, the interpreter needs to evaluate all the subexpressions. The first operand subexpression is (+ 1 (edit-distance "CT" "ACT")) which requires evaluating (edit-distance "CT" "ACT"). As before, we need to find the minimum of the three options. In this case,
(min (+ 1 (edit-distance "T" "ACT"))
     (+ 1 (edit-distance "CT" "CT"))
     (+ 1 (edit-distance "T" "CT")))
Note that one of these options requires evaluating (edit-distance "CT" "CT"), which is identical to the third option for the first edit-distance call. But the interpreter doesn’t know that. It just follows the evaluation rules, re-evaluating the same expression every time it appears. In evaluating a big edit-distance application, there are quadrillions of identical edit-distance applications, all of which the evaluator redundantly re-evaluates.
One solution is to store the results of previous evaluations, so they don’t need to be recomputed. This is known as memoizing.
We have provided a procedure, memoize, in genomes.rkt that does this for procedures that take two inputs. It is not necessary for you to understand how memoize is defined (indeed, it uses some special forms that we have not yet introduced). The memoize procedure takes as input a procedure (which must be a procedure that takes two inputs, like edit-distance does) and outputs a new procedure. The output procedure is similar to the input procedure, except whenever it is evaluated it stores the result in a table. If it is re-evaluated on the same inputs, instead of needing to evaluate the full function, it looks up the value from the previous evaluation in the table and produces that value right away.
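To give a flavor of how such a procedure might work (this is only a sketch, not the actual definition in genomes.rkt), here is one way to memoize a two-input procedure using a Racket hash table:

```scheme
;; Sketch of a two-input memoize (NOT the actual definition in
;; genomes.rkt): results are cached in a hash table keyed on the inputs.
(define (memoize f)
  (let ((table (make-hash)))
    (lambda (x y)
      (let ((key (list x y)))
        (if (hash-has-key? table key)
            (hash-ref table key)           ; reuse the stored result
            (let ((result (f x y)))
              (hash-set! table key result) ; store for future calls
              result))))))
```

With this sketch, (define memoized-add (memoize +)) followed by (memoized-add 3 4) evaluates to 7; a second (memoized-add 3 4) looks the result up in the table instead of re-adding.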
Here are some examples using memoize:
> (define memoized-add (memoize +))
> (memoized-add 3 4)
7
> (memoized-add 3 4)
7
For a simple function like +, memoizing doesn’t make it faster (in fact, it makes it slower). We also can’t tell that it is memoized, since + behaves like a proper mathematical function. Instead, try:
> (define memoized-hello
    (memoize (lambda (firstname lastname)
               (printf "Hello ~a ~a!~n" firstname lastname))))
> (memoized-hello "David" "Evans")
Hello David Evans!
> (memoized-hello "David" "Evans")
> (memoized-hello "Thomas" "Jefferson")
Hello Thomas Jefferson!
The second evaluation of (memoized-hello "David" "Evans") produces no output, since the memoized-hello procedure already knows the result from the recorded first evaluation and does not evaluate the (side-effect) procedure printf.
Note that the simple solution like,
(define memoized-edit-distance (memoize edit-distance))
unfortunately doesn’t work. The reason is that only the outer call is memoized: the recursive calls inside the definition still refer to the original edit-distance, not to memoized-edit-distance, so the full evaluation is still done without using the memo table.
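To see the pattern that does work, here is a hypothetical two-input example: grid-paths counts the paths through an m-by-n grid moving only right or down. It is not part of the problem set, and the inline memoize stand-in is just to make the sketch self-contained (genomes.rkt provides the real one). Because grid-paths itself is defined as the output of memoize, its recursive calls also go through the memo table:

```scheme
;; A stand-in memoize so this sketch runs on its own
;; (genomes.rkt provides the real one).
(define (memoize f)
  (let ((table (make-hash)))
    (lambda (x y)
      (unless (hash-has-key? table (list x y))
        (hash-set! table (list x y) (f x y)))
      (hash-ref table (list x y)))))

;; Hypothetical example: grid-paths is defined as the memoized version
;; of itself, so the recursive calls inside the lambda also hit the
;; memo table. Without memoization this recursion takes exponential time.
(define grid-paths
  (memoize
   (lambda (m n)
     (if (or (= m 0) (= n 0))
         1
         (+ (grid-paths (- m 1) n)
            (grid-paths m (- n 1)))))))
```

For example, (grid-paths 2 2) evaluates to 6, and even (grid-paths 20 20) returns almost instantly, because every subproblem is computed only once.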
If you define memoized-edit-distance correctly, you should be able to evaluate,
(memoized-edit-distance apoe-normal apoe-bad)
to find the edit distance between the normal APOE gene and the E4 variant of the APOE gene associated with Alzheimer’s disease (it should take about 10-30 seconds). An individual who has the E4 variant in both copies of their genome is about 30 times more likely to develop Alzheimer’s disease by age 75, compared to individuals who have both copies of their genome with the normal variant. With no copies of the E4 variant, someone has a 7% chance of developing Alzheimer’s; with one E4 variant, there is a 14% chance; with both copies as E4 variant, the chance of developing Alzheimer’s by age 75 is around 80%.