Logo
  • Home
  • Classes
  • Conveying Computing
  • Exams
  • Fractal Gallery
  • Guides
  • Problem Sets
  • Syllabus

Problem Set 2: Sequence Alignment

Turn-in Checklist:You and your partner should together turn in one printed document and one (final) electronic submission.

  1. Bring in to class a stapled turn-in containing your written and printed answers by 11:01am Wednesday, 14 September (including the code you submitted electronically). In addition to your names, please include your UVA IDs (e.g., mst3k) in big block lettersat the top of your turn-in.
  2. Use our automatic adjudication service to submit a single Scheme file (a modification of ps2.rkt) containing all your code. You must define an author list containing the UVA IDs of both partners as its elements. Do this by 10:55am on Wednesday, 14 September.

Collaboration Policy

Same as Problem Set 1, except with different partners: For this assignment you should work with your assigned partner (partner assignments are posted on the course blog post that announces this assignment). Before you start working with your partner, you should read through the assignment yourself and think about the questions. Both partners must fully understand everything you turn in and submit for this. Even if you believe you are finished with the assignment because you have successfully completed all the questions, you are not finished until your partner also understands everything in it.

Remember to follow the pledge you read and signed at the beginning of the semester. For this assignment, you may consult any outside resources, including books, papers, web sites and people, you wish except for materials from previous cs1120, cs150, and cs200 courses. You may consult an outside person (e.g., another friend who is a CS major but is not in this class) who is not a member of the course staff, but that person cannot type anything in for you and all work must remain your own. That is, you can ask general questions such as “can you explain definitions to me?” or “how do lists work in Scheme?”, but outside sources should never give you specific answers to problem set questions. If you use resources other than the class materials, lectures and course staff, explain what you used in your turn-in.

You are strongly encouraged to take advantage of the scheduled office hours for this course. See the course site for the schedule.

Purpose

  • Practice programming with procedures, including using if, applications, and definitions.
  • Become familiar with Pairs and Lists and how they can be used to manage complex data.
  • Understand and construct recursive definitions.
  • Learn a little about how genomics.
This problem set assumes you have read the course book through the end of Section 5.4.

Becoming Pros at Cons

Question 1:
For this question you should not use DrRacket or any Scheme interpreter. Do not work with your partner (yet). For each fragment below, either:

  1. Explain why the fragment is not a valid Scheme expression; or,
  2. Predict what value the expression will evaluate to. If the expression evaluates to a compound data structure, draw a picture showing the structure. (“Draw the structure” means “draw a box-and-pointer diagram showing all cons pairs,” like in the book.)

Your answers do not have to be correct to receive full credit, but you must clearly explain your reasoning. Assume that all the procedures from Chapter 5 are defined. Your answers do not have to be correct to receive full credit, but you must clearly explain your reasoning.

  1. (cons (cons 1 2) 3)
  2. (cons 1 (cons 2 3))
  3. (car (cons (cons 1 2) null))
  4. (car (cdr (cons (cons 1 2) null)))
  5. (list? (car (list 1 2 3)))
  6. (list? (cdr (list 1 2)))
  7. (list-length (cdr (cdr (cdr (list 1 2 3)))))
  8. (list-length (list-append (list 1 1) (list 2 0)))
  9. (list-length (list-append (list 1 2 3) 4))
  10. (list-length (list-append (list 1 2 3) null))
  11. (list-map (lambda (x) x) (list 1 2 3))
  12. (car (list-map (lambda (x) (> x 2)) (list-map (lambda (x) (+ x 3)) (list 1 2 3))))

For the remainder of this problem set, you should work with your partner. First, compare and discuss your answers on Question 1. If you and your partner have different answers, try to decide which is correct. Then, download the PS2 code so you can check your answers in the interpreter.

Download: Download ps2.zip to your machine and unzip it into your cs1120 directory. It should create a ps2/subdirectory containing these files:

  • ps2.rkt— A template for your answers. You should do the problem set by editing this file.
  • ch5-code.rkt— Code for the list procedures from Chapter 5. You should already understand all this code from reading Chapter 5. (Yes, you should definitely have finished reading through Section 5.4 before this point!)
  • genomes.rkt — Provided code for this problem set. You should examine, but not modify, this file.
Question 2: After you have compared and discussed your Question 1 and 2 answers with your partner, evaluate each one in DrRacket to check your predictions. Remember to load ch5-code.rktfirst for the list procedure definitions. For each fragment that you mispredicted indicate the correct answer and what you misunderstood (including a new picture if the value is a compound data structure).If you address all fragments in Question 1, and for each fragment you either predicted the correct answer (in Question 1) or explained the correct answer (in Question 2) you will receive full credit for Questions 1-2.

Genomics Background

For this problem set, you will develop procedures to solve some problems in genomics. The main problem we want to solve is known as the sequence alignment problem. It is used to compare genomes to understand how they might be related, both in terms of their function and how they might have evolved from common ancestors.


Most information in biology is encoded in DNA, a nucleic acid with two very useful properties for storing and processing information. The first is that DNA is very stable, so preserves information for a long time. This is why it was possible to sequence DNA from a woolly mammoth, even though it was over 300,000 years old.

The second property is that DNA can be reliably replicated. DNA is composed of two chains of nucleotides that are intertwined in a double helix. The chains are connected by chemical bonds. Each nucleotide is one of four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Because of their chemical structure, only certain pairs of the bases can form bonds — Adenine bonds with Thymine, and Cytosine bonds with Guanine. We call bases that bond with each other complementary, so the complement of A is T, the complement of T is A, the complement of C is G, and the complement of G is C. These bonds connect the two strands of the double helix, so the two chains are redundant. The second chain contains no extra information, it can be completely determined by the first chain because of the chemical bonds. For example, anywhere we know there is an A in the first chain, we know there is a T in the other chain. This means that DNA can be copied by unraveling the strands to produce two separate strands. The nucleotides in each strand will then form chemical bonds with their complementary bases, thus forming a new double-stranded DNA that is equivalent to the first.

We will use the symbols, 'A, 'C, 'G, and 'T to represent the four bases. A symbol is just a single quote followed by a string of one or more charaters. You can’t do much with symbols, but you can test if to symbols are the same using eq?, which is all we need. For example,

(eq? 'A 'A)

evaluates to #t, and

(eq? 'C 'G)

evaluates to #f.

Question 3: Define a procedure, nucleotide-complement, that takes as input a single base represented by its symbol ('A, 'C, 'T, or 'G) and evaluates to the complement of that base. (Note: if you are stuck on this question, you may want to read ahead and look at how we defined base-from-char for some ideas.)

You should get these interactions:

> (nucleotide-complement 'A)
T
> (nucleotide-complement 'G)
C
> (nucleotide-complement 'C)
G
> (nucleotide-complement 'T)
A

To make it easier to work with DNA sequences, we have provided a procedure make-nucleotide-sequence in genomes.rkt that takes as input a string (e.g., "ACAT") and outputs a list of base symbols corresponding to the input string (for the example, (A C A T), which is a list of four symbols, 'A, 'C, 'A, 'T, but when DrRacket prints out symbols it does not include the '). You do not need to modify make-nucleotide-sequence, but understanding it will help you with the next question.

To define make-nucleotide-sequence, we first define a procedure that takes a single character (#\A is how Scheme represents a single A character) as its input, and outputs the corresponding base symbol:

(define (base-from-char c)
  (if (eq? c #\A)
      'A
      (if (eq? c #\T)
          'T
          (if (eq? c #\G)
              'G
              (if (eq? c #\C)
                  'C
                  (error "Bad char: " c))))))

(Note: the actual definition in genome.rkt is a bit more complex to also handle lower-case characters.)

Then, to define make-nucleotide-sequence to take as input a string and output the corresponding list of symbols, we use the list-map procedure (defined in Chapter 5) to apply base-from-char to every character in the input string. To convert the string to a list of characters, we use the built-in procedure string->list.

(define (make-nucleotide-sequence s)
  (list-map base-from-char (string->list s)))

This is already defined in the provided genome.rkt. After you run ps2.rkt (which loads genome.rkt(make-nucleotide-sequence "ACAT") and get the result (A C A T).

Question 4: Define a procedure, sequence-complement, that takes as input a list of symbols representing a DNA sequence, and outputs a list of symbols that are the complement of that sequence. For example,

> (sequence-complement (list 'A 'C 'A 'T))
(T G T A)
> (sequence-complement null)
()
> (sequence-complement (make-nucleotide-sequence "GATTACA"))
(C T A A T G T)

(Hint: the body of my procedure for this is only 34 characters long.)

Hamming Distance

The DNA copying mechanism is quite reliable, making about 1 error for every 100,000 nucleotides. The human genome (that is, all of a human’s genetic information) is about 6 Billion bases long, though, so that means every time a human cell (which contains two versions of the genome, one inherited from each parent) divides there would be about 120,000 errors. Fortunately, biology also has mechanisms for error-correction that fix most of the copying mistakes. After the error correction, the remaining number of mistakes is about one per 100 million bases copied. (For more on this, see DNA Replication and Causes of Mutation, Nature Education, 2008.)

The mutations that remain after the error-correction are mostly inconsequential, but some of them change the proteins the cell produces in ways that change the organism. Most changes are disastrous, and leave a cell that cannot produce a normal organism. Very occasionally, the copying mistakes turn out to be useful and lead to an organism that is more likely to survive and reproduce than organisms whose DNA does not have the mutations. As a result of natural selection, the new, mutated, version of the DNA will become the dominant version.

A very rough measure of the distance between two nucleotide sequences is to just count the number of positions where the bases are different. This is known as the Hamming distance. (The Hamming distance is very useful for error-correcting codes and cryptography, but, as we will see soon, not actually very useful in genomics.)

Question 5: Define a procedure, count-matches, that takes as inputs a list and a value, and outputs the number of elements in the list that match the value. For example, (count-matches null 'A) should evaluate to 0 and (count-matches (list 'A 'C 'A 'T) 'A) should evaluate to 2. (Note: This is not actually useful for defining Hamming distance, but this question will give you practice with an easier recursive procedure before trying Hamming distance.)
Question 6: Define a procedure, hamming-distance, that takes as inputs two lists and outputs the number of positions where the list elements are different. You may assume that both inputs are the same length (although it would be even better to define a procedure that does not assume this!). Here are some examples: (hamming-distance null null) ==> 0, (hamming-distance (list 'A 'C 'A 'T) (list 'A 'C 'T 'A)) ==> 2.

Edit Distance

In reality, not all of the copying mistakes are point mutations, however. It is also (relatively) common for there to be a copying error where one or more bases are skipped (a deletion), or added (an insertion) into the (mis-)copy. This makes it tougher to measure how closely related two sequences are. For example, a single base insertion at the beginning of a sequence makes every subsequent base a mismatch and the Hamming distance would be (nearly) the length of the rest of the sequence even though there was only a single copying mistake.

To account for this we need a more complex way of measuring distance, known as Levenshtein distance, or more simply, edit distance. The edit distance accounts for the three different types of copying errors that might occure: a base could be skipped in the copy (deletion), a base could be inserted between two of the original bases (insertion), or a base could be replaced with a different base (mutation). To understand how closely related two DNA sequences are, a biologists wants to know the number of copying errors that separate them. (In reality, if both sequences may have started from a common ancestor, the errors happened along the paths to both of the sequences. But, an insertion in one sequence is similar to a deletion in the other sequence, so it is still useful to just look for the number of copying errors between the two comparison sequences.)

The goal of sequence alignment is to find the minimum number of edits necessary to make two (or more) sequences identical. Biologists usually think of this as inserting gaps in the sequences to make them line up better (hence the name, alignment). For example, if the two sequences are catcatcgta and catagcatgt, if we line them up directly:

     catcatcgta
     |||xxxxxxx
     catagcatgt

there are only three matching bases and 7 mismatches (for an edit score of 7). If we insert some gaps, more of the bases will match:

     cat--catcgta
     |||xx|||x||x
     catagcat-gt-

Now, there are now 8 matches and 3 insertions (two in the top sequence and two in the bottom sequence), and the total number of edits is 4.

To find the best alignment, we want to select the gaps in a way that minimizes the total edit distance. For example, if the two sequences are "ACAT" and "AATC" there are many (infinitely many) possible sequences of copying errors that start from "ACAT" and lead to "AATC". For example, there would be four deletions followed by four insertions, or three mutations (replacing the last three bases), or one deletion (of the C) followed by one insertion (of the C at the end of the second sequence). The minimum number of copying errors between the two sequences is known as their edit distance. For the example, the edit distance is 2 since there is no way to transform "ACAT" into "AATC" with fewer than two copying errors, but there is a way to do it with two. (We are simplifying things to assume all types of copying errors are equally likely. In biology, this is not the cases, and it is more likely for a long sequence of bases to be inserted (or deleted) at once, than for the same number of point mutations. So, biological algorithms for measuring evolutionary distance such as the Smith-Waterman algorithm are a bit more complicated than this.)

Question 7:For each of the following pairs of sequences, manually figure out the edit distance.

  1. “ACAT”, “ACA”
  2. “AACCT”, “CCCCT”
  3. “GATTACA”, “GACACA”
  4. “ACAT”, “”
  5. “AAAAAAAAAAAAAAAAAA”, “AAAAAAAAAAAAAAAAAA”

You do not need to turn in your answer for this question, but should work out the answers by hand. This will be useful for making sure you understand how edit distance works, and for testing your procedure later.

An algorithm for finding the edit distance between two sequences, s and t, is to find the minimum of all possibilities. Consider the sequence as lists of elements, s is (s0 s1 s2 s3 ... sn) and t is (t0 t1 t2 ... tm) (note that we cannot assume the number of elements in each sequence is the same). So, to start we have three options for the first element:

  1. the base s0 was inserted in s. This has edit cost of 1 for the insertion, and means the sequences before the insert were (s1 s2 ... sn) and (t0 t1 t2 ... tm).
  2. the base t0 was inserted in t (or equivalently, there was some base x that was before s0 which was deleted). This has edit cost of 1 for the deletion, and means the sequences before the deletion were (s0 s1 s2 ... sn) and (t1 t2 ... tm).
  3. there was no insertion of deletion. This means s0 and t0 are aligned. If they match, the edit cost is 0 since they are the same no edit is needed. If they are different, the edit cost is 1 since there was a point mutation. After matching s0 and t0, the remaining sequences are (s1 s2 ... sn) and (t1 t2 ... tn).

We want to pick the option that leads to the minimum total edit distance. To know which is best, though, we need to know the edit distance of the sequences that remain after the choice. This is a recursive definition!

We want to find the value of the minimum of the three options:

  1. 1 + edit distance between (s1 s2 ... sn) and (t0 t1 t2 ... tm)(insert)
  2. 1 + edit distance between (s0 s1 s2 ... sn) and (t1 t2 ... tm)(delete)
  3. edit distance between (s1 s2 ... sn) and (t1 t2 ... tm) + 0 (if s0 matches t0) or 1 (if s0 does not match t0).

The built-in min procedure will be helpful here. It takes any number of inputs, and evaluates to the value of the minimum input.

Question 8: Define a procedure, edit-distance, that takes as input two lists and outputs the edit distance between those lists.Try this on your own first, but its pretty tricky. If you get stuck, there are some hints here and don’t forget to take advantage of the available help.

Try your procedure on the first four examples above. You can try it on the longer example also, but unless you were extraordinarly clever, it will probably not finish executing. If the execution is taking too long, you can use the Stop button at the top right of the DrRacket window to kill it.

Memoizing

We have a serious problem if our edit-distance procedure takes too long to execute on strings like "AAAAAAAAAAAAAAAAAA" (10 bases), since proteins contain hundreds of amino acids (each of which requires 3 bases to encode). If we want to solve any interesting problems in biology we need a edit-distance procedure that is much faster than this one.

Why is the edit-distance procedure you defined for Question 7 so slow? (Think about this yourself first before reading on.)

The reason the recursive definition of edit distance as described is so slow is because it is doing so much redundant work. For example, supposed were are computing (edit-distance "ACT" "ACT"). Using the three options, we want to find

(min (+ 1 (edit-distance "CT" "ACT"))
     (+ 1 (edit-distance "ACT" "CT"))
     (edit-distance "CT" "CT"))

To evaluate this application expression, the interpreter needs to evaluate all the subexpressions. The first operand subexpression is (+ 1 (edit-distance "CT" "ACT")) which requires evaluating (edit-distance "CT" "ACT"). As before, we need to find the minimum of the three options. In this case,

(min (+ 1 (edit-distance "T" "ACT"))
     (+ 1 (edit-distance "CT" "CT"))
     (+ 1 (edit-distance "T" "CT")))

Note that one of these, (+ 1 (edit-distance "CT" "CT")) is identical to the third option for the first edit-distance call. But, the interpreter doesn’t know that. It just follows the evaluation rules, re-evaluating the same expression every time it appears. In evaluating a big edit-distance application, there are quadrillions of identical evaluations of edit-distance, all of which are redundantly evaluated by the evaluator.

One solution is to store the results of previous evaluations, so they don’t need to be recomputed. This is known as memoizing.

We have provided a procedure, memoize in genomes.rkt that does this for procedure that take two inputs. It is not necessary for you to understand how memoize is defined (indeed, it uses some special forms that we have not yet introduced). The memoize procedure takes as input a procedure (that must be a procedure that takes two inputs, like edit-distance does) and outputs a new procedure. The output procedure is similar to the input procedure, except whenever it is evaluated it store the result in a table. If it is re-evaluated on the same inputs, instead of needing to evaluate the full function, it looks up the value from the previous evaluation in the table and evaluates to that value right away.

Here are some examples using memoize:

> (define memoized-add (memoize +))
> (memoized-add 3 4)
7
> (memoized-add 3 4)
7

For a simple function like +, memoizing doesn’t make it faster (in fact, it is slowed). We also can’t tell that it is memoized since + behaves like a proper mathematical function. Instead, try:

> (define memoized-hello (memoize (lambda (firstname lastname) (printf "Hello ~a ~a!~n" firstname lastname))))
> (memoized-hello "David" "Evans")
Hello David Evans!
> (memoized-hello "David" "Evans")
> (memoized-hello "Thomas" "Jefferson")
Hello Thomas Jefferson!

The second evaluation of (memoized-hello "David" "Evans") produces no output, since the memoized-hello procedure already knows the result from the recorded first evaluation and does not evaluate the (side-effect) procedure printf.

It is not necessary to answer the next question to achieve “gold”-star level performance on this problem set. We hope ambitious students will attempt to solve this question, and some will be able to do it, but if you have been able to solved all the previous questions well you have already done enough to receive a gold star on this assignment. (Regardless of whether or not you answer this question, don’t forget to follow the submission instructions at the bottom of this assignment.)
Question 9: Define a procedure, memoized-edit-distance, that computes the edit distance efficiently using memoize to avoid re-computing redundant evaluations.

Note that the simple solution like,

(define memoized-edit-distance (memoize edit-distance))

unfortunately doesn’t work. The reason for this is it only memoizes the first call. The recursive calls to edit-distance inside its definition are not using the memoized-edit-distance, so the evaluation is still done completely instead of using memoization.

If you define memoized-edit-distance correctly, you should be able to evaluate,

    (memoized-edit-distance apoe-normal apoe-bad)

to find the edit distance between the normal APOE gene and the E4 variant of the APOE gene associated with Alzheimer’s disease (it should take about 10-30 seconds). An individual who has the E4 variant in both copies of their genome is about 30 times more likely to develop Alzheimer’s disease by age 75, compared to individuals who have both copies of their genome with the normal variant. With no copies of the E4 variant, someone has a 7% chance of developing Alzheimer’s; with one E4 variant, there is a 14% chance; with both copies as E4 variant, the chance of developing Alzheimer’s by age 75 is around 80%.

Automatic Testing: Submit a single file containing all the code for your answers to the Alonzo-Bot: https://church.cs.virginia.edu/cs1120/ps2.html. The server will run a series of tests on your submission and report the result. You may submit as many times as you want until you are satisfied with your test results; only the last submission counts. You may submit using either of your NetBadge logins, and only the last submission submitted by either one of your logins will be used for grading. Your scheme file should be a modification of the ps2.rkt file. Remember that you also still need to submit a paper document containing your written and printed answers, including the code you submitted electronically.
Credits: This problem set was originally developed for UVa cs1120 Fall 2011 by David Evans with help from Ivan Alagenchev, Jonathan Burket, Peter Chapman, Jiamin Chen, Joseph Featherston, Valerie Sapp, and Kristina Shichanin. Cameron Mura provided assistance with the biology (but is not responsible for any biological mistakes or gross oversimplifications here!).
Print Friendly Print Get a PDF version of this webpage PDF

4 Responses to “Problem Set 2: Sequence Alignment”

  1. Joshua Whelan says:
    9 September 2011 at 2:25 pm

    The hint for question is not loading and giving a message saying “Page not found.”

    Log in to Reply
    • David Evans says:
      9 September 2011 at 4:50 pm

      Thanks for posting. I’ve put up two hints now. Hopefully these will be helpful for getting started on this.

      Log in to Reply
  2. Kevin Liu says:
    9 September 2011 at 8:45 pm

    Do we need to increase the size of the memory available? The programs that I keep trying say that they’ve run out of memory.

    Log in to Reply
    • David Evans says:
      9 September 2011 at 9:38 pm

      You shouldn’t need to increase the memory size for the small edit distance problems. For the larger one, if you are not doing the memoization, no matter how big you make the memory it will still run out. If you are running out of memory on the small problems, you most likely have a recursive definition that is actually circular (it is not making progress towards the base case, so you are just using more and more memory until it runs out).

      Log in to Reply

Leave a Reply Cancel reply

You must be logged in to post a comment.

cs1120 | RSS | Comments RSS | Book | Using These Materials | Login | Admin | Powered by Wordpress