We will now use dynamic programming to tackle the harder problem of general sequence alignment. The algorithm we will develop in the following sections to solve sequence alignment is known as the Needleman-Wunsch algorithm.
Dynamic programming vs. memoization
Before we dive into the algorithm, a final note on memoization is in order. Much like the Fibonacci problem, the sequence alignment problem can be solved in either a top-down or bottom-up approach.
In a top-down recursive approach we can use memoization to create a potentially large dictionary indexed by each of the subproblems that we are solving (aligned sequences). This requires \(O\left(n^{2} m^{2}\right)\) space if we index each subproblem by the starting and end points of the subsequences for which an optimal alignment needs to be computed. The advantage is that we solve each subproblem at most once: if it is not in the dictionary, the problem gets computed and then inserted into the dictionary for further reference.
In a bottomup iterative approach we can use dynamic programming. We define the order of computing subproblems in such a way that a solution to a problem is computed once the relevant subproblems have been solved. In particular, simpler subproblems will come before more complex ones. This removes the need for keeping track of which subproblems have been solved (the dictionary in memoization turns into a matrix) and ensures that there is no duplicated work (each subalignment is computed only once).
Thus in this particular case, the only practical difference between memoization and dynamic programming is the cost of recursive calls incurred in the memoization case (space usage is the same).
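The two approaches can be sketched side by side. The following Python sketch computes the optimal alignment score both top-down with memoization and bottom-up with a table; the scoring values (match +1, mismatch −1, gap −2) are illustrative placeholders, not scores prescribed by the text:

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative scores, not from the text

def align_topdown(S, T):
    """Top-down recursion with memoization: each (i, j) subproblem is solved once."""
    @lru_cache(maxsize=None)
    def F(i, j):
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        s = MATCH if S[i - 1] == T[j - 1] else MISMATCH
        return max(F(i - 1, j) + GAP,       # gap move
                   F(i, j - 1) + GAP,       # gap move
                   F(i - 1, j - 1) + s)     # match or mutation
    return F(len(S), len(T))

def align_bottomup(S, T):
    """Bottom-up dynamic programming: fill the table in order of increasing i, j."""
    n, m = len(S), len(T)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0:
                F[i][j] = j * GAP
            elif j == 0:
                F[i][j] = i * GAP
            else:
                s = MATCH if S[i - 1] == T[j - 1] else MISMATCH
                F[i][j] = max(F[i - 1][j] + GAP,
                              F[i][j - 1] + GAP,
                              F[i - 1][j - 1] + s)
    return F[n][m]
```

Both functions return the same optimal score; the memoized version simply pays the extra overhead of recursive calls, while the dictionary built by `lru_cache` plays the role of the matrix.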
Problem Statement
Suppose we have an optimal alignment for two sequences \(S_{1 \ldots n}\) and \(T_{1 \ldots m}\) in which \(S_i\) matches \(T_j\). The key insight is that this optimal alignment is composed of an optimal alignment between \(\left(S_{1}, \ldots, S_{i-1}\right)\) and \(\left(T_{1}, \ldots, T_{j-1}\right)\) and an optimal alignment between \(\left(S_{i+1}, \ldots, S_{n}\right)\) and \(\left(T_{j+1}, \ldots, T_{m}\right)\). This follows from a cut-and-paste argument: if one of these partial alignments is suboptimal, then we cut-and-paste a better alignment in place of the suboptimal one. This achieves a higher score for the overall alignment and thus contradicts the optimality of the initial global alignment. In other words, every subpath in an optimal path must also be optimal. Notice that the scores are additive, so the score of the overall alignment equals the sum of the scores of the alignments of the subsequences. This implicitly assumes that the subproblems of computing the optimal scoring alignments of the subsequences are independent. We need to biologically motivate that such an assumption leads to meaningful results.
Index space of subproblems
We now need to index the space of subproblems. Let \(F_{i,j}\) be the score of the optimal alignment of \(\left(S_{1}, \ldots, S_{i}\right)\) and \(\left(T_{1}, \ldots, T_{j}\right)\). The space of subproblems is \(\left\{F_{i,j} : i \in [0, n],\ j \in [0, m]\right\}\). This allows us to maintain an \((n+1) \times (m+1)\) matrix F with the solutions (i.e. optimal scores) for all the subproblems.
Local optimality
We can compute the optimal solution for a subproblem by making a locally optimal choice based on the results from the smaller subproblems. Thus, we need to establish a recursive function that shows how the solution to a given problem depends on its subproblems. And we use this recursive definition to fill up the table F in a bottomup fashion.
We can consider the 4 possibilities (insert, delete, substitute, match) and evaluate each of them based on the results we have computed for smaller subproblems. To initialize the table, we set \(F_{0,j} = -j \cdot d\) and \(F_{i,0} = -i \cdot d\), since those are the scores of aligning \(\left(T_{1}, \ldots, T_{j}\right)\) with \(j\) gaps and \(\left(S_{1}, \ldots, S_{i}\right)\) with \(i\) gaps (i.e. zero overlap between the two sequences). Then we traverse the matrix column by column, computing the optimal score for each alignment subproblem by considering the four possibilities:
• Sequence S has a gap at the current alignment position.
• Sequence T has a gap at the current alignment position.
• There is a mutation (nucleotide substitution) at the current position.
• There is a match at the current position.
We then use the possibility that produces the maximum score. We express this mathematically by the recursive formula for \(F_{i,j}\):
\[F(0, 0) = 0 \nonumber\]
\[\text{Initialization}: \quad F(i, 0) = F(i-1, 0) - d \nonumber\]
\[F(0, j) = F(0, j-1) - d \nonumber\]
\[\text{Iteration}: \quad F(i, j) = \max\left\{\begin{array}{ll}
F(i-1, j) - d & \text{insert gap in } S \\
F(i, j-1) - d & \text{insert gap in } T \\
F(i-1, j-1) + s\left(x_{i}, y_{j}\right) & \text{match or mutation}
\end{array}\right. \nonumber\]
\[\text{Termination}: \quad \text{bottom right cell, } F(m, n) \nonumber\]
After traversing the matrix, the optimal score for the global alignment is given by \(F_{m,n}\). The traversal order needs to be such that we have solutions to given subproblems when we need them. Namely, to compute \(F_{i,j}\), we need to know the values to the left of, above, and diagonally above-and-left of \(F_{i,j}\) in the table. Thus we can traverse the table in row- or column-major order, or even diagonally from the top left cell to the bottom right cell. Now, to obtain the actual alignment we just have to remember the choices that we made at each step.
Optimal Solution
Paths through the matrix \(F\) correspond to optimal sequence alignments. In evaluating each cell \(F_{i,j}\) we make a choice by selecting the maximum of the three possibilities. Thus the value of each (uninitialized) cell in the matrix is determined either by the cell to its left, the cell above it, or the cell diagonally above and to the left of it. A match and a substitution are both represented as traveling in the diagonal direction; however, a different cost can be applied for each, depending on whether the two base pairs we are aligning match or not. To construct the actual optimal alignment, we need to trace back through our choices in the matrix. It is helpful to maintain a pointer for each cell while filling up the table that shows which choice was made to get the score for that cell. Then we can just follow our pointers backwards to reconstruct the optimal alignment.
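The fill-and-traceback procedure can be sketched as follows; the scoring values (match +1, mismatch −1, gap penalty d = 2) are hypothetical placeholders, since in practice the scores come from the substitution matrix in use:

```python
def needleman_wunsch(S, T, match=1, mismatch=-1, d=2):
    """Fill the score matrix F and a parallel pointer table, then trace back."""
    n, m = len(S), len(T)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, 'up'
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, 'left'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            # store both the best score and the choice that produced it
            F[i][j], ptr[i][j] = max(
                (F[i - 1][j - 1] + s, 'diag'),  # match or mutation
                (F[i - 1][j] - d, 'up'),        # vertical move: S[i-1] against a gap
                (F[i][j - 1] - d, 'left'))      # horizontal move: T[j-1] against a gap
    # follow the pointers backwards from the bottom right corner
    A1, A2, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == 'diag':
            A1.append(S[i - 1]); A2.append(T[j - 1]); i -= 1; j -= 1
        elif move == 'up':
            A1.append(S[i - 1]); A2.append('-'); i -= 1
        else:
            A1.append('-'); A2.append(T[j - 1]); j -= 1
    return F[n][m], ''.join(reversed(A1)), ''.join(reversed(A2))
```

The alignment is built from end to beginning and then reversed, exactly as described above; ties between choices are broken arbitrarily, which is why multiple optimal paths can exist.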
Solution Analysis
The runtime analysis of this algorithm is very simple. Each update takes \(O(1)\) time, and since there are \(mn\) elements in the matrix F, the total running time is \(O(mn)\). Similarly, the total storage space is \(O(mn)\). For the more general case where the update rule is more complicated, the running time may be more expensive. For instance, if the update rule requires testing all sizes of gaps (e.g. the cost of a gap is not linear), then the running time would be \(O(mn(m+n))\).
Needleman-Wunsch in practice
Assume we want to align two sequences S and T, where
S = AGT
T = AAGC
The first step is placing the two sequences along the margins of a matrix and initializing the matrix cells. To initialize we assign a 0 to the first entry in the matrix and then fill in the first row and column based on the incremental addition of gap penalties, as in Figure 2.9 below. Although the algorithm could fill in the first row and column through iteration, it is important to clearly define and set boundaries on the problem.
The next step is iteration through the matrix. The algorithm proceeds either along rows or along columns, considering one cell at a time. For each cell three scores are calculated, depending on the scores of three adjacent matrix cells (specifically the entry above, the one diagonally up and to the left, and the one to the left). The maximum score of these three possible tracebacks is assigned to the entry and the corresponding pointer is also stored. Termination occurs when the algorithm reaches the bottom right corner. In Figure 2.10 the alignment matrix for sequences S and T has been filled in with scores and pointers.
The final step of the algorithm is optimal path traceback. In our example we start at the bottom right corner and follow the available pointers to the top left corner. By recording the alignment decisions made at each cell during traceback, we can reconstruct the optimal sequence alignment from end to beginning and then invert it. Note that in this particular case, multiple optimal pathways exist (Figure 2.11). A pseudocode implementation of the Needleman-Wunsch algorithm is included in Appendix 2.11.4.
Optimizations
The dynamic programming algorithm we presented is much faster than the brute-force strategy of enumerating alignments and it performs well for sequences up to 10 kilobases long. Nevertheless, at the scale of whole genome alignments the algorithm given is not feasible. In order to align much larger sequences we can make modifications to the algorithm and further improve its performance.
Bounded Dynamic Programming
One possible optimization is to ignore Mildly Boring Alignments (MBAs), or alignments that have too many gaps. Explicitly, we can limit ourselves to stay within some distance \(W\) from the diagonal in the matrix \(F\) of subproblems. That is, we assume that the optimizing path in \(F\) from \(F_{0,0}\) to \(F_{m,n}\) is within distance \(W\) along the diagonal. This means that recursion (2.2) only needs to be applied to the entries in \(F\) within distance \(W\) around the diagonal, and this yields a time/space cost of \(O((m+n)W)\) (refer to Figure 2.12).
Note, however, that this strategy is heuristic and no longer guarantees an optimal alignment. Instead it attains a lower bound on the optimal score. This can be used in a subsequent step where we discard the recursions in matrix F which, given the lower bound, cannot lead to an optimal alignment.
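A minimal sketch of this banded scheme follows; the scores (match +1, mismatch −1, gap penalty d = 2) are illustrative placeholders and W is the band half-width:

```python
def banded_score(S, T, W, match=1, mismatch=-1, d=2):
    """Fill only the cells within distance W of the diagonal.
    Returns a lower bound on the optimal global alignment score
    (exact once W is large enough to contain the optimal path)."""
    n, m = len(S), len(T)
    NEG = float('-inf')  # cells outside the band are unreachable
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - W), min(m, i + W) + 1):
            if i == 0 and j == 0:
                continue
            cand = []
            if i > 0 and j > 0:
                s = match if S[i - 1] == T[j - 1] else mismatch
                cand.append(F[i - 1][j - 1] + s)   # match or mutation
            if i > 0:
                cand.append(F[i - 1][j] - d)       # vertical gap move
            if j > 0:
                cand.append(F[i][j - 1] - d)       # horizontal gap move
            F[i][j] = max(cand)
    return F[n][m]
```

Note that W must be at least |n − m| for the bottom right cell to lie inside the band at all; a restricted band can only produce scores at or below the unrestricted optimum.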
Linear Space Alignment
Recursion (2.2) can be solved using only linear space: we update the columns in F from left to right, keeping track only of the most recently updated column, which costs \(O(m)\) space. However, besides the score \(F_{m,n}\) of the optimal alignment, we also want to compute a corresponding alignment. If we use traceback, then we need to store pointers for each of the entries in F, and this costs \(O(mn)\) space.
It is also possible to find an optimal alignment using only linear space! The goal is to use divide and conquer in order to compute the structure of the optimal alignment for one matrix entry in each step. Figure 2.13 illustrates the process. The key idea is that a dynamic programming alignment can proceed just as easily in the reverse direction, starting at the bottom right corner and terminating at the top left. So if the matrix is divided in half, then both a forward pass and a reverse pass can run at the same time and converge in the middle column. At the crossing point we can add the two alignment scores together; the cell in the middle column with the maximum score must fall in the overall optimal path.
We can describe this process more formally and quantitatively. First compute the row index \(u \in \{1, \ldots, m\}\) that is on the optimal path while crossing the \(\frac{n}{2}\)th column. For \(1 \leq i \leq m\) and \(\frac{n}{2} \leq j \leq n\), let \(C_{i,j}\) denote the row index that is on the optimal path to \(F_{i,j}\) while crossing the \(\frac{n}{2}\)th column. Then, while we update the columns of F from left to right, we can also update the columns of C from left to right. So, in \(O(mn)\) time and \(O(m)\) space we are able to compute the score \(F_{m,n}\) and also \(C_{m,n}\), which is equal to the row index \(u \in \{1, \ldots, m\}\) that is on the optimal path while crossing the \(\frac{n}{2}\)th column. Now the idea of divide and conquer kicks in. We repeat the above procedure for the upper left \(u \times \frac{n}{2}\) submatrix of F and also repeat the above procedure for the lower right \((m-u) \times \frac{n}{2}\) submatrix of F. This can be done using \(O(m+n)\) allocated linear space. The running time for the upper left submatrix is \(O\left(\frac{u n}{2}\right)\) and the running time for the lower right submatrix is \(O\left(\frac{(m-u) n}{2}\right)\), which added together gives a running time of \(O\left(\frac{m n}{2}\right) = O(m n)\).
We keep on repeating the above procedure for smaller and smaller submatrices of F while we gather more and more entries of an alignment with optimal score. The total running time is \(O(m n) + O\left(\frac{m n}{2}\right) + O\left(\frac{m n}{4}\right) + \ldots = O(2 m n) = O(m n)\). So, without sacrificing the overall running time (up to a constant factor), divide and conquer leads to a linear space solution (see also Section ?? on Lecture 3).
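One way to sketch this divide-and-conquer scheme in code is Hirschberg's variant of the same idea: it splits the first sequence in half (rather than fixing the middle column) and combines a forward pass with a reverse pass to find where the optimal path crosses. The scoring values (match +1, mismatch −1, gap penalty d = 2) are illustrative placeholders:

```python
def nw_score_row(S, T, match=1, mismatch=-1, d=2):
    """Last row of the alignment score matrix, kept in O(len(T)) space."""
    prev = [-j * d for j in range(len(T) + 1)]
    for i in range(1, len(S) + 1):
        curr = [-i * d] + [0] * len(T)
        for j in range(1, len(T) + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            curr[j] = max(prev[j] - d, curr[j - 1] - d, prev[j - 1] + s)
        prev = curr
    return prev

def _align_single(S, T, match, mismatch, d):
    """Base case: S is one character; either align it to the best position
    of T, or (if scores make that worse) align everything to gaps."""
    scores = [(match if S == t else mismatch) for t in T]
    best = max(scores)
    if best - d * (len(T) - 1) >= -d * (len(T) + 1):
        k = scores.index(best)
        return '-' * k + S + '-' * (len(T) - 1 - k), T
    return S + '-' * len(T), '-' + T

def hirschberg(S, T, match=1, mismatch=-1, d=2):
    """Optimal global alignment in linear space by divide and conquer."""
    if len(S) == 0:
        return '-' * len(T), T
    if len(T) == 0:
        return S, '-' * len(S)
    if len(S) == 1:
        return _align_single(S, T, match, mismatch, d)
    mid = len(S) // 2
    # forward pass over the top half, reverse pass over the bottom half
    top = nw_score_row(S[:mid], T, match, mismatch, d)
    bot = nw_score_row(S[mid:][::-1], T[::-1], match, mismatch, d)[::-1]
    # the split point maximizing the combined score lies on an optimal path
    split = max(range(len(T) + 1), key=lambda j: top[j] + bot[j])
    left = hirschberg(S[:mid], T[:split], match, mismatch, d)
    right = hirschberg(S[mid:], T[split:], match, mismatch, d)
    return left[0] + right[0], left[1] + right[1]
```

Only two score rows are alive at any time, so the space usage is linear, while the halving recursion keeps the total running time at \(O(mn)\) as argued above.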
Details of Needleman–Wunsch algorithm
I try to understand the details of the Needleman–Wunsch algorithm, using an example from the book [Nello Cristianini, Matthew W. Hahn, Introduction to Computational Genomics: A Case Studies Approach, www.computationalgenomics.net].
In my opinion, the algorithm is not described very neatly, so a question arises.
The calculation of the score for the diagonal neighbor is clear,
because this is an alignment without a gap.
What is unclear is the alignment for the left (or above) neighbor. Say, the left neighbor
3.1 Alignment Algorithms and Dynamic Programming
One of the first attempts to align two sequences was carried out by Vladimir Levenshtein in 1965, with a measure called "edit distance", now often called Levenshtein distance. The edit distance is defined as the number of single-character edits necessary to change one word into another. Initially he described written texts and words, but this method was later applied to biological sequences. One of the most commonly used algorithms for computing the edit distance is the Wagner-Fischer algorithm, a dynamic programming algorithm.
Dynamic programming phrases the full problem in terms of optimal solutions to smaller pieces (subproblems). The overall problem can then be expressed as a composition of the subproblems. In addition to the Wagner-Fischer algorithm, numerous other dynamic programming algorithms have been developed for aligning biological sequences, including the Needleman-Wunsch [22] and Smith-Waterman [23] algorithms.
Sequence Alignment Algorithms
Dynamic programming algorithms are recursive algorithms modified to store intermediate results, which improves efficiency for certain problems. The Smith-Waterman (Needleman-Wunsch) algorithm uses dynamic programming to find the optimal local (global) alignment of two sequences, S and T. The alignment algorithm is based on finding the elements of a matrix F, where the element \(F_{i,j}\) is the optimal score for aligning the sequence \((S_1, \ldots, S_i)\) with \((T_1, \ldots, T_j)\). Two similar amino acids (e.g. arginine and lysine) receive a high score, two dissimilar amino acids (e.g. arginine and glycine) receive a low score. The higher the score of a path through the matrix, the better the alignment. The matrix is found by progressively computing the matrix elements, starting at \(F_{0,0}\) and proceeding in the directions of increasing \(i\) and \(j\). Each element is set according to:
\[F(i, j) = \max\left\{F(i-1, j) - d,\; F(i, j-1) - d,\; F(i-1, j-1) + s(S_i, T_j)\right\} \nonumber\]
where \(s(S_i, T_j)\) is the similarity score of comparing amino acid \(S_i\) to amino acid \(T_j\) (obtained here from the BLOSUM40 similarity table) and \(d\) is the penalty for a single gap. The matrix is initialized with \(F_{0,0} = 0\). When obtaining the local Smith-Waterman alignment, the rule is modified:
\[F(i, j) = \max\left\{0,\; F(i-1, j) - d,\; F(i, j-1) - d,\; F(i-1, j-1) + s(S_i, T_j)\right\} \nonumber\]
The gap penalty can be modified: for instance, the constant penalty for a single gap can be replaced by one that grows with \(g\), the number of consecutive gaps.
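A sketch of the local-alignment scoring variant above, with the zero floor in place; the similarity function passed in is a stand-in for the BLOSUM40 lookup, and the test scores (+2 match, −1 mismatch, gap penalty 2) are hypothetical:

```python
def smith_waterman_score(a, b, s, d=8):
    """Best local alignment score: like the global recurrence, but every
    cell is floored at zero and the optimum may appear anywhere in H.
    `s` is a similarity function (e.g. a BLOSUM lookup); d is the gap penalty."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay zero
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # match/mutation
                          H[i - 1][j] - d,                          # gap move
                          H[i][j - 1] - d)                          # gap move
            best = max(best, H[i][j])
    return best
```

The zero floor lets an alignment restart anywhere, which is what makes the result local rather than global.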
Once the optimal alignment score is found, the ``traceback'' along the optimal path is performed, which corresponds to the optimal sequence alignment for that score. In the next set of exercises you will manually implement the Needleman-Wunsch alignment for a pair of short sequences, then perform global sequence alignments with a computer program developed by Anurag Sethi, which is based on the Needleman-Wunsch algorithm with an affine gap penalty, where an extension gap penalty is charged for each gap position after the first. The output file will be in the GCG format, one of the two standard formats in bioinformatics for storing sequence information (the other standard format is FASTA).
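The affine gap cost mentioned here can be written as a small helper. The particular values d = 8 (opening) and e = 1 (extension), and the form d + e(g − 1), are illustrative assumptions, not the program's actual parameters:

```python
def affine_gap_penalty(g, d=8, e=1):
    """Cost of a run of g consecutive gaps: d to open, e for each extension.
    d and e are placeholder values for illustration."""
    if g == 0:
        return 0
    return d + e * (g - 1)
```

Because extensions are cheaper than openings, this scheme prefers one long gap over several scattered single gaps of the same total length.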
Manually perform a Needleman-Wunsch alignment
In the first exercise you will test the Smith-Waterman algorithm on short sequences taken from hemoglobin (PDB code 1AOW) and myoglobin (PDB code 1AZI).
Here you will align the sequence HGSAQVKGHG to the sequence KTEAEMKASEDLKKHGT.
         H    G    S    A    Q    V    K    G    H    G
     0  -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
K   -8
T  -16
E  -24
A  -32
E  -40
M  -48
K  -56
A  -64
S  -72
E  -80
D  -88
L  -96
K -104
K -112
H -120
G -128
T -136
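The first row and column of the worksheet can be checked programmatically; this snippet simply reproduces the gap-penalty initialization for the two exercise sequences:

```python
# Sequences from the exercise; the gap penalty is 8
S1 = "HGSAQVKGHG"
S2 = "KTEAEMKASEDLKKHGT"
d = 8

first_row = [-d * j for j in range(len(S1) + 1)]  # across the top: 0, -8, ..., -80
first_col = [-d * i for i in range(len(S2) + 1)]  # down the side: 0, -8, ..., -136
```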
The first step is to fill in the similarity scores by looking up the matches in the BLOSUM40 table, shown here labeled with 1-letter amino acid codes:
We fill in the BLOSUM40 similarity scores for you in Table 4, using the convention that alignment score values appear in the top part of a square in large print, and similarity values appear in the bottom part of a square in small print. Our gap penalty is 8.
Note: we consider one cell to be the ``predecessor'' of another if it helped decide the other's value. Later, predecessors will qualify to be on the traceback path.
Table 4: Alignment score worksheet. In all alignment boxes, the similarity score from the BLOSUM40 matrix lookup is supplied (small text, bottom of square). Four alignment scores are provided as examples (large text, top of square); try to calculate at least four more, following the directions provided in the text.